Paperid: 1, https://arxiv.org/pdf/2502.21321.pdf   GitHub GitHub
Authors:Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H. S. Torr, Fahad Shahbaz Khan, Salman Khan
Title: LLM Post-Training: A Deep Dive into Reasoning Large Language Models
Abstract:
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Pretraining on vast web-scale data has laid the foundation for these models, yet the research community is now increasingly shifting focus toward post-training techniques to achieve further breakthroughs. While pretraining provides a broad linguistic foundation, post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations. Fine-tuning, reinforcement learning, and test-time scaling have emerged as critical strategies for optimizing LLMs performance, ensuring robustness, and improving adaptability across various real-world tasks. This survey provides a systematic exploration of post-training methodologies, analyzing their role in refining LLMs beyond pretraining, addressing key challenges such as catastrophic forgetting, reward hacking, and inference-time trade-offs. We highlight emerging directions in model alignment, scalable adaptation, and inference-time reasoning, and outline future research directions. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/mbzuai-oryx/Awesome-LLM-Post-training.
中文摘要:本综述系统探讨了大型语言模型的后训练方法,这些方法在预训练基础上通过增强推理能力、事实准确性和伦理对齐来优化模型性能,同时解决了灾难性遗忘等关键挑战并展望了未来研究方向。
English Summary: This survey systematically examines post-training methods that refine Large Language Models beyond pretraining by enhancing reasoning, factual accuracy, and ethical alignment, while addressing challenges like catastrophic forgetting and outlining future research directions.

Authors:Dingyi Zhang, Deyu Zhou
Title: Persuasion Should be Double-Blind: A Multi-Domain Dialogue Dataset With Faithfulness Based on Causal Theory of Mind
Abstract:
Persuasive dialogue plays a pivotal role in human communication, influencing various domains. Recent persuasive dialogue datasets often fail to align with real-world interpersonal interactions, leading to unfaithful representations. For instance, unrealistic scenarios may arise, such as when the persuadee explicitly instructs the persuader on which persuasion strategies to employ, with each of the persuadee's questions corresponding to a specific strategy for the persuader to follow. This issue can be attributed to a violation of the "Double Blind" condition, where critical information is fully shared between participants. In actual human interactions, however, key information such as the mental state of the persuadee and the persuasion strategies of the persuader is not directly accessible. The persuader must infer the persuadee's mental state using Theory of Mind capabilities and construct arguments that align with the persuadee's motivations. To address this gap, we introduce ToMMA, a novel multi-agent framework for dialogue generation that is guided by causal Theory of Mind. This framework ensures that information remains undisclosed between agents, preserving "double-blind" conditions, while causal ToM directs the persuader's reasoning, enhancing alignment with human-like persuasion dynamics. Consequently, we present CToMPersu, a multi-domain, multi-turn persuasive dialogue dataset that tackles both double-blind and logical coherence issues, demonstrating superior performance across multiple metrics and achieving better alignment with real human dialogues. Our dataset and prompts are available at https://github.com/DingyiZhang/ToMMA-CToMPersu .
Chinese Summary: ToMMA框架采用因果心智理论构建多智能体说服对话系统,通过保持双盲条件增强真实性,其CToMPersu数据集在模拟人类对话方面优于现有基准。
English Summary: The ToMMA framework introduces a multi-agent persuasive dialogue system using causal Theory of Mind to maintain double-blind conditions and improve realism, accompanied by the CToMPersu dataset that outperforms existing benchmarks in mimicking human interactions.

Authors:Xueyun Tian, Wei Li, Bingbing Xu, Yige Yuan, Yuanzhuo Wang, Huawei Shen
Title: MIGE: Mutually Enhanced Multimodal Instruction-Based Image Generation and Editing
Abstract:
Despite significant progress in diffusion-based image generation, subject-driven generation and instruction-based editing remain challenging. Existing methods typically treat them separately, struggling with limited high-quality data and poor generalization. However, both tasks require capturing complex visual variations while maintaining consistency between inputs and outputs. Inspired by this, we propose MIGE, a unified framework that standardizes task representations using multimodal instructions. It first treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image, establishing a shared input-output formulation, then introduces a novel multimodal encoder that maps free-form multimodal instructions into a unified vision-language space, integrating visual and semantic features through a feature fusion mechanism. This unification enables joint training of both tasks, providing two key advantages: (1) Cross-Task Enhancement: by leveraging shared visual and semantic representations, joint training improves instruction adherence and visual consistency in both subject-driven generation and instruction-based editing. (2) Generalization: learning in a unified format facilitates cross-task knowledge transfer, enabling MIGE to generalize to novel compositional tasks, including instruction-based subject-driven editing. Experiments show that MIGE excels in both subject-driven generation and instruction-based editing while setting a SOTA in the new task of instruction-based subject-driven editing. Code and model have been publicly available at https://github.com/Eureka-Maggie/MIGE.
Chinese: MIGE是一个通过多模态指令统一处理主体驱动生成和指令编辑的框架,联合训练提升了任务表现并实现了跨任务泛化能力。
English: MIGE is a unified framework that standardizes subject-driven generation and instruction-based editing through multimodal instructions, enabling joint training for enhanced performance and generalization across tasks.

Authors:Menghua Wu, Russell Littman, Jacob Levine, Lin Qiu, Tommaso Biancalani, David Richmond, Jan-Christian Huetter
Title: Contextualizing biological perturbation experiments through language
Abstract:
High-content perturbation experiments allow scientists to probe biomolecular systems at unprecedented resolution, but experimental and analysis costs pose significant barriers to widespread adoption. Machine learning has the potential to guide efficient exploration of the perturbation space and extract novel insights from these data. However, current approaches neglect the semantic richness of the relevant biology, and their objectives are misaligned with downstream biological analyses. In this paper, we hypothesize that large language models (LLMs) present a natural medium for representing complex biological relationships and rationalizing experimental outcomes. We propose PerturbQA, a benchmark for structured reasoning over perturbation experiments. Unlike current benchmarks that primarily interrogate existing knowledge, PerturbQA is inspired by open problems in perturbation modeling: prediction of differential expression and change of direction for unseen perturbations, and gene set enrichment. We evaluate state-of-the-art machine learning and statistical approaches for modeling perturbations, as well as standard LLM reasoning strategies, and we find that current methods perform poorly on PerturbQA. As a proof of feasibility, we introduce Summer (SUMMarize, retrievE, and answeR, a simple, domain-informed LLM framework that matches or exceeds the current state-of-the-art. Our code and data are publicly available at https://github.com/genentech/PerturbQA.
中文: PerturbQA是一个专为提升扰动实验中结构化推理能力而设计的新基准,通过利用大型语言模型更好地捕捉生物学语义,解决了现有方法在预测未知扰动方面的不足,并显著提高了分析准确性。
English: PerturbQA is a new benchmark designed to enhance machine learning models' structured reasoning in perturbation experiments, addressing current limitations by leveraging large language models to better capture biological semantics and improve predictive accuracy for unseen perturbations.

Authors:Li Yang, Mirna El Rajab, Abdallah Shami, Sami Muhaidat
Title: Enabling AutoML for Zero-Touch Network Security: Use-Case Driven Analysis
Abstract:
Zero-Touch Networks (ZTNs) represent a state-of-the-art paradigm shift towards fully automated and intelligent network management, enabling the automation and intelligence required to manage the complexity, scale, and dynamic nature of next-generation (6G) networks. ZTNs leverage Artificial Intelligence (AI) and Machine Learning (ML) to enhance operational efficiency, support intelligent decision-making, and ensure effective resource allocation. However, the implementation of ZTNs is subject to security challenges that need to be resolved to achieve their full potential. In particular, two critical challenges arise: the need for human expertise in developing AI/ML-based security mechanisms, and the threat of adversarial attacks targeting AI/ML models. In this survey paper, we provide a comprehensive review of current security issues in ZTNs, emphasizing the need for advanced AI/ML-based security mechanisms that require minimal human intervention and protect AI/ML models themselves. Furthermore, we explore the potential of Automated ML (AutoML) technologies in developing robust security solutions for ZTNs. Through case studies, we illustrate practical approaches to securing ZTNs against both conventional and AI/ML-specific threats, including the development of autonomous intrusion detection systems and strategies to combat Adversarial ML (AML) attacks. The paper concludes with a discussion of the future research directions for the development of ZTN security approaches.
中文: 零接触网络(ZTNs)利用人工智能和机器学习实现网络管理自动化,但面临安全挑战,包括开发AI/ML安全机制需人力参与及对抗性攻击威胁,本综述通过回顾现有问题并探索AutoML技术来寻求解决方案。
English: Zero-Touch Networks (ZTNs) utilize AI and ML to automate network management but face security challenges, including the need for human expertise in developing AI/ML security mechanisms and threats from adversarial attacks, which this survey addresses by reviewing current issues and exploring AutoML for robust solutions.

Authors:Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, Qixiang Ye
Title: Adaptive Keyframe Sampling for Long Video Understanding
Abstract:
Multimodal large language models (MLLMs) have enabled open-world visual understanding by injecting visual input as extra tokens into large language models (LLMs) as contexts. However, when the visual input changes from a single image to a long video, the above paradigm encounters difficulty because the vast amount of video tokens has significantly exceeded the maximal capacity of MLLMs. Therefore, existing video-based MLLMs are mostly established upon sampling a small portion of tokens from input data, which can cause key information to be lost and thus produce incorrect answers. This paper presents a simple yet effective algorithm named Adaptive Keyframe Sampling (AKS). It inserts a plug-and-play module known as keyframe selection, which aims to maximize the useful information with a fixed number of video tokens. We formulate keyframe selection as an optimization involving (1) the relevance between the keyframes and the prompt, and (2) the coverage of the keyframes over the video, and present an adaptive algorithm to approximate the best solution. Experiments on two long video understanding benchmarks validate that Adaptive Keyframe Sampling improves video QA accuracy (beyond strong baselines) upon selecting informative keyframes. Our study reveals the importance of information pre-filtering in video-based MLLMs. Code is available at https://github.com/ncTimTang/AKS.
中文: 本文提出自适应关键帧采样算法(AKS),通过优化关键帧与提示的相关性及视频覆盖范围,有效提升多模态大语言模型在长视频理解中的问答准确率。
English: This paper introduces Adaptive Keyframe Sampling (AKS), a plug-and-play module that optimizes video token selection by balancing relevance to prompts and video coverage, thereby enhancing long video understanding accuracy in multimodal large language models.

Authors:Aleksandr Nesterov, Andrey Sakhovskiy, Ivan Sviridov, Airat Valiev, Vladimir Makharev, Petr Anokhin, Galina Zubkova, Elena Tutubalina
Title: RuCCoD: Towards Automated ICD Coding in Russian
Abstract:
This study investigates the feasibility of automating clinical coding in Russian, a language with limited biomedical resources. We present a new dataset for ICD coding, which includes diagnosis fields from electronic health records (EHRs) annotated with over 10,000 entities and more than 1,500 unique ICD codes. This dataset serves as a benchmark for several state-of-the-art models, including BERT, LLaMA with LoRA, and RAG, with additional experiments examining transfer learning across domains (from PubMed abstracts to medical diagnosis) and terminologies (from UMLS concepts to ICD codes). We then apply the best-performing model to label an in-house EHR dataset containing patient histories from 2017 to 2021. Our experiments, conducted on a carefully curated test set, demonstrate that training with the automated predicted codes leads to a significant improvement in accuracy compared to manually annotated data from physicians. We believe our findings offer valuable insights into the potential for automating clinical coding in resource-limited languages like Russian, which could enhance clinical efficiency and data accuracy in these contexts. Our code and dataset are available at https://github.com/auto-icd-coding/ruccod.
本研究通过利用新标注的数据集和迁移学习技术,证明在资源有限的俄语中,采用先进模型的自动化临床编码相比人工方法能显著提高准确性。
This study demonstrates that automated clinical coding using advanced models significantly improves accuracy over manual methods for Russian, a language with limited biomedical resources, by leveraging a new annotated dataset and transfer learning techniques.

Authors:Maria Koshkina, James H. Elder
Title: Towards long-term player tracking with graph hierarchies and domain-specific features
Abstract:
In team sports analytics, long-term player tracking remains a challenging task due to player appearance similarity, occlusion, and dynamic motion patterns. Accurately re-identifying players and reconnecting tracklets after extended absences from the field of view or prolonged occlusions is crucial for robust analysis. We introduce SportsSUSHI, a hierarchical graph-based approach that leverages domain-specific features, including jersey numbers, team IDs, and field coordinates, to enhance tracking accuracy. SportsSUSHI achieves high performance on the SoccerNet dataset and a newly proposed hockey tracking dataset. Our hockey dataset, recorded using a stationary camera capturing the entire playing surface, contains long sequences and annotations for team IDs and jersey numbers, making it well-suited for evaluating long-term tracking capabilities. The inclusion of domain-specific features in our approach significantly improves association accuracy, as demonstrated in our experiments. The dataset and code are available at https://github.com/mkoshkina/sports-SUSHI.
中文摘要:SportsSUSHI提出了一种基于分层图的方法,利用球衣号码和队伍标识等特定领域特征来提升团队运动中球员的长期追踪准确性,在足球和冰球数据集上均表现出优异性能。
English Summary: SportsSUSHI introduces a hierarchical graph-based method using domain-specific features like jersey numbers and team IDs to improve long-term player tracking in team sports, demonstrating high accuracy on soccer and hockey datasets.

Authors:Zihan Huang, Xinyu Shi, Zecheng Hao, Tong Bu, Jianhao Ding, Zhaofei Yu, Tiejun Huang
Title: Towards High-performance Spiking Transformers from ANN to SNN Conversion
Abstract:
Spiking neural networks (SNNs) show great potential due to their energy efficiency, fast processing capabilities, and robustness. There are two main approaches to constructing SNNs. Direct training methods require much memory, while conversion methods offer a simpler and more efficient option. However, current conversion methods mainly focus on converting convolutional neural networks (CNNs) to SNNs. Converting Transformers to SNN is challenging because of the presence of non-linear modules. In this paper, we propose an Expectation Compensation Module to preserve the accuracy of the conversion. The core idea is to use information from the previous T time-steps to calculate the expected output at time-step T. We also propose a Multi-Threshold Neuron and the corresponding Parallel Parameter normalization to address the challenge of large time steps needed for high accuracy, aiming to reduce network latency and power consumption. Our experimental results demonstrate that our approach achieves state-of-the-art performance. For example, we achieve a top-1 accuracy of 88.60\% with only a 1\% loss in accuracy using 4 time steps while consuming only 35\% of the original power of the Transformer. To our knowledge, this is the first successful Artificial Neural Network (ANN) to SNN conversion for Spiking Transformers that achieves high accuracy, low latency, and low power consumption on complex datasets. The source codes of the proposed method are available at https://github.com/h-z-h-cell/Transformer-to-SNN-ECMT.
中文: 本文提出期望补偿模块和多阈值神经元,有效将Transformer转换为脉冲神经网络,在保持高精度的同时实现了低延迟和低功耗。
English: This paper introduces an Expectation Compensation Module and a Multi-Threshold Neuron to efficiently convert Transformers into Spiking Neural Networks, achieving high accuracy with low latency and power consumption.

Authors:Baiting Luo, Ava Pettet, Aron Laszka, Abhishek Dubey, Ayan Mukhopadhyay
Title: Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction
Abstract:
Sequential decision-making in high-dimensional continuous action spaces, particularly in stochastic environments, faces significant computational challenges. We explore this challenge in the traditional offline RL setting, where an agent must learn how to make decisions based on data collected through a stochastic behavior policy. We present Latent Macro Action Planner (L-MAP), which addresses this challenge by learning a set of temporally extended macro-actions through a state-conditional Vector Quantized Variational Autoencoder (VQ-VAE), effectively reducing action dimensionality. L-MAP employs a (separate) learned prior model that acts as a latent transition model and allows efficient sampling of plausible actions. During planning, our approach accounts for stochasticity in both the environment and the behavior policy by using Monte Carlo tree search (MCTS). In offline RL settings, including stochastic continuous control tasks, L-MAP efficiently searches over discrete latent actions to yield high expected returns. Empirical results demonstrate that L-MAP maintains low decision latency despite increased action dimensionality. Notably, across tasks ranging from continuous control with inherently stochastic dynamics to high-dimensional robotic hand manipulation, L-MAP significantly outperforms existing model-based methods and performs on-par with strong model-free actor-critic baselines, highlighting the effectiveness of the proposed approach in planning in complex and stochastic environments with high-dimensional action spaces.
中文: 潜在宏动作规划器(L-MAP)通过将连续动作空间离散化为潜在宏动作,并利用蒙特卡洛树搜索处理环境随机性,在离线强化学习的随机控制任务中显著优于现有方法且保持低决策延迟。
English: The Latent Macro Action Planner (L-MAP) addresses computational challenges in stochastic, high-dimensional continuous action spaces by learning temporally extended macro-actions and employing Monte Carlo tree search for efficient planning, significantly outperforming existing methods in offline reinforcement learning tasks.

Authors:Zijian Kang, Yueyang Li, Shengyu Gong, Weiming Zeng, Hongjie Yan, Lingbin Bian, Zhiguo Zhang, Wai Ting Siok, Nizhuan Wang
Title: Hypergraph Multi-Modal Learning for EEG-based Emotion Recognition in Conversation
Abstract:
Emotional Recognition in Conversation (ERC) is valuable for diagnosing health conditions such as autism and depression, and for understanding the emotions of individuals who struggle to express their feelings. Current ERC methods primarily rely on semantic, audio and visual data but face significant challenges in integrating physiological signals such as Electroencephalography (EEG). This research proposes Hypergraph Multi-Modal Learning (Hyper-MML), a novel framework for identifying emotions in conversation. Hyper-MML effectively integrates EEG with audio and video information to capture complex emotional dynamics. Firstly, we introduce an Adaptive Brain Encoder with Mutual-cross Attention (ABEMA) module for processing EEG signals. This module captures emotion-relevant features across different frequency bands and adapts to subject-specific variations through hierarchical mutual-cross attention mechanisms. Secondly, we propose an Adaptive Hypergraph Fusion Module (AHFM) to actively model the higher-order relationships among multi-modal signals in ERC. Experimental results on the EAV and AFFEC datasets demonstrate that our Hyper-MML model significantly outperforms current state-of-the-art methods. The proposed Hyper-MML can serve as an effective communication tool for healthcare professionals, enabling better engagement with patients who have difficulty expressing their emotions. The official implementation codes are available at https://github.com/NZWANG/Hyper-MML.
中文: 本研究提出超图多模态学习框架,通过自适应脑电编码器和融合模块整合脑电与视听信息,显著提升了对话情绪识别的性能,在医疗辅助沟通中具有应用潜力。
English: This study introduces Hypergraph Multi-Modal Learning (Hyper-MML), a novel framework that integrates EEG with audio-visual data through adaptive modules to enhance emotion recognition in conversations, demonstrating superior performance on benchmark datasets and potential healthcare applications.

Authors:Yunfan Lu, Xiaogang Xu, Hao Lu, Yanlin Qian, Pengteng Li, Huizai Yao, Bin Yang, Junyi Li, Qianyi Cai, Weiyu Guo, Hui Xiong
Title: SEE: See Everything Every Time -- Adaptive Brightness Adjustment for Broad Light Range Images via Events
Abstract:
Event cameras, with a high dynamic range exceeding $120dB$, significantly outperform traditional embedded cameras, robustly recording detailed changing information under various lighting conditions, including both low- and high-light situations. However, recent research on utilizing event data has primarily focused on low-light image enhancement, neglecting image enhancement and brightness adjustment across a broader range of lighting conditions, such as normal or high illumination. Based on this, we propose a novel research question: how to employ events to enhance and adaptively adjust the brightness of images captured under broad lighting conditions? To investigate this question, we first collected a new dataset, SEE-600K, consisting of 610,126 images and corresponding events across 202 scenarios, each featuring an average of four lighting conditions with over a 1000-fold variation in illumination. Subsequently, we propose a framework that effectively utilizes events to smoothly adjust image brightness through the use of prompts. Our framework captures color through sensor patterns, uses cross-attention to model events as a brightness dictionary, and adjusts the image's dynamic range to form a broad light-range representation (BLR), which is then decoded at the pixel level based on the brightness prompt. Experimental results demonstrate that our method not only performs well on the low-light enhancement dataset but also shows robust performance on broader light-range image enhancement using the SEE-600K dataset. Additionally, our approach enables pixel-level brightness adjustment, providing flexibility for post-processing and inspiring more imaging applications. The dataset and source code are publicly available at:https://github.com/yunfanLu/SEE.
中文: 事件相机虽能在多种光照下捕捉细节,但研究多集中于弱光增强,为此我们提出了新框架和数据集,以实现广泛光照条件下的自适应亮度调整。
English: Event cameras excel in capturing details across varied lighting but are underutilized beyond low-light enhancement, prompting the development of a framework and dataset for adaptive brightness adjustment in broad conditions.

Authors:Yunfan Lu, Xiaogang Xu, Hao Lu, Yanlin Qian, Pengteng Li, Huizai Yao, Bin Yang, Junyi Li, Qianyi Cai, Weiyu Guo, Hui Xiong
Title: SEE: See Everything Every Time -- Adaptive Brightness Adjustment for Broad Light Range Images via Events
Abstract:
Event cameras, with a high dynamic range exceeding $120dB$, significantly outperform traditional embedded cameras, robustly recording detailed changing information under various lighting conditions, including both low- and high-light situations. However, recent research on utilizing event data has primarily focused on low-light image enhancement, neglecting image enhancement and brightness adjustment across a broader range of lighting conditions, such as normal or high illumination. Based on this, we propose a novel research question: how to employ events to enhance and adaptively adjust the brightness of images captured under broad lighting conditions? To investigate this question, we first collected a new dataset, SEE-600K, consisting of 610,126 images and corresponding events across 202 scenarios, each featuring an average of four lighting conditions with over a 1000-fold variation in illumination. Subsequently, we propose a framework that effectively utilizes events to smoothly adjust image brightness through the use of prompts. Our framework captures color through sensor patterns, uses cross-attention to model events as a brightness dictionary, and adjusts the image's dynamic range to form a broad light-range representation (BLR), which is then decoded at the pixel level based on the brightness prompt. Experimental results demonstrate that our method not only performs well on the low-light enhancement dataset but also shows robust performance on broader light-range image enhancement using the SEE-600K dataset. Additionally, our approach enables pixel-level brightness adjustment, providing flexibility for post-processing and inspiring more imaging applications. The dataset and source code are publicly available at: https://github.com/yunfanLu/SEE.
中文: 事件相机虽能在多种光照下捕捉细节,但研究多集中于弱光增强,为此我们提出了新框架和数据集,以实现广泛光照条件下的自适应亮度调整。
English: Event cameras excel in capturing details across varied lighting but are underutilized beyond low-light enhancement, prompting the development of a framework and dataset for adaptive brightness adjustment in broad conditions.

Authors:Marina D'Amato, Jeroen van der Laak, Francesco Ciompi
Title: "No negatives needed": weakly-supervised regression for interpretable tumor detection in whole-slide histopathology images
Abstract:
Accurate tumor detection in digital pathology whole-slide images (WSIs) is crucial for cancer diagnosis and treatment planning. Multiple Instance Learning (MIL) has emerged as a widely used approach for weakly-supervised tumor detection with large-scale data without the need for manual annotations. However, traditional MIL methods often depend on classification tasks that require tumor-free cases as negative examples, which are challenging to obtain in real-world clinical workflows, especially for surgical resection specimens. We address this limitation by reformulating tumor detection as a regression task, estimating tumor percentages from WSIs, a clinically available target across multiple cancer types. In this paper, we provide an analysis of the proposed weakly-supervised regression framework by applying it to multiple organs, specimen types and clinical scenarios. We characterize the robustness of our framework to tumor percentage as a noisy regression target, and introduce a novel concept of amplification technique to improve tumor detection sensitivity when learning from small tumor regions. Finally, we provide interpretable insights into the model's predictions by analyzing visual attention and logit maps. Our code is available at https://github.com/DIAGNijmegen/tumor-percentage-mil-regression.
中文: 本研究提出了一种弱监督回归框架,通过估算临床可用的肿瘤百分比来检测全切片图像中的肿瘤,无需无肿瘤样本,并利用放大技术提高检测灵敏度。
English: This study introduces a weakly-supervised regression framework for tumor detection in whole-slide images, eliminating the need for tumor-free cases by estimating clinically available tumor percentages and enhancing sensitivity through amplification techniques.

Authors:Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, Yulan He
Title: CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
Abstract:
Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs) by encouraging step-by-step reasoning in natural language. However, leveraging a latent continuous space for reasoning may offer benefits in terms of both efficiency and robustness. Prior implicit CoT methods attempt to bypass language completely by reasoning in continuous space but have consistently underperformed compared to the standard explicit CoT approach. We introduce CODI (Continuous Chain-of-Thought via Self-Distillation), a novel training framework that effectively compresses natural language CoT into continuous space. CODI jointly trains a teacher task (Explicit CoT) and a student task (Implicit CoT), distilling the reasoning ability from language into continuous space by aligning the hidden states of a designated token. Our experiments show that CODI is the first implicit CoT approach to match the performance of explicit CoT on GSM8k at the GPT-2 scale, achieving a 3.1x compression rate and outperforming the previous state-of-the-art by 28.2% in accuracy. CODI also demonstrates robustness, generalizable to complex datasets, and interpretability. These results validate that LLMs can reason effectively not only in natural language, but also in a latent continuous space. Code is available at https://github.com/zhenyi4/codi.
中文摘要:CODI提出了一种新颖的训练框架,能够将自然语言思维链推理有效压缩至连续空间,在保持与显式方法相当性能的同时,展现出更优的效率、鲁棒性和可解释性。
English Summary: CODI introduces a novel training framework that effectively compresses natural language chain-of-thought reasoning into continuous space, achieving comparable performance to explicit methods while demonstrating superior efficiency, robustness, and interpretability.

Authors:Jingru Fu, Yuqi Zheng, Neel Dey, Daniel Ferreira, Rodrigo Moreno
Title: Synthesizing Individualized Aging Brains in Health and Disease with Generative Models and Parallel Transport
Abstract:
Simulating prospective magnetic resonance imaging (MRI) scans from a given individual brain image is challenging, as it requires accounting for canonical changes in aging and/or disease progression while also considering the individual brain's current status and unique characteristics. While current deep generative models can produce high-resolution anatomically accurate templates for population-wide studies, their ability to predict future aging trajectories for individuals remains limited, particularly in capturing subject-specific neuroanatomical variations over time. In this study, we introduce Individualized Brain Synthesis (InBrainSyn), a framework for synthesizing high-resolution subject-specific longitudinal MRI scans that simulate neurodegeneration in both Alzheimer's disease (AD) and normal aging. InBrainSyn uses a parallel transport algorithm to adapt the population-level aging trajectories learned by a generative deep template network, enabling individualized aging synthesis. As InBrainSyn uses diffeomorphic transformations to simulate aging, the synthesized images are topologically consistent with the original anatomy by design. We evaluated InBrainSyn both quantitatively and qualitatively on AD and healthy control cohorts from the Open Access Series of Imaging Studies - version 3 dataset. Experimentally, InBrainSyn can also model neuroanatomical transitions between normal aging and AD. An evaluation of an external set supports its generalizability. Overall, with only a single baseline scan, InBrainSyn synthesizes realistic 3D spatiotemporal T1w MRI scans, producing personalized longitudinal aging trajectories. The code for InBrainSyn is available at: https://github.com/Fjr9516/InBrainSyn.
中文: InBrainSyn框架通过微分同胚变换调整群体水平的老化轨迹,能够基于单次基线扫描合成个性化的纵向MRI图像,实现阿尔茨海默病和正常衰老中神经退行性变的真实模拟。
English: The InBrainSyn framework synthesizes personalized longitudinal MRI scans by adapting population-level aging trajectories through diffeomorphic transformations, enabling realistic simulation of neurodegeneration in Alzheimer's disease and normal aging from a single baseline scan.

Authors:Chanhui Lee, Yeonghwan Song, Jeany Son
Title: Data-free Universal Adversarial Perturbation with Pseudo-semantic Prior
Abstract:
Data-free Universal Adversarial Perturbation (UAP) is an image-agnostic adversarial attack that deceives deep neural networks using a single perturbation generated solely from random noise without relying on data priors. However, traditional data-free UAP methods often suffer from limited transferability due to the absence of semantic content in random noise. To address this issue, we propose a novel data-free universal attack method that recursively extracts pseudo-semantic priors directly from the UAPs during training to enrich the semantic content within the data-free UAP framework. Our approach effectively leverages latent semantic information within UAPs via region sampling, enabling successful input transformations-typically ineffective in traditional data-free UAP methods due to the lack of semantic cues-and significantly enhancing black-box transferability. Furthermore, we introduce a sample reweighting technique to mitigate potential imbalances from random sampling and transformations, emphasizing hard examples less affected by the UAPs. Comprehensive experiments on ImageNet show that our method achieves state-of-the-art performance in average fooling rate by a substantial margin, notably improves attack transferability across various CNN architectures compared to existing data-free UAP methods, and even surpasses data-dependent UAP methods. Code is available at: https://github.com/ChnanChan/PSP-UAP.
Chinese: 本文提出了一种新颖的无数据通用对抗扰动方法,通过在训练中递归提取扰动中的伪语义先验来增强迁移性,在不依赖数据的情况下实现了跨多种CNN架构的最优性能。
English: This paper introduces a novel data-free universal adversarial perturbation method that enhances transferability by recursively extracting pseudo-semantic priors from perturbations during training, achieving state-of-the-art performance across multiple CNN architectures without relying on data.

Authors:Fangxu Yu, Lai Jiang, Shenyi Huang, Zhen Wu, Xinyu Dai
Title: PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues
Abstract:
The ability to understand and predict the mental states of oneself and others, known as the Theory of Mind (ToM), is crucial for effective social scenarios. Although recent studies have evaluated ToM in Large Language Models (LLMs), existing benchmarks focus on simplified settings (e.g., Sally-Anne-style tasks) and overlook the complexity of real-world social interactions. To mitigate this gap, we propose PersuasiveToM, a benchmark designed to evaluate the ToM abilities of LLMs in persuasive dialogues. Our framework contains two core tasks: ToM Reasoning, which tests tracking of evolving desires, beliefs, and intentions; and ToM Application, which assesses the use of inferred mental states to predict and evaluate persuasion strategies. Experiments across eight leading LLMs reveal that while models excel on multiple questions, they struggle with the tasks that need tracking the dynamics and shifts of mental states and understanding the mental states in the whole dialogue comprehensively. Our aim with PersuasiveToM is to allow an effective evaluation of the ToM reasoning ability of LLMs with more focus on complex psychological activities. Our code is available at https://github.com/Yu-Fangxu/PersuasiveToM.
Chinese: PersuasiveToM基准测试通过说服性对话评估大语言模型的心理理论能力,发现尽管模型在简单任务上表现出色,但在追踪动态心理状态方面仍存在明显不足。
English: The PersuasiveToM benchmark evaluates large language models' Theory of Mind abilities in persuasive dialogues, revealing their limitations in tracking dynamic mental states despite strong performance on simpler tasks.

Authors:Junchao Zhu, Ruining Deng, Tianyuan Yao, Juming Xiong, Chongyu Qu, Junlin Guo, Siqi Lu, Yucheng Tang, Daguang Xu, Mengmeng Yin, Yu Wang, Shilin Zhao, Yaohong Wang, Haichun Yang, Yuankai Huo
Title: MagNet: Multi-Level Attention Graph Network for Predicting High-Resolution Spatial Transcriptomics
Abstract:
The rapid development of spatial transcriptomics (ST) offers new opportunities to explore the gene expression patterns within the spatial microenvironment. Current research integrates pathological images to infer gene expression, addressing the high costs and time-consuming processes to generate spatial transcriptomics data. However, as spatial transcriptomics resolution continues to improve, existing methods remain primarily focused on gene expression prediction at low-resolution spot levels. These methods face significant challenges, especially the information bottleneck, when they are applied to high-resolution HD data. To bridge this gap, this paper introduces MagNet, a multi-level attention graph network designed for accurate prediction of high-resolution HD data. MagNet employs cross-attention layers to integrate features from multi-resolution image patches hierarchically and utilizes a GAT-Transformer module to aggregate neighborhood information. By integrating multilevel features, MagNet overcomes the limitations posed by low-resolution inputs in predicting high-resolution gene expression. We systematically evaluated MagNet and existing ST prediction models on both a private spatial transcriptomics dataset and a public dataset at three different resolution levels. The results demonstrate that MagNet achieves state-of-the-art performance at both spot level and high-resolution bin levels, providing a novel methodology and benchmark for future research and applications in high-resolution HD-level spatial transcriptomics. Code is available at https://github.com/Junchao-Zhu/MagNet.
中文: 本文提出MagNet多层注意力图网络,通过整合多分辨率图像特征和邻域信息克服高分辨率空间转录组学的信息瓶颈,在不同分辨率级别均实现了最先进的性能。
English: This paper introduces MagNet, a multi-level attention graph network that overcomes the information bottleneck in high-resolution spatial transcriptomics by integrating multi-resolution image features and neighborhood information, achieving state-of-the-art performance across different resolution levels.

Authors:Woo Kyoung Han, Byeonghun Lee, Hyunmin Cho, Sunghoon Im, Kyong Hwan Jin
Title: Towards Lossless Implicit Neural Representation via Bit Plane Decomposition
Abstract:
We quantify the upper bound on the size of the implicit neural representation (INR) model from a digital perspective. The upper bound of the model size increases exponentially as the required bit-precision increases. To this end, we present a bit-plane decomposition method that makes INR predict bit-planes, producing the same effect as reducing the upper bound of the model size. We validate our hypothesis that reducing the upper bound leads to faster convergence with constant model size. Our method achieves lossless representation in 2D image and audio fitting, even for high bit-depth signals, such as 16-bit, which was previously unachievable. We pioneered the presence of bit bias, which INR prioritizes as the most significant bit (MSB). We expand the application of the INR task to bit depth expansion, lossless image compression, and extreme network quantization. Our source code is available at https://github.com/WooKyoungHan/LosslessINR
中文: 本研究提出了一种位平面分解方法,通过降低隐式神经表示模型大小的上限,实现了高比特深度信号的无损表示,并拓展了在比特深度扩展和压缩等领域的应用。
English: This study introduces a bit-plane decomposition method that reduces the upper bound on implicit neural representation (INR) model size, enabling lossless representation for high bit-depth signals and expanding applications to bit depth expansion and compression.

Authors:Xiusheng Huang, Jiaxiang Liu, Yequan Wang, Jun Zhao, Kang Liu
Title: Capability Localization: Capabilities Can be Localized rather than Individual Knowledge
Abstract:
Large scale language models have achieved superior performance in tasks related to natural language processing, however, it is still unclear how model parameters affect performance improvement. Previous studies assumed that individual knowledge is stored in local parameters, and the storage form of individual knowledge is dispersed parameters, parameter layers, or parameter chains, which are not unified. We found through fidelity and reliability evaluation experiments that individual knowledge cannot be localized. Afterwards, we constructed a dataset for decoupling experiments and discovered the potential for localizing data commonalities. To further reveal this phenomenon, this paper proposes a Commonality Neuron Localization (CNL) method, which successfully locates commonality neurons and achieves a neuron overlap rate of 96.42% on the GSM8K dataset. Finally, we have demonstrated through cross data experiments that commonality neurons are a collection of capability neurons that possess the capability to enhance performance. Our code is available at https://github.com/nlpkeg/Capability-Neuron-Localization.
中文: 本研究挑战了关于大语言模型中个体知识存储于局部参数的假设,提出共性神经元定位方法,在GSM8K数据集上成功定位共性神经元并达到96.42%的重叠率,验证了这些神经元作为能力神经元集合对性能提升的作用。
English: This study challenges the assumption that individual knowledge is stored in localized parameters of large language models, proposing a Commonality Neuron Localization method that successfully identifies shared capability neurons with a 96.42% overlap rate on GSM8K, demonstrating their role in performance enhancement.

Authors:Thanet Markchom, Tong Wu, Liting Huang, Huizhi Liang
Title: UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation
Abstract:
SemEval-2025 Task 1 focuses on ranking images based on their alignment with a given nominal compound that may carry idiomatic meaning in both English and Brazilian Portuguese. To address this challenge, this work uses generative large language models (LLMs) and multilingual CLIP models to enhance idiomatic compound representations. LLMs generate idiomatic meanings for potentially idiomatic compounds, enriching their semantic interpretation. These meanings are then encoded using multilingual CLIP models, serving as representations for image ranking. Contrastive learning and data augmentation techniques are applied to fine-tune these embeddings for improved performance. Experimental results show that multimodal representations extracted through this method outperformed those based solely on the original nominal compounds. The fine-tuning approach shows promising outcomes but is less effective than using embeddings without fine-tuning. The source code used in this paper is available at https://github.com/tongwu17/SemEval-2025-Task1-UoR-NCL.
中文: 本研究通过使用大语言模型生成习语含义和多语言CLIP模型进行编码,提升了图像与习语复合词的匹配排序效果,实验表明多模态表征优于仅基于原始复合词的方法。
English: This study enhances image ranking for idiomatic compounds by using LLMs to generate meanings and multilingual CLIP models for encoding, with results showing improved performance through multimodal representations over original compounds.

Authors:Yujie Li, Xiangkun Wang, Xin Yang, Marcello Bonsangue, Junbo Zhang, Tianrui Li
Title: Improving Open-world Continual Learning under the Constraints of Scarce Labeled Data
Abstract:
Open-world continual learning (OWCL) adapts to sequential tasks with open samples, learning knowledge incrementally while preventing forgetting. However, existing OWCL still requires a large amount of labeled data for training, which is often impractical in real-world applications. Given that new categories/entities typically come with limited annotations and are in small quantities, a more realistic situation is OWCL with scarce labeled data, i.e., few-shot training samples. Hence, this paper investigates the problem of open-world few-shot continual learning (OFCL), challenging in (i) learning unbounded tasks without forgetting previous knowledge and avoiding overfitting, (ii) constructing compact decision boundaries for open detection with limited labeled data, and (iii) transferring knowledge about knowns and unknowns and even update the unknowns to knowns once the labels of open samples are learned. In response, we propose a novel OFCL framework that integrates three key components: (1) an instance-wise token augmentation (ITA) that represents and enriches sample representations with additional knowledge, (2) a margin-based open boundary (MOB) that supports open detection with new tasks emerge over time, and (3) an adaptive knowledge space (AKS) that endows unknowns with knowledge for the updating from unknowns to knowns. Finally, extensive experiments show that the proposed OFCL framework outperforms all baselines remarkably with practical importance and reproducibility. The source code is released at https://github.com/liyj1201/OFCL.
中文: 本文提出了一种新颖的开放世界小样本持续学习框架,通过实例化标记增强、基于边界的开放检测和自适应知识空间,解决了有限标注数据下的持续学习与知识迁移难题,实验证明其性能显著优于现有方法。
English: This paper introduces a novel framework for open-world few-shot continual learning (OFCL), addressing the challenges of learning from limited labeled data while preventing forgetting and enabling knowledge transfer through instance-wise token augmentation, margin-based open boundary, and adaptive knowledge space, with experimental results showing superior performance over existing methods.

Authors:Xue Yang, Tao Chen, Lei Guo, Wenbo Jiang, Ji Guo, Yongming Li, Jiaming He
Title: BadRefSR: Backdoor Attacks Against Reference-based Image Super Resolution
Abstract:
Reference-based image super-resolution (RefSR) represents a promising advancement in super-resolution (SR). In contrast to single-image super-resolution (SISR), RefSR leverages an additional reference image to help recover high-frequency details, yet its vulnerability to backdoor attacks has not been explored. To fill this research gap, we propose a novel attack framework called BadRefSR, which embeds backdoors in the RefSR model by adding triggers to the reference images and training with a mixed loss function. Extensive experiments across various backdoor attack settings demonstrate the effectiveness of BadRefSR. The compromised RefSR network performs normally on clean input images, while outputting attacker-specified target images on triggered input images. Our study aims to alert researchers to the potential backdoor risks in RefSR. Codes are available at https://github.com/xuefusiji/BadRefSR.
中文:BadRefSR提出了一种针对参考图像超分辨率模型的后门攻击框架,通过在参考图像中嵌入触发器来操控输出结果,同时在干净输入上保持正常性能,揭示了该技术潜在的安全隐患。
English: BadRefSR introduces a backdoor attack framework for reference-based image super-resolution models, embedding triggers in reference images to manipulate outputs while maintaining normal performance on clean inputs, highlighting security risks in RefSR systems.

Authors:Shawxing Kwok
Title: A Faster Algorithm for Maximum Weight Matching on Unrestricted Bipartite Graphs
Abstract:
Given a weighted bipartite graph $G = (L, R, E, w)$, the maximum weight matching (MWM) problem seeks to find a matching $M \subseteq E$ that maximizes the total weight $\sum_{e \in M} w(e)$. This paper presents a novel algorithm with a time complexity of $O(\min(X^3 + E, XE + X^2\log X))$, where $X = \min(|L|, |R|)$. Unlike many existing algorithms, our approach supports real-valued weights without additional constraints. Under this condition, our result improves upon the previous best-known bound of $O(VE + V^2\log V)$, or more strictly $O(XE + XV\log V)$, where $V = L \cup R$. The suggested implementation code is simplified and publicly available at https://github.com/ShawxingKwok/Kwok-algorithm, with the average-case time complexity of $O(E^{1.4} + LR)$ estimated from experimental results on random graphs.
中文: 本文针对加权二分图的最大权重匹配问题提出了一种新算法,其时间复杂度为 \(O(\min(X^3 + E, XE + X^2\log X))\),其中 \(X = \min(|L|, |R|)\),该算法在支持无约束实数值权重的同时,改进了先前的最佳已知时间复杂度。
English: This paper introduces a novel algorithm for the maximum weight matching problem in weighted bipartite graphs, achieving a time complexity of \(O(\min(X^3 + E, XE + X^2\log X))\) with \(X = \min(|L|, |R|)\), which improves upon previous bounds and supports real-valued weights without constraints.

Authors:Shaoming Li, Qing Cai, Songqi Kong, Runqing Tan, Heng Tong, Shiji Qiu, Yongguo Jiang, Zhi Liu
Title: MESC-3D:Mining Effective Semantic Cues for 3D Reconstruction from a Single Image
Abstract:
Reconstructing 3D shapes from a single image plays an important role in computer vision. Many methods have been proposed and achieve impressive performance. However, existing methods mainly focus on extracting semantic information from images and then simply concatenating it with 3D point clouds without further exploring the concatenated semantics. As a result, these entangled semantic features significantly hinder the reconstruction performance. In this paper, we propose a novel single-image 3D reconstruction method called Mining Effective Semantic Cues for 3D Reconstruction from a Single Image (MESC-3D), which can actively mine effective semantic cues from entangled features. Specifically, we design an Effective Semantic Mining Module to establish connections between point clouds and image semantic attributes, enabling the point clouds to autonomously select the necessary information. Furthermore, to address the potential insufficiencies in semantic information from a single image, such as occlusions, inspired by the human ability to represent 3D objects using prior knowledge drawn from daily experiences, we introduce a 3D Semantic Prior Learning Module. This module incorporates semantic understanding of spatial structures, enabling the model to interpret and reconstruct 3D objects with greater accuracy and realism, closely mirroring human perception of complex 3D environments. Extensive evaluations show that our method achieves significant improvements in reconstruction quality and robustness compared to prior works. Additionally, further experiments validate the strong generalization capabilities and excels in zero-shot preformance on unseen classes. Code is available at https://github.com/QINGQINGLE/MESC-3D.
中文: 提出的MESC-3D方法通过从纠缠特征中主动挖掘有效语义线索并引入三维语义先验知识,显著提升了单图像三维重建的质量、鲁棒性和泛化能力,优于现有方法。
English: The proposed MESC-3D method enhances single-image 3D reconstruction by actively mining effective semantic cues from entangled features and incorporating 3D semantic priors, achieving superior quality, robustness, and generalization compared to existing approaches.

Authors:Jonathan Drechsel, Anja Reusch, Steffen Herbold
Title: MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training
Abstract:
Mathematical formulas are a fundamental and widely used component in various scientific fields, serving as a universal language for expressing complex concepts and relationships. While state-of-the-art transformer models excel in processing and understanding natural language, they encounter challenges with mathematical notation, which involves a complex structure and diverse representations. This study focuses on the development of specialized training datasets to enhance the encoding of mathematical content. We introduce Math Mutator (MAMUT), a framework capable of generating equivalent and falsified versions of a given mathematical formula in LaTeX notation, effectively capturing the mathematical variety in notation of the same concept. Based on MAMUT, we have generated four large mathematical datasets containing diverse notation. Experiments show that models trained on these datasets exhibit new SoTA performance on mathematical retrieval tasks. We publish our code, generated datasets, and pretrained mathematical models: https://github.com/aieng-lab/math-mutator.
中文摘要:本研究提出Math Mutator (MAMUT)框架,通过生成多样化数学公式变体构建专业训练数据集,在数学检索任务中实现了最先进的性能表现。
English Summary: This study introduces Math Mutator (MAMUT), a framework that generates diverse mathematical formula variations to create specialized training datasets, achieving state-of-the-art performance in mathematical retrieval tasks.

Authors:Yuxiang Chen, Haocheng Xi, Jun Zhu, Jianfei Chen
Title: Oscillation-Reduced MXFP4 Training for Vision Transformers
Abstract:
Pre-training Transformers in FP4 precision is becoming a promising approach to gain substantial speedup, but it comes with a considerable loss of accuracy. Microscaling (MX) data format provides a fine-grained per-group quantization method to improve the representation ability of the FP4 format and is supported by the next-generation Blackwell GPU architecture. However, training with MXFP4 data format still results in significant degradation and there is a lack of systematic research on the reason. In this work, we propose a novel training method TetraJet for a more accurate FP4 training. We comprehensively evaluate all of the quantizers involved in the training, and identify the weight oscillation problem in the forward pass as the main source of the degradation in MXFP4 training. Therefore, we introduce two novel methods, EMA Quantizer (Q-EMA) and Adaptive Ramping Optimizer (Q-Ramping), to resolve the oscillation problem. Extensive experiments on Vision Transformers demonstrate that TetraJet consistently outperforms the existing 4-bit training methods, and Q-EMA & Q-Ramping can provide additional enhancement by effectively reducing oscillation. We decreased the accuracy degradation by more than $50\%$ compared to the baseline, and can even achieve competitive performance compared to full precision training. The codes are available at https://github.com/thu-ml/TetraJet-MXFP4Training
中文摘要:本研究提出TetraJet新型训练方法,通过识别并利用Q-EMA和Q-Ramping技术解决MXFP4训练中的权重振荡问题,相比基线方法将精度损失降低超过50%,在视觉Transformer上实现了与全精度训练相媲美的性能。
English Summary: The study introduces TetraJet, a novel training method that addresses accuracy degradation in FP4 pre-training by identifying and mitigating weight oscillation through Q-EMA and Q-Ramping techniques, achieving over 50% reduction in accuracy loss compared to baseline methods.

Authors:Ragib Amin Nihal, Benjamin Yen, Runwu Shi, Kazuhiro Nakadai
Title: Weakly Supervised Multiple Instance Learning for Whale Call Detection and Temporal Localization in Long-Duration Passive Acoustic Monitoring
Abstract:
Marine ecosystem monitoring via Passive Acoustic Monitoring (PAM) generates vast data, but deep learning often requires precise annotations and short segments. We introduce DSMIL-LocNet, a Multiple Instance Learning framework for whale call detection and localization using only bag-level labels. Our dual-stream model processes 2-30 minute audio segments, leveraging spectral and temporal features with attention-based instance selection. Tests on Antarctic whale data show longer contexts improve classification (F1: 0.8-0.9) while medium instances ensure localization precision (0.65-0.70). This suggests MIL can enhance scalable marine monitoring. Code: https://github.com/Ragib-Amin-Nihal/DSMIL-Loc
中文:DSMIL-LocNet框架通过多示例学习仅需包级标签即可检测和定位鲸鱼叫声,在南极鲸鱼数据上展现出更好的分类和定位能力,有助于可扩展的海洋监测。
English: The DSMIL-LocNet framework uses multiple instance learning to detect and locate whale calls with only bag-level labels, demonstrating improved classification and localization on Antarctic whale data for scalable marine monitoring.

Authors:Long Chen, Xianchao Xiu
Title: Tuning-Free Structured Sparse PCA via Deep Unfolding Networks
Abstract:
Sparse principal component analysis (PCA) is a well-established dimensionality reduction technique that is often used for unsupervised feature selection (UFS). However, determining the regularization parameters is rather challenging, and conventional approaches, including grid search and Bayesian optimization, not only bring great computational costs but also exhibit high sensitivity. To address these limitations, we first establish a structured sparse PCA formulation by integrating $\ell_1$-norm and $\ell_{2,1}$-norm to capture the local and global structures, respectively. Building upon the off-the-shelf alternating direction method of multipliers (ADMM) optimization framework, we then design an interpretable deep unfolding network that translates iterative optimization steps into trainable neural architectures. This innovation enables automatic learning of the regularization parameters, effectively bypassing the empirical tuning requirements of conventional methods. Numerical experiments on benchmark datasets validate the advantages of our proposed method over the existing state-of-the-art methods. Our code will be accessible at https://github.com/xianchaoxiu/SPCA-Net.
Chinese: 本文提出了一种深度展开网络,能够自动学习稀疏主成分分析的正则化参数,有效避免了传统方法的高计算成本和敏感性,并在基准数据集上验证了其优于现有方法的性能。
English: This paper introduces a deep unfolding network that automatically learns regularization parameters for sparse PCA, overcoming the computational cost and sensitivity of traditional methods while outperforming existing approaches on benchmark datasets.

Authors:Shu Liu, Xiangxi Mo, Moshik Hershcovitch, Henric Zhang, Audrey Cheng, Guy Girmonsky, Gil Vernik, Michael Factor, Tiemo Bang, Soujanya Ponnapalli, Natacha Crooks, Joseph E. Gonzalez, Danny Harnik, Ion Stoica
Title: SkyStore: Cost-Optimized Object Storage Across Regions and Clouds
Abstract:
Modern applications span multiple clouds to reduce costs, avoid vendor lock-in, and leverage low-availability resources in another cloud. However, standard object stores operate within a single cloud, forcing users to manually manage data placement across clouds, i.e., navigate their diverse APIs and handle heterogeneous costs for network and storage. This is often a complex choice: users must either pay to store objects in a remote cloud, or pay to transfer them over the network based on application access patterns and cloud provider cost offerings. To address this, we present SkyStore, a unified object store that addresses cost-optimal data management across regions and clouds. SkyStore introduces a virtual object and bucket API to hide the complexity of interacting with multiple clouds. At its core, SkyStore has a novel TTL-based data placement policy that dynamically replicates and evicts objects according to application access patterns while optimizing for lower cost. Our evaluation shows that across various workloads, SkyStore reduces the overall cost by up to 6x over academic baselines and commercial alternatives like AWS multi-region buckets. SkyStore also has comparable latency, and its availability and fault tolerance are on par with standard cloud offerings. We release the data and code of SkyStore at https://github.com/skyplane-project/skystore.
中文摘要:SkyStore作为一种统一对象存储,通过基于访问模式动态优化数据放置来简化多云数据管理,在保持与标准云服务相当的性能和可用性的同时,显著降低了成本。
English Summary: SkyStore is a unified object store that simplifies cross-cloud data management by dynamically optimizing data placement based on access patterns to significantly reduce costs while maintaining performance and availability comparable to standard cloud services.

Authors:Bach-Thuan Bui, Huy-Hoang Bui, Yasuyuki Fujii, Dinh-Tuan Tran, Joo-Ho Lee
Title: Improved 3D Point-Line Mapping Regression for Camera Relocalization
Abstract:
In this paper, we present a new approach for improving 3D point and line mapping regression for camera re-localization. Previous methods typically rely on feature matching (FM) with stored descriptors or use a single network to encode both points and lines. While FM-based methods perform well in large-scale environments, they become computationally expensive with a growing number of mapping points and lines. Conversely, approaches that learn to encode mapping features within a single network reduce memory footprint but are prone to overfitting, as they may capture unnecessary correlations between points and lines. We propose that these features should be learned independently, each with a distinct focus, to achieve optimal accuracy. To this end, we introduce a new architecture that learns to prioritize each feature independently before combining them for localization. Experimental results demonstrate that our approach significantly enhances the 3D map point and line regression performance for camera re-localization. The implementation of our method will be publicly available at: https://github.com/ais-lab/pl2map/.
中文: 本文提出了一种新架构,通过独立学习点和线特征再整合用于相机重定位,显著提升了3D映射回归性能,有效解决了现有方法计算量大和容易过拟合的问题。
English: This paper introduces a novel architecture that independently learns point and line features before integrating them for camera re-localization, significantly improving 3D mapping regression while overcoming the computational and overfitting limitations of prior methods.

Authors:Kuang-Da Wang, Teng-Ruei Chen, Yu Heng Hung, Guo-Xun Ko, Shuoyang Ding, Yueh-Hua Wu, Yu-Chiang Frank Wang, Chao-Han Huck Yang, Wen-Chih Peng, Ping-Chun Hsieh
Title: Plan2Align: Predictive Planning Based Test-Time Preference Alignment for Large Language Models
Abstract:
Aligning Large Language Models with Preference Fine-Tuning is often resource-intensive. Test-time alignment techniques that do not modify the underlying models, such as prompting and guided decodings, offer a lightweight alternative. However, existing test-time alignment methods primarily improve short responses and fail to ensure coherence over extended contexts due to the myopic nature of token-level alignment. Moreover, these methods often incur a slowdown during inference. To address these challenges, we propose Plan2Align, a test-time alignment framework that formulates text generation as a predictive planning problem. Plan2Align adapts Model Predictive Control (MPC) to iteratively refine output by rolling out multiple complete responses and optimizing each segment. To more rigorously evaluate the effectiveness and efficiency, we focus on the more challenging task of long-text generation. Experiments on the long-form response subset of the HH-RLHF dataset and the WMT'24 Discourse-Level Literary Translation demonstrate that Plan2Align significantly enhances the performance of base LLMs. Compared to existing training-time and test-time alignment methods on LLaMA-3.1 8B, Plan2Align achieves comparable or superior results, while also delivering improved inference efficiency relative to prior test-time alignment approaches.
中文摘要:Plan2Align是一种测试时对齐框架,将文本生成视为预测性规划问题,通过模型预测控制迭代优化输出,在长文本生成任务中相比现有方法展现出更优的性能和效率。
English Summary: Plan2Align is a test-time alignment framework that treats text generation as a predictive planning task, using Model Predictive Control to iteratively refine outputs and demonstrating superior performance and efficiency in long-text generation tasks compared to existing methods.

Authors:Yong Fang
Title: Overlapped Arithmetic Codes
Abstract:
Arithmetic codes are usually deemed as the most important means to implement lossless source coding, whose principle is mapping every source symbol to a sub-interval in [0, 1). For every source symbol, the length of its mapping sub-interval is exactly equal to its probability. With this symbol-interval mapping rule, the interval [0,1) will be fully covered and there is neither overlapped sub-interval (corresponds to more than one source symbol) nor forbidden sub-interval (does not correspond to any source symbol). It is well-known that there is a duality between source coding and channel coding, so every good source code may also be a good channel code meanwhile, and vice versa. Inspired by this duality, arithmetic codes can be easily generalized to address many coding problems beyond source coding by redefining the source-interval mapping rule. If every source symbol is mapped to an enlarged sub-interval, the mapping sub-intervals of different source symbols will be partially overlapped and we obtain overlapped arithmetic codes, which can realize distributed source coding. On the contrary, if every source symbol is mapped to a narrowed sub-interval, there will be one or more forbidden sub-intervals in [0, 1) that do not correspond to any source symbol and we obtain forbidden arithmetic codes, which can implement joint source-channel coding. Furthermore, by allowing the coexistence of overlapped sub-intervals and forbidden sub-intervals, we will obtain hybrid arithmetic codes, which can cope with distributed joint source-channel coding.
中文: 算术编码通过将源符号按概率映射到[0,1)区间内互不重叠的子区间实现无损信源编码,而通过重新定义区间映射规则——使子区间重叠或产生禁区,可将其扩展应用于分布式信源编码和联合信源信道编码。
English: Arithmetic codes map source symbols to non-overlapping sub-intervals in [0,1) based on their probabilities for lossless source coding, and by adjusting these intervals to overlap or create forbidden zones, they can be extended to distributed and joint source-channel coding applications.

Authors:Qiao Yan, Yuchen Yuan, Xiaowei Hu, Yihan Wang, Jiaqi Xu, Jinpeng Li, Chi-Wing Fu, Pheng-Ann Heng
Title: MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models
Abstract:
The increasing use of vision-language models (VLMs) in healthcare applications presents great challenges related to hallucinations, in which the models may generate seemingly plausible results that are in fact incorrect. Such hallucinations can jeopardize clinical decision making, potentially harming the diagnosis and treatments. In this work, we propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs. Comprising over 100,000 images and 1,000,000 instruction pairs, MedHallTune includes both hallucination and non-hallucination samples, each with ground-truth annotations. We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across key metrics, including clinical accuracy, relevance, detail level, and risk level. The experimental results show that fine-tuning with MedHallTune successfully improves the ability of several existing models to manage hallucinations and boost their zero-shot performance on downstream visual-question-answering (VQA) tasks, making them more reliable for practical medical applications. Our work contributes to the development of more trustworthy VLMs. Codes and dataset will be available at \href{https://github.com/russellyq/MedHallTune}{MedHallTune}.
中文: 本研究提出了MedHallTune这一大规模基准,用于评估和减少医学视觉语言模型的幻觉问题,实验表明基于该数据的微调能有效提升模型在医疗应用中的可靠性。
English: This study introduces MedHallTune, a large-scale benchmark to evaluate and reduce hallucinations in medical vision-language models, demonstrating that fine-tuning with it enhances model reliability for healthcare applications.

Authors:Zhaoyang Jia, Bin Li, Jiahao Li, Wenxuan Xie, Linfeng Qi, Houqiang Li, Yan Lu
Title: Towards Practical Real-Time Neural Video Compression
Abstract:
We introduce a practical real-time neural video codec (NVC) designed to deliver high compression ratio, low latency and broad versatility. In practice, the coding speed of NVCs depends on 1) computational costs, and 2) non-computational operational costs, such as memory I/O and the number of function calls. While most efficient NVCs prioritize reducing computational cost, we identify operational cost as the primary bottleneck to achieving higher coding speed. Leveraging this insight, we introduce a set of efficiency-driven design improvements focused on minimizing operational costs. Specifically, we employ implicit temporal modeling to eliminate complex explicit motion modules, and use single low-resolution latent representations rather than progressive downsampling. These innovations significantly accelerate NVC without sacrificing compression quality. Additionally, we implement model integerization for consistent cross-device coding and a module-bank-based rate control scheme to improve practical adaptability. Experiments show our proposed DCVC-RT achieves an impressive average encoding/decoding speed at 125.2/112.8 fps (frames per second) for 1080p video, while saving an average of 21% in bitrate compared to H.266/VTM. The code is available at https://github.com/microsoft/DCVC.
中文摘要:本文提出一种实时神经视频编解码器,通过隐式时间建模和单一低分辨率潜在表示降低操作成本,在1080p视频上实现125.2/112.8 fps的编解码速度,相比H.266/VTM节省21%码率。
English Summary: This paper presents a real-time neural video codec that enhances coding speed by minimizing operational costs through implicit temporal modeling and single low-resolution latent representations, achieving 125.2/112.8 fps for 1080p video while reducing bitrate by 21% compared to H.266/VTM.

Authors:Ben Walters, Yeshwanth Bethi, Taylor Kergan, Binh Nguyen, Amirali Amirsoleimani, Jason K. Eshraghian, Saeed Afshar, Mostafa Rahimi Azghadi
Title: NeuroMorse: A Temporally Structured Dataset For Neuromorphic Computing
Abstract:
Neuromorphic engineering aims to advance computing by mimicking the brain's efficient processing, where data is encoded as asynchronous temporal events. This eliminates the need for a synchronisation clock and minimises power consumption when no data is present. However, many benchmarks for neuromorphic algorithms primarily focus on spatial features, neglecting the temporal dynamics that are inherent to most sequence-based tasks. This gap may lead to evaluations that fail to fully capture the unique strengths and characteristics of neuromorphic systems. In this paper, we present NeuroMorse, a temporally structured dataset designed for benchmarking neuromorphic learning systems. NeuroMorse converts the top 50 words in the English language into temporal Morse code spike sequences. Despite using only two input spike channels for Morse dots and dashes, complex information is encoded through temporal patterns in the data. The proposed benchmark contains feature hierarchy at multiple temporal scales that test the capacity of neuromorphic algorithms to decompose input patterns into spatial and temporal hierarchies. We demonstrate that our training set is challenging to categorise using a linear classifier and that identifying keywords in the test set is difficult using conventional methods. The NeuroMorse dataset is available at Zenodo, with our accompanying code on GitHub at https://github.com/Ben-E-Walters/NeuroMorse.
中文摘要:NeuroMorse提出了一种基于莫尔斯电码脉冲序列的时间结构化数据集,用于评估神经形态系统,旨在解决当前基准测试中忽视时间动态的问题,并检验算法处理复杂时间层次结构的能力。
English Summary: NeuroMorse introduces a temporally structured dataset using Morse code spike sequences to benchmark neuromorphic systems, addressing the current neglect of temporal dynamics in evaluations and testing algorithms' ability to process complex temporal hierarchies.

Authors:Ke Sun, Shen Chen, Taiping Yao, Ziyin Zhou, Jiayi Ji, Xiaoshuai Sun, Chia-Wen Lin, Rongrong Ji
Title: Towards General Visual-Linguistic Face Forgery Detection(V2)
Abstract:
Face manipulation techniques have achieved significant advances, presenting serious challenges to security and social trust. Recent works demonstrate that leveraging multimodal models can enhance the generalization and interpretability of face forgery detection. However, existing annotation approaches, whether through human labeling or direct Multimodal Large Language Model (MLLM) generation, often suffer from hallucination issues, leading to inaccurate text descriptions, especially for high-quality forgeries. To address this, we propose Face Forgery Text Generator (FFTG), a novel annotation pipeline that generates accurate text descriptions by leveraging forgery masks for initial region and type identification, followed by a comprehensive prompting strategy to guide MLLMs in reducing hallucination. We validate our approach through fine-tuning both CLIP with a three-branch training framework combining unimodal and multimodal objectives, and MLLMs with our structured annotations. Experimental results demonstrate that our method not only achieves more accurate annotations with higher region identification accuracy, but also leads to improvements in model performance across various forgery detection benchmarks. Our Codes are available in https://github.com/skJack/VLFFD.git.
中文: 面部伪造文本生成器(FFTG)是一种新颖的标注流程,通过利用伪造掩码和全面的提示策略来减少多模态模型中的幻觉问题,从而提高了面部伪造检测文本描述的准确性,并在多个基准测试中提升了模型性能。
English: The Face Forgery Text Generator (FFTG) is a novel annotation pipeline that enhances the accuracy of text descriptions for face forgery detection by utilizing forgery masks and a comprehensive prompting strategy to reduce hallucination in multimodal models, leading to improved performance across benchmarks.

Authors:Shanshan Wan, Yingmei Wei, Lai Kang, Tianrui Shen, Haixuan Wang, Yee-Hong Yang
Title: SciceVPR: Stable Cross-Image Correlation Enhanced Model for Visual Place Recognition
Abstract:
Visual Place Recognition (VPR) is a major challenge for robotics and autonomous systems, with the goal of predicting the location of an image based solely on its visual features. State-of-the-art (SOTA) models extract global descriptors using the powerful foundation model DINOv2 as backbone. These models either explore the cross-image correlation or propose a time-consuming two-stage re-ranking strategy to achieve better performance. However, existing works only utilize the final output of DINOv2, and the current cross-image correlation causes unstable retrieval results. To produce both discriminative and constant global descriptors, this paper proposes stable cross-image correlation enhanced model for VPR called SciceVPR. This model explores the full potential of DINOv2 in providing useful feature representations that implicitly encode valuable contextual knowledge. Specifically, SciceVPR first uses a multi-layer feature fusion module to capture increasingly detailed task-relevant channel and spatial information from the multi-layer output of DINOv2. Secondly, SciceVPR considers the invariant correlation between images within a batch as valuable knowledge to be distilled into the proposed self-enhanced encoder. In this way, SciceVPR can acquire fairly robust global features regardless of domain shifts (e.g., changes in illumination, weather and viewpoint between pictures taken in the same place). Experimental results demonstrate that the base variant, SciceVPR-B, outperforms SOTA one-stage methods with single input on multiple datasets with varying domain conditions. The large variant, SciceVPR-L, performs on par with SOTA two-stage models, scoring over 3% higher in Recall@1 compared to existing models on the challenging Tokyo24/7 dataset. Our code will be released at https://github.com/shuimushan/SciceVPR.
中文: 本文提出SciceVPR模型,通过融合DINOv2多层特征并提取稳定的跨图像关联知识,在多种域条件下实现了鲁棒的视觉位置识别性能。
English: This paper introduces SciceVPR, a model that enhances visual place recognition by leveraging multi-layer DINOv2 features and distilling stable cross-image correlations to achieve robust performance across varying domain conditions.

Authors:Yingqi Gao, Zhiling Luo
Title: Automatic database description generation for Text-to-SQL
Abstract:
In the context of the Text-to-SQL task, table and column descriptions are crucial for bridging the gap between natural language and database schema. This report proposes a method for automatically generating effective database descriptions when explicit descriptions are unavailable. The proposed method employs a dual-process approach: a coarse-to-fine process, followed by a fine-to-coarse process. The coarse-to-fine approach leverages the inherent knowledge of LLM to guide the understanding process from databases to tables and finally to columns. This approach provides a holistic understanding of the database structure and ensures contextual alignment. Conversely, the fine-to-coarse approach starts at the column level, offering a more accurate and nuanced understanding when stepping back to the table level. Experimental results on the Bird benchmark indicate that using descriptions generated by the proposed improves SQL generation accuracy by 0.93\% compared to not using descriptions, and achieves 37\% of human-level performance. The source code is publicly available at https://github.com/XGenerationLab/XiYan-DBDescGen.
Chinese: 该报告提出了一种双过程方法,在Text-to-SQL任务中自动生成数据库描述,通过从粗到细和从细到粗相结合的方式,在Bird基准测试中将SQL生成准确率提升了0.93%。
English: This report introduces a dual-process method for automatically generating database descriptions in Text-to-SQL tasks, combining coarse-to-fine and fine-to-coarse approaches to improve SQL generation accuracy by 0.93% on the Bird benchmark.

Authors:Yu Pan, Jiahao Chen, Bingrong Dai, Lin Wang, Yi Du, Jiao Liu
Title: Gungnir: Exploiting Stylistic Features in Images for Backdoor Attacks on Diffusion Models
Abstract:
In recent years, Diffusion Models (DMs) have demonstrated significant advances in the field of image generation. However, according to current research, DMs are vulnerable to backdoor attacks, which allow attackers to control the model's output by inputting data containing covert triggers, such as a specific visual patch or phrase. Existing defense strategies are well equipped to thwart such attacks through backdoor detection and trigger inversion because previous attack methods are constrained by limited input spaces and low-dimensional triggers. For example, visual triggers are easily observed by defenders, text-based or attention-based triggers are more susceptible to neural network detection. To explore more possibilities of backdoor attack in DMs, we propose Gungnir, a novel method that enables attackers to activate the backdoor in DMs through style triggers within input images. Our approach proposes using stylistic features as triggers for the first time and implements backdoor attacks successfully in image-to-image tasks by introducing Reconstructing-Adversarial Noise (RAN) and Short-Term Timesteps-Retention (STTR). Our technique generates trigger-embedded images that are perceptually indistinguishable from clean images, thus bypassing both manual inspection and automated detection neural networks. Experiments demonstrate that Gungnir can easily bypass existing defense methods. Among existing DM defense frameworks, our approach achieves a 0 backdoor detection rate (BDR). Our codes are available at https://github.com/paoche11/Gungnir.
中文摘要:本文提出Gungnir方法,首次在扩散模型中利用风格特征作为隐蔽触发器,通过重构对抗噪声和短时步保留技术实现无法被检测的后门攻击,实验表明该方法能完全规避现有防御机制。
English Summary: The paper introduces Gungnir, a novel backdoor attack method for Diffusion Models that uses imperceptible style triggers and specialized noise techniques to bypass existing defenses, achieving a 0% detection rate in experiments.

Authors:Haitao Li, Yifan Chen, Yiran Hu, Qingyao Ai, Junjie Chen, Xiaoyu Yang, Jianhui Yang, Yueyue Wu, Zeyang Liu, Yiqun Liu
Title: LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation
Abstract:
Retrieval-augmented generation (RAG) has proven highly effective in improving large language models (LLMs) across various domains. However, there is no benchmark specifically designed to assess the effectiveness of RAG in the legal domain, which restricts progress in this area. To fill this gap, we propose LexRAG, the first benchmark to evaluate RAG systems for multi-turn legal consultations. LexRAG consists of 1,013 multi-turn dialogue samples and 17,228 candidate legal articles. Each sample is annotated by legal experts and consists of five rounds of progressive questioning. LexRAG includes two key tasks: (1) Conversational knowledge retrieval, requiring accurate retrieval of relevant legal articles based on multi-turn context. (2) Response generation, focusing on producing legally sound answers. To ensure reliable reproducibility, we develop LexiT, a legal RAG toolkit that provides a comprehensive implementation of RAG system components tailored for the legal domain. Additionally, we introduce an LLM-as-a-judge evaluation pipeline to enable detailed and effective assessment. Through experimental analysis of various LLMs and retrieval methods, we reveal the key limitations of existing RAG systems in handling legal consultation conversations. LexRAG establishes a new benchmark for the practical application of RAG systems in the legal domain, with its code and data available at https://github.com/CSHaitao/LexRAG.
Chinese: LexRAG是首个针对多轮法律咨询的检索增强生成系统评估基准,通过标注对话和法律条文填补了法律领域专业评估的空白。
English: LexRAG is the first benchmark designed to evaluate retrieval-augmented generation systems for multi-turn legal consultations, featuring annotated dialogues and legal articles to address the lack of specialized assessments in the legal domain.

Authors:Li Yang, Shimaa Naser, Abdallah Shami, Sami Muhaidat, Lyndon Ong, Mérouane Debbah
Title: Towards Zero Touch Networks: Cross-Layer Automated Security Solutions for 6G Wireless Networks
Abstract:
The transition from 5G to 6G mobile networks necessitates network automation to meet the escalating demands for high data rates, ultra-low latency, and integrated technology. Recently, Zero-Touch Networks (ZTNs), driven by Artificial Intelligence (AI) and Machine Learning (ML), are designed to automate the entire lifecycle of network operations with minimal human intervention, presenting a promising solution for enhancing automation in 5G/6G networks. However, the implementation of ZTNs brings forth the need for autonomous and robust cybersecurity solutions, as ZTNs rely heavily on automation. AI/ML algorithms are widely used to develop cybersecurity mechanisms, but require substantial specialized expertise and encounter model drift issues, posing significant challenges in developing autonomous cybersecurity measures. Therefore, this paper proposes an automated security framework targeting Physical Layer Authentication (PLA) and Cross-Layer Intrusion Detection Systems (CLIDS) to address security concerns at multiple Internet protocol layers. The proposed framework employs drift-adaptive online learning techniques and a novel enhanced Successive Halving (SH)-based Automated ML (AutoML) method to automatically generate optimized ML models for dynamic networking environments. Experimental results illustrate that the proposed framework achieves high performance on the public Radio Frequency (RF) fingerprinting and the Canadian Institute for CICIDS2017 datasets, showcasing its effectiveness in addressing PLA and CLIDS tasks within dynamic and complex networking environments. Furthermore, the paper explores open challenges and research directions in the 5G/6G cybersecurity domain. This framework represents a significant advancement towards fully autonomous and secure 6G networks, paving the way for future innovations in network automation and cybersecurity.
中文: 本文提出一种采用漂移自适应在线学习和增强型AutoML的自动化安全框架,以解决5G/6G网络中的网络安全挑战,在物理层认证和跨层入侵检测方面实现高性能,并探讨了未来研究方向。
English: This paper proposes an automated security framework using drift-adaptive online learning and enhanced AutoML to address cybersecurity challenges in 5G/6G networks, achieving high performance in physical layer authentication and cross-layer intrusion detection while exploring future research directions.

Authors:Yifei Qian, Zhongliang Guo, Bowen Deng, Chun Tong Lei, Shuai Zhao, Chun Pong Lau, Xiaopeng Hong, Michael P. Pound
Title: T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting
Abstract:
Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denosing U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models' text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code is available at https://github.com/cha15yq/T2ICount.
中文:T2ICount是一种基于扩散的框架,通过分层语义校正模块和表征区域一致性损失提升文本敏感性,在多个基准测试中实现了卓越的零样本物体计数性能。
English: T2ICount is a diffusion-based framework that enhances zero-shot object counting by improving text sensitivity through a Hierarchical Semantic Correction Module and Representational Regional Coherence Loss, achieving superior performance across benchmarks.

Authors:Zhiqiang Shen, Peng Cao, Jinzhu Yang, Osmar R. Zaiane, Zhaolin Chen
Title: Style Content Decomposition-based Data Augmentation for Domain Generalizable Medical Image Segmentation
Abstract:
Due to domain shifts across diverse medical imaging modalities, learned segmentation models often suffer significant performance degradation during deployment. These domain shifts, typically caused by variations in imaging systems, generally comprise two principal components: 1) \textbf{"style" shifts}, referring to global disparities in image properties such as illumination, contrast, and color; and 2) \textbf{"content" shifts}, which involve local discrepancies in anatomical structures. To address domain shifts in medical image segmentation, a core challenge arises: how can we decouple the factors within images that determine their "style" and "content" components? To this end, we first propose a linear style-content decomposition method that factorizes an image into style codes and content maps, explicitly modeling the "style" and "content" components. Building on this, we introduce a \textbf{Sty}le-\textbf{Con}tent decomposition-based data \textbf{a}ugmentation algorithm (StyCona), which leverages this decomposition strategy to guide augmentation of both the global style and local content of source-domain images, enabling the training of a well-generalized model for domain-generalizable medical image segmentation. StyCona is a simple yet effective plug-and-play module that substantially improves model generalization without requiring additional training parameters or modifications to segmentation model architectures. Experiments on cardiac magnetic resonance imaging and fundus photography segmentation tasks, with single and multiple target domains respectively, demonstrate the effectiveness of StyCona and its superiority over state-of-the-art domain generalization methods. The code will be released at https://github.com/Senyh/StyCona.
Chinese: 提出的StyCona算法通过将医学图像分解为风格和内容成分,实现了有效的数据增强,能在不改变模型架构的情况下显著提升分割模型的泛化能力,从而解决跨域适应问题。
English: The proposed StyCona algorithm addresses domain shifts in medical image segmentation by decomposing images into style and content components, enabling effective data augmentation that improves model generalization without architectural changes.

Authors:Vicente Balmaseda, Bokun Wang, Ching-Long Lin, Tianbao Yang
Title: Discovering Global False Negatives On the Fly for Self-supervised Contrastive Learning
Abstract:
In self-supervised contrastive learning, negative pairs are typically constructed using an anchor image and a sample drawn from the entire dataset, excluding the anchor. However, this approach can result in the creation of negative pairs with similar semantics, referred to as "false negatives", leading to their embeddings being falsely pushed apart. To address this issue, we introduce GloFND, an optimization-based approach that automatically learns on the fly the threshold for each anchor data to identify its false negatives during training. In contrast to previous methods for false negative discovery, our approach globally detects false negatives across the entire dataset rather than locally within the mini-batch. Moreover, its per-iteration computation cost remains independent of the dataset size. Experimental results on image and image-text data demonstrate the effectiveness of the proposed method. Our implementation is available at https://github.com/vibalcam/GloFND.
中文: 本文提出GloFND方法,通过自适应学习每个锚点的阈值来全局识别自监督对比学习中的假阴性样本,有效避免语义相似样本被错误分离,且计算成本不随数据集规模增加。
English: This paper introduces GloFND, an optimization-based method that globally identifies false negatives in self-supervised contrastive learning by adaptively learning per-anchor thresholds, effectively preventing semantically similar pairs from being incorrectly separated while maintaining computational efficiency independent of dataset size.

Authors:Mingyuan Wu, Jize Jiang, Haozhen Zheng, Meitang Li, Zhaoheng Li, Beitong Tian, Bo Chen, Yongjoo Park, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt
Title: Cache-of-Thought: Master-Apprentice Framework for Cost-Effective Vision Language Model Reasoning
Abstract:
Vision Language Models (VLMs) have achieved remarkable success in a wide range of vision applications of increasing complexity and scales, yet choosing the right VLM model size involves a trade-off between response quality and cost. While smaller VLMs are cheaper to run, they typically produce responses only marginally better than random guessing on benchmarks such as MMMU. In this paper, we propose Cache of Thought (CoT), a master apprentice framework for collaborative inference between large and small VLMs. CoT manages high quality query results from large VLMs (master) in a cache, which are then selected via a novel multi modal retrieval and in-context learning to aid the performance of small VLMs (apprentice). We extensively evaluate CoT on various widely recognized and challenging general reasoning benchmarks, and show that CoT increases overall reasoning performance by up to 7.7% under the same budget, and specifically boosts the performance of apprentice VLMs by up to 36.6%. Our code is available at https://github.com/UIUC-MONET/Cache-of-Thoughts
Chinese: 本文提出Cache of Thought (CoT)框架,通过大型视觉语言模型的缓存响应来增强小型模型的推理能力,在相同预算下整体性能提升最高达7.7%。
English: The paper introduces Cache of Thought (CoT), a master-apprentice framework that uses cached responses from large VLMs to enhance the reasoning performance of smaller VLMs, achieving up to a 7.7% overall improvement under the same budget.

Authors:Keisuke Kamahori, Jungo Kasai, Noriyuki Kojima, Baris Kasikci
Title: LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation
Abstract:
Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in reduced dimensionality. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto frontier of accuracy and efficiency. The code of LiteASR is available at https://github.com/efeslab/LiteASR.
Chinese Summary: LiteASR提出了一种针对ASR编码器的低秩压缩方案,通过主成分分析和优化自注意力机制,在保持转录精度的同时将编码器尺寸压缩超过50%,实现了准确性与效率的新帕累托前沿。
English Summary: LiteASR introduces a low-rank compression technique for ASR encoders that reduces model size by over 50% while maintaining transcription accuracy, establishing a new Pareto frontier for speech recognition efficiency.

Authors:Vladimir Zaigrajew, Hubert Baniecki, Przemyslaw Biecek
Title: Interpreting CLIP with Hierarchical Sparse Autoencoders
Abstract:
Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing large-scale vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern systems yet remain challenging to interpret and control. However, current SAE methods are limited by optimizing both reconstruction quality and sparsity simultaneously, as they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling a direct optimization of both metrics without compromise. MSAE establishes a new state-of-the-art Pareto frontier between reconstruction quality and sparsity for CLIP, achieving 0.99 cosine similarity and less than 0.1 fraction of variance unexplained while maintaining ~80% sparsity. Finally, we demonstrate the utility of MSAE as a tool for interpreting and controlling CLIP by extracting over 120 semantic concepts from its representation to perform concept-based similarity search and bias analysis in downstream tasks like CelebA. We make the codebase available at https://github.com/WolodjaZ/MSAE.
Chinese: Matryoshka稀疏自编码器(MSAE)通过分层架构同时优化重构质量与稀疏性,在CLIP等视觉语言模型中实现了最先进的性能,并能通过语义概念提取有效进行特征解释与控制。
English: The Matryoshka Sparse Autoencoder (MSAE) introduces a hierarchical architecture that simultaneously optimizes reconstruction quality and sparsity for vision-language models like CLIP, achieving state-of-the-art performance while enabling effective feature interpretation and control through semantic concept extraction.

Authors:Kai Mei, Wujiang Xu, Shuhang Lin, Yongfeng Zhang
Title: OmniRouter: Budget and Performance Controllable Multi-LLM Routing
Abstract:
Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency, while smaller models can efficiently handle simpler tasks with fewer resources. LLM routing is a crucial paradigm that dynamically selects the most suitable large language models from a pool of candidates to process diverse inputs, ensuring optimal resource utilization while maintaining response quality. Existing routing frameworks typically model this as a locally optimal decision-making problem, selecting the presumed best-fit LLM for each query individually, which overlook global budget constraints, resulting in ineffective resource allocation. To tackle this problem, we introduce OmniRouter, a fundamentally controllable routing framework for multi-LLM serving. Instead of making per-query greedy choices, OmniRouter models the routing task as a constrained optimization problem, assigning models that minimize total cost while ensuring the required performance level. Specifically, a hybrid retrieval-augmented predictor is designed to predict the capabilities and costs of LLMs and a constrained optimizer is employed to control globally optimal query-model allocation. Experiments show that OmniRouter achieves up to 6.30% improvement in response accuracy while simultaneously reducing computational costs by at least 10.15% compared to competitive router baselines. The code and the dataset are available at https://github.com/agiresearch/OmniRouter.
中文摘要:OmniRouter是一种创新的LLM路由框架,通过将模型选择构建为约束优化问题,在降低计算成本的同时提升了响应准确性,相比现有方法表现更优。
English Summary: OmniRouter is a novel LLM routing framework that formulates model selection as a constrained optimization problem, achieving higher response accuracy while significantly reducing computational costs compared to existing methods.

Authors:Sari Masri, Huthaifa I. Ashqar, Mohammed Elhenawy
Title: Visual Reasoning at Urban Intersections: FineTuning GPT-4o for Traffic Conflict Detection
Abstract:
Traffic control in unsignalized urban intersections presents significant challenges due to the complexity, frequent conflicts, and blind spots. This study explores the capability of leveraging Multimodal Large Language Models (MLLMs), such as GPT-4o, to provide logical and visual reasoning by directly using birds-eye-view videos of four-legged intersections. In this proposed method, GPT-4o acts as intelligent system to detect conflicts and provide explanations and recommendations for the drivers. The fine-tuned model achieved an accuracy of 77.14%, while the manual evaluation of the true predicted values of the fine-tuned GPT-4o showed significant achievements of 89.9% accuracy for model-generated explanations and 92.3% for the recommended next actions. These results highlight the feasibility of using MLLMs for real-time traffic management using videos as inputs, offering scalable and actionable insights into intersections traffic management and operation. Code used in this study is available at https://github.com/sarimasri3/Traffic-Intersection-Conflict-Detection-using-images.git.
中文: 本研究证明,多模态大语言模型如GPT-4o能通过分析鸟瞰视频有效管理无信号灯交叉口,实现冲突检测和驾驶建议,在解释说明和可操作建议方面均展现出高准确率。
English: This study demonstrates that Multimodal Large Language Models like GPT-4o can effectively manage unsignalized intersections by analyzing bird's-eye-view videos to detect conflicts and provide driving recommendations, achieving high accuracy in explanations and actionable insights.

Authors:Joana C. Costa, Tiago Roxo, Hugo Proença, Pedro R. M. Inácio
Title: LISArD: Learning Image Similarity to Defend Against Gray-box Adversarial Attacks
Abstract:
State-of-the-art defense mechanisms are typically evaluated in the context of white-box attacks, which is not realistic, as it assumes the attacker can access the gradients of the target network. To protect against this scenario, Adversarial Training (AT) and Adversarial Distillation (AD) include adversarial examples during the training phase, and Adversarial Purification uses a generative model to reconstruct all the images given to the classifier. This paper considers an even more realistic evaluation scenario: gray-box attacks, which assume that the attacker knows the architecture and the dataset used to train the target network, but cannot access its gradients. We provide empirical evidence that models are vulnerable to gray-box attacks and propose LISArD, a defense mechanism that does not increase computational and temporal costs but provides robustness against gray-box and white-box attacks without including AT. Our method approximates a cross-correlation matrix, created with the embeddings of perturbed and clean images, to a diagonal matrix while simultaneously conducting classification learning. Our results show that LISArD can effectively protect against gray-box attacks, can be used in multiple architectures, and carries over its resilience to the white-box scenario. Also, state-of-the-art AD models underperform greatly when removing AT and/or moving to gray-box settings, highlighting the lack of robustness from existing approaches to perform in various conditions (aside from white-box settings). All the source code is available at https://github.com/Joana-Cabral/LISArD.
Chinese: 本文提出LISArD防御机制,通过在分类学习中将近似的扰动与干净图像嵌入的互相关矩阵对角化,无需对抗训练即可有效防御灰盒和白盒攻击,并保持计算效率。
English: This paper introduces LISArD, a defense mechanism that effectively protects models against gray-box and white-box attacks without adversarial training by approximating a cross-correlation matrix of embeddings from perturbed and clean images to a diagonal matrix during classification learning.

Authors:Jin Peng Zhou, Kaiwen Wang, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kilian Q. Weinberger, Kianté Brantley, Wen Sun
Title: $Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training
Abstract:
Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal $Q$ function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized $Q$-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online learning, providing the first bounds for deterministic MDPs under only realizability. Thanks to distributional RL, our bounds are also variance-dependent and converge faster when the reference policy has small variance. In sum, our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees. The code can be found at https://github.com/jinpz/q_sharp.
中文: 提出的$Q\sharp$算法采用基于价值的分布强化学习方法优化KL正则化强化学习,在数学推理基准上优于现有方法,同时保持理论保证和更小的KL散度。
English: The proposed $Q\sharp$ algorithm introduces a value-based approach using distributional reinforcement learning to optimize KL-regularized RL, outperforming existing methods in math reasoning while maintaining theoretical guarantees and smaller KL divergence.

Authors:Long Minh Bui, Tho Tran Huu, Duy Dinh, Tan Minh Nguyen, Trong Nghia Hoang
Title: Revisiting Kernel Attention with Correlated Gaussian Process Representation
Abstract:
Transformers have increasingly become the de facto method to model sequential data with state-of-the-art performance. Due to its widespread use, being able to estimate and calibrate its modeling uncertainty is important to understand and design robust transformer models. To achieve this, previous works have used Gaussian processes (GPs) to perform uncertainty calibration for the attention units of transformers and attained notable successes. However, such approaches have to confine the transformers to the space of symmetric attention to ensure the necessary symmetric requirement of their GP's kernel specification, which reduces the representation capacity of the model. To mitigate this restriction, we propose the Correlated Gaussian Process Transformer (CGPT), a new class of transformers whose self-attention units are modeled as cross-covariance between two correlated GPs (CGPs). This allows asymmetries in attention and can enhance the representation capacity of GP-based transformers. We also derive a sparse approximation for CGP to make it scale better. Our empirical studies show that both CGP-based and sparse CGP-based transformers achieve better performance than state-of-the-art GP-based transformers on a variety of benchmark tasks. The code for our experiments is available at https://github.com/MinhLong210/CGP-Transformers.
Chinese: 提出的相关高斯过程变换器(CGPT)通过使用相关高斯过程建模自注意力机制,克服了以往基于高斯过程的变换器必须保持对称性的限制,实现了非对称注意力并提升了模型表达能力,同时通过稀疏近似保持了良好的可扩展性。
English: The proposed Correlated Gaussian Process Transformer (CGPT) overcomes the symmetry limitations of previous GP-based transformers by using correlated Gaussian processes for self-attention, enabling asymmetric attention and improved representation capacity while maintaining scalability through sparse approximation.

Authors:Li-Wei Chen, Ombretta Strafforello, Anne-Sofie Maerten, Tinne Tuytelaars, Johan Wagemans
Title: On the Role of Individual Differences in Current Approaches to Computational Image Aesthetics
Abstract:
Image aesthetic assessment (IAA) evaluates image aesthetics, a task complicated by image diversity and user subjectivity. Current approaches address this in two stages: Generic IAA (GIAA) models estimate mean aesthetic scores, while Personal IAA (PIAA) models adapt GIAA using transfer learning to incorporate user subjectivity. However, a theoretical understanding of transfer learning between GIAA and PIAA, particularly concerning the impact of group composition, group size, aesthetic differences between groups and individuals, and demographic correlations, is lacking. This work establishes a theoretical foundation for IAA, proposing a unified model that encodes individual characteristics in a distributional format for both individual and group assessments. We show that transferring from GIAA to PIAA involves extrapolation, while the reverse involves interpolation, which is generally more effective for machine learning. Extensive experiments with varying group compositions, including sub-sampling by group size and disjoint demographics, reveal substantial performance variation even for GIAA, challenging the assumption that averaging scores eliminates individual subjectivity. Score-distribution analysis using Earth Mover's Distance (EMD) and the Gini index identifies education, photography experience, and art experience as key factors in aesthetic differences, with greater subjectivity in artworks than in photographs. Code is available at https://github.com/lwchen6309/aesthetics_transfer_learning.
中文摘要:本研究通过提出统一的分布模型为图像美学评估建立理论基础,揭示了从通用模型到个性化模型的迁移学习涉及外推过程,而反向过程涉及更有效的内插方法,并通过实验确定了影响美学主观性的关键人口统计因素。
English Summary: This study establishes a theoretical foundation for image aesthetic assessment by proposing a unified distributional model that reveals transfer learning between generic and personalized approaches involves extrapolation and interpolation respectively, with experiments identifying key demographic factors influencing aesthetic subjectivity.

Authors:Julius Broomfield, Kartik Sharma, Srijan Kumar
Title: A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs
Abstract:
Large language models (LLMs) have recently demonstrated remarkable advancements in embodying diverse personas, enhancing their effectiveness as conversational agents and virtual assistants. Consequently, LLMs have made significant strides in processing and integrating multimodal information. However, even though human personas can be expressed in both text and image, the extent to which the modality of a persona impacts the embodiment by the LLM remains largely unexplored. In this paper, we investigate how do different modalities influence the expressiveness of personas in multimodal LLMs. To this end, we create a novel modality-parallel dataset of 40 diverse personas varying in age, gender, occupation, and location. This consists of four modalities to equivalently represent a persona: image-only, text-only, a combination of image and small text, and typographical images, where text is visually stylized to convey persona-related attributes. We then create a systematic evaluation framework with 60 questions and corresponding metrics to assess how well LLMs embody each persona across its attributes and scenarios. Comprehensive experiments on $5$ multimodal LLMs show that personas represented by detailed text show more linguistic habits, while typographical images often show more consistency with the persona. Our results reveal that LLMs often overlook persona-specific details conveyed through images, highlighting underlying limitations and paving the way for future research to bridge this gap. We release the data and code at https://github.com/claws-lab/persona-modality .
中文摘要:本研究探讨了不同模态如何影响多模态大语言模型的人格体现,发现文本描述的人格能增强语言习惯,而排版图像则提高一致性,但模型常忽略图像传达的人格细节。
English Summary: This study explores how different modalities affect persona embodiment in multimodal large language models, revealing that text-based personas enhance linguistic habits while typographical images improve consistency, yet models often miss image-conveyed persona details.

Authors:Jonathan Tonglet, Tinne Tuytelaars, Marie-Francine Moens, Iryna Gurevych
Title: Protecting multimodal large language models against misleading visualizations
Abstract:
Visualizations play a pivotal role in daily communication in an increasingly datadriven world. Research on multimodal large language models (MLLMs) for automated chart understanding has accelerated massively, with steady improvements on standard benchmarks. However, for MLLMs to be reliable, they must be robust to misleading visualizations, i.e., charts that distort the underlying data, leading readers to draw inaccurate conclusions that may support disinformation. Here, we uncover an important vulnerability: MLLM questionanswering (QA) accuracy on misleading visualizations drops on average to the level of the random baseline. To address this, we introduce the first inference-time methods to improve QA performance on misleading visualizations, without compromising accuracy on non-misleading ones. We find that two methods, table-based QA and redrawing the visualization, are effective, with improvements of up to 19.6 percentage points. We make our code and data available.
中文摘要:多模态大语言模型在误导性图表面前存在显著漏洞,其问答准确率会降至随机基线水平,但采用基于表格的问答和图表重绘等新型推理时方法可提升多达19.6个百分点的性能,同时不影响正常图表的处理精度。
English Summary: Multimodal large language models exhibit significant vulnerability to misleading visualizations, dropping to random baseline accuracy, but new inference-time methods like table-based QA and chart redrawing can improve performance by up to 19.6 percentage points without affecting standard chart accuracy.

Authors:Tianyi Lorena Yan, Robin Jia
Title: Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries
Abstract:
To answer one-to-many factual queries (e.g., listing cities of a country), a language model (LM) must simultaneously recall knowledge and avoid repeating previous answers. How are these two subtasks implemented and integrated internally? Across multiple datasets, models, and prompt templates, we identify a promote-then-suppress mechanism: the model first recalls all answers, and then suppresses previously generated ones. Specifically, LMs use both the subject and previous answer tokens to perform knowledge recall, with attention propagating subject information and MLPs promoting the answers. Then, attention attends to and suppresses previous answer tokens, while MLPs amplify the suppression signal. Our mechanism is corroborated by extensive experimental evidence: in addition to using early decoding and causal tracing, we analyze how components use different tokens by introducing both Token Lens, which decodes aggregated attention updates from specified tokens, and a knockout method that analyzes changes in MLP outputs after removing attention to specified tokens. Overall, we provide new insights into how LMs' internal components interact with different input tokens to support complex factual recall. Code is available at https://github.com/Lorenayannnnn/how-lms-answer-one-to-many-factual-queries.
中文: 语言模型采用“先促进后抑制”机制,通过注意力与多层感知机先召回全部事实答案再抑制已生成内容,这一机制经Token Lens和敲除分析等实验方法得到验证。
English: Language models employ a promote-then-suppress mechanism, using attention and MLPs to first recall all factual answers and then suppress previously generated ones, as validated through experimental techniques like Token Lens and knockout analysis.

Authors:Yuval Filmus
Title: Aggregation of evaluations without unanimity
Abstract:
Dokow and Holzman determined which predicates over $\{0, 1\}$ satisfy an analog of Arrow's theorem: all unanimous aggregators are dictatorial. Szegedy and Xu, extending earlier work of Dokow and Holzman, extended this to predicates over arbitrary finite alphabets. Mossel extended Arrow's theorem in an orthogonal direction, determining all aggregators without the assumption of unanimity. We bring together both threads of research by extending the results of Dokow-Holzman and Szegedy-Xu to the setting of Mossel. As an application, we determine, for each symmetric predicate over $\{0,1\}$, all of its aggregators.
中文: 本研究将Dokow-Holzman和Szegedy-Xu的研究方向与Mossel对阿罗定理的扩展相结合,在不假设一致性的前提下,确定了{0,1}上对称谓词的所有聚合算子。
English: This work unifies the research threads of Dokow-Holzman and Szegedy-Xu with Mossel's extension of Arrow's theorem, characterizing all aggregators for symmetric predicates over {0,1} without assuming unanimity.

Authors:Yiheng Liu, Xiaohui Gao, Haiyang Sun, Bao Ge, Tianming Liu, Junwei Han, Xintao Hu
Title: Brain-Inspired Exploration of Functional Networks and Key Neurons in Large Language Models
Abstract:
In recent years, the rapid advancement of large language models (LLMs) in natural language processing has sparked significant interest among researchers to understand their mechanisms and functional characteristics. Although existing studies have attempted to explain LLM functionalities by identifying and interpreting specific neurons, these efforts mostly focus on individual neuron contributions, neglecting the fact that human brain functions are realized through intricate interaction networks. Inspired by cognitive neuroscience research on functional brain networks (FBNs), this study introduces a novel approach to investigate whether similar functional networks exist within LLMs. We use methods similar to those in the field of functional neuroimaging analysis to locate and identify functional networks in LLM. Experimental results show that, similar to the human brain, LLMs contain functional networks that frequently recur during operation. Further analysis shows that these functional networks are crucial for LLM performance. Masking key functional networks significantly impairs the model's performance, while retaining just a subset of these networks is adequate to maintain effective operation. This research provides novel insights into the interpretation of LLMs and the lightweighting of LLMs for certain downstream tasks. Code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION.
Chinese: 本研究受脑功能网络启发,提出了一种识别大语言模型中重复出现的功能网络的新方法,揭示了这些网络对模型性能的关键作用及其在模型轻量化方面的潜力。
English: This study introduces a novel approach inspired by functional brain networks to identify recurring functional networks in large language models, demonstrating their critical role in model performance and potential for model lightweighting.

Authors:Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury
Title: Multi-Turn Code Generation Through Single-Step Rewards
Abstract:
We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $μ$Code, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $μ$Code iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $μ$Code at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.
Chinese: 提出的$μ$Code方法通过使用单步奖励,训练生成器和验证器基于执行反馈迭代改进代码解决方案,为多轮代码生成提供了一种简单且可扩展的途径,相比现有方法取得了显著性能提升。
English: The proposed $μ$Code method introduces a simple and scalable approach to multi-turn code generation by using single-step rewards, training both a generator and a verifier to iteratively improve code solutions based on execution feedback, achieving significant performance gains over existing methods.

Authors:Albert Gong, Kamilė Stankevičiūtė, Chao Wan, Anmol Kabra, Raphael Thesmar, Johann Lee, Julius Klenke, Carla P. Gomes, Kilian Q. Weinberger
Title: PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation
Abstract:
High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating datasets for this purpose is not a permanent solution as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities. Our code is available at https://github.com/kilian-group/phantom-wiki.
中文: PhantomWiki是一种创新的流程,可按需生成独特的文档库和问答对,用于评估大型语言模型,有效解决数据泄露问题,并能够分别评估推理和检索能力。
English: PhantomWiki is a novel pipeline that generates unique, on-demand document corpora and question-answer pairs to evaluate large language models, effectively addressing data leakage and enabling disentangled assessment of reasoning and retrieval capabilities.

Authors:Shuming Liu, Chen Zhao, Fatimah Zohra, Mattia Soldan, Alejandro Pardo, Mengmeng Xu, Lama Alssum, Merey Ramazanova, Juan León Alcázar, Anthony Cioppa, Silvio Giancola, Carlos Hinojosa, Bernard Ghanem
Title: OpenTAD: A Unified Framework and Comprehensive Study of Temporal Action Detection
Abstract:
Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field has achieved remarkable progress in recent years, further progress and real-world applications are impeded by the absence of a standardized framework. Currently, different methods are compared under different implementation settings, evaluation protocols, etc., making it difficult to assess the real effectiveness of a specific technique. To address this issue, we propose \textbf{OpenTAD}, a unified TAD framework consolidating 16 different TAD methods and 9 standard datasets into a modular codebase. In OpenTAD, minimal effort is required to replace one module with a different design, train a feature-based TAD model in end-to-end mode, or switch between the two. OpenTAD also facilitates straightforward benchmarking across various datasets and enables fair and in-depth comparisons among different methods. With OpenTAD, we comprehensively study how innovations in different network components affect detection performance and identify the most effective design choices through extensive experiments. This study has led to a new state-of-the-art TAD method built upon existing techniques for each component. We have made our code and models available at https://github.com/sming256/OpenTAD.
中文摘要:OpenTAD是一个统一框架,通过整合多种时序动作检测方法和数据集实现标准化,借助模块化设计促进公平比较并达到了最先进的检测性能。
English Summary: OpenTAD is a unified framework that standardizes temporal action detection by integrating multiple methods and datasets, enabling fair comparisons and achieving state-of-the-art performance through modular design.

Authors:Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, Xiaojuan Qi
Title: UniTok: A Unified Tokenizer for Visual Generation and Understanding
Abstract:
Visual generative and understanding models typically rely on distinct tokenizers to process images, presenting a key challenge for unifying them within a single framework. Recent studies attempt to address this by connecting the training of VQVAE (for autoregressive generation) and CLIP (for understanding) to build a unified tokenizer. However, directly combining these training objectives has been observed to cause severe loss conflicts. In this paper, we show that reconstruction and semantic supervision do not inherently conflict. Instead, the underlying bottleneck stems from limited representational capacity of discrete token space. Building on these insights, we introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism that effectively scales up the vocabulary size and bottleneck dimension. In terms of final performance, UniTok sets a new record of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet. Besides, UniTok can be seamlessly integrated into MLLMs to unlock native visual generation capability, without compromising the understanding performance. Additionally, we show that UniTok favors cfg-free generation, reducing gFID from 14.6 to 2.5 on ImageNet 256$\times$256 benchmark. GitHub: https://github.com/FoundationVision/UniTok.
Chinese: UniTok通过创新的多码本量化机制构建统一分词器,有效扩展词汇量和瓶颈维度,在图像生成与理解任务中均创下新记录,且能无缝集成到多模态大模型中实现原生视觉生成能力。
English: UniTok introduces a unified tokenizer with a multi-codebook quantization mechanism that overcomes representational limitations in discrete token spaces, achieving state-of-the-art performance in both image generation and understanding tasks without inherent conflicts between objectives.

Authors:Yongjia Lei, Haoyu Han, Ryan A. Rossi, Franck Dernoncourt, Nedim Lipka, Mahantesh M Halappanavar, Jiliang Tang, Yu Wang
Title: Mixture of Structural-and-Textual Retrieval over Text-rich Graph Knowledge Bases
Abstract:
Text-rich Graph Knowledge Bases (TG-KBs) have become increasingly crucial for answering queries by providing textual and structural knowledge. However, current retrieval methods often retrieve these two types of knowledge in isolation without considering their mutual reinforcement and some hybrid methods even bypass structural retrieval entirely after neighboring aggregation. To fill in this gap, we propose a Mixture of Structural-and-Textual Retrieval (MoR) to retrieve these two types of knowledge via a Planning-Reasoning-Organizing framework. In the Planning stage, MoR generates textual planning graphs delineating the logic for answering queries. Following planning graphs, in the Reasoning stage, MoR interweaves structural traversal and textual matching to obtain candidates from TG-KBs. In the Organizing stage, MoR further reranks fetched candidates based on their structural trajectory. Extensive experiments demonstrate the superiority of MoR in harmonizing structural and textual retrieval with insights, including uneven retrieving performance across different query logics and the benefits of integrating structural trajectories for candidate reranking. Our code is available at https://github.com/Yoega/MoR.
Chinese: 提出的结构-文本混合检索框架通过规划、推理和组织三个阶段,将图结构遍历与文本匹配相结合,有效提升了富文本图知识库的检索效果,展现出在协调结构性和文本性知识方面的优越性能。
English: The proposed Mixture of Structural-and-Textual Retrieval (MoR) framework integrates structural traversal and textual matching through planning, reasoning, and organizing stages to enhance retrieval from Text-rich Graph Knowledge Bases, demonstrating superior performance in harmonizing both knowledge types.

Authors:Xiuli Bi, Jianfei Yuan, Bo Liu, Yong Zhang, Xiaodong Cun, Chi-Man Pun, Bin Xiao
Title: Mobius: Text to Seamless Looping Video Generation via Latent Shift
Abstract:
We present Mobius, a novel method to generate seamlessly looping videos from text descriptions directly without any user annotations, thereby creating new visual materials for the multi-media presentation. Our method repurposes the pre-trained video latent diffusion model for generating looping videos from text prompts without any training. During inference, we first construct a latent cycle by connecting the starting and ending noise of the videos. Given that the temporal consistency can be maintained by the context of the video diffusion model, we perform multi-frame latent denoising by gradually shifting the first-frame latent to the end in each step. As a result, the denoising context varies in each step while maintaining consistency throughout the inference process. Moreover, the latent cycle in our method can be of any length. This extends our latent-shifting approach to generate seamless looping videos beyond the scope of the video diffusion model's context. Unlike previous cinemagraphs, the proposed method does not require an image as appearance, which will restrict the motions of the generated results. Instead, our method can produce more dynamic motion and better visual quality. We conduct multiple experiments and comparisons to verify the effectiveness of the proposed method, demonstrating its efficacy in different scenarios. All the code will be made available.
Chinese: Mobius提出了一种无需训练即可从文本描述直接生成无缝循环视频的新方法,通过预训练的视频潜在扩散模型和潜在循环结构,确保时间一致性和动态视觉效果。
English: Mobius introduces a novel method for generating seamlessly looping videos from text descriptions without training, utilizing a pre-trained video latent diffusion model and a latent cycle approach to ensure temporal consistency and dynamic motion.

Authors:Qingsen Yan, Yixu Feng, Cheng Zhang, Guansong Pang, Kangbiao Shi, Peng Wu, Wei Dong, Jinqiu Sun, Yanning Zhang
Title: HVI: A New Color Space for Low-light Image Enhancement
Abstract:
Low-Light Image Enhancement (LLIE) is a crucial computer vision task that aims to restore detailed visual information from corrupted low-light images. Many existing LLIE methods are based on standard RGB (sRGB) space, which often produce color bias and brightness artifacts due to inherent high color sensitivity in sRGB. While converting the images using Hue, Saturation and Value (HSV) color space helps resolve the brightness issue, it introduces significant red and black noise artifacts. To address this issue, we propose a new color space for LLIE, namely Horizontal/Vertical-Intensity (HVI), defined by polarized HS maps and learnable intensity. The former enforces small distances for red coordinates to remove the red artifacts, while the latter compresses the low-light regions to remove the black artifacts. To fully leverage the chromatic and intensity information, a novel Color and Intensity Decoupling Network (CIDNet) is further introduced to learn accurate photometric mapping function under different lighting conditions in the HVI space. Comprehensive results from benchmark and ablation experiments show that the proposed HVI color space with CIDNet outperforms the state-of-the-art methods on 10 datasets. The code is available at https://github.com/Fediory/HVI-CIDNet.
中文摘要:提出的HVI色彩空间与CIDNet相结合,能有效消除低光图像增强中的红黑伪影,在多个数据集上实现卓越性能。
English Summary: The proposed HVI color space combined with CIDNet effectively eliminates red and black artifacts in low-light image enhancement, achieving superior performance across multiple datasets.

Authors:Rongzhen Zhao, Vivienne Wang, Juho Kannala, Joni Pajarinen
Title: Vector-Quantized Vision Foundation Models for Object-Centric Learning
Abstract:
Object-Centric Learning (OCL) aggregates image or video feature maps into object-level feature vectors, termed \textit{slots}. It's self-supervision of reconstructing the input from slots struggles with complex object textures, thus Vision Foundation Model (VFM) representations are used as the aggregation input and reconstruction target. Existing methods leverage VFM representations in diverse ways yet fail to fully exploit their potential. In response, we propose a unified architecture, Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or VVO). The key to our unification is simply shared quantizing VFM representations in OCL aggregation and decoding. Experiments show that across different VFMs, aggregators and decoders, our VVO consistently outperforms baselines in object discovery and recognition, as well as downstream visual prediction and reasoning. We also mathematically analyze why VFM representations facilitate OCL aggregation and why their shared quantization as reconstruction targets strengthens OCL supervision. Our source code and model checkpoints are available on https://github.com/Genera1Z/VQ-VFM-OCL.
中文: 本文提出VQ-VFM-OCL(VVO)统一架构,通过在对象中心学习的聚合和解码中共享量化视觉基础模型表示,在物体发现、识别及下游任务中持续超越基线,并对其有效性进行了数学分析。
English: This paper introduces VQ-VFM-OCL (VVO), a unified architecture that enhances Object-Centric Learning by sharing quantized Vision Foundation Model representations across aggregation and decoding, consistently outperforming baselines in object discovery, recognition, and downstream tasks while providing mathematical analysis of its effectiveness.

Authors:Yang Zhou, Xu Gao, Zichong Chen, Hui Huang
Title: Attention Distillation: A Unified Approach to Visual Characteristics Transfer
Abstract:
Recent advances in generative diffusion models have shown a notable inherent understanding of image style and semantics. In this paper, we leverage the self-attention features from pretrained diffusion networks to transfer the visual characteristics from a reference to generated images. Unlike previous work that uses these features as plug-and-play attributes, we propose a novel attention distillation loss calculated between the ideal and current stylization results, based on which we optimize the synthesized image via backpropagation in latent space. Next, we propose an improved Classifier Guidance that integrates attention distillation loss into the denoising sampling process, further accelerating the synthesis and enabling a broad range of image generation applications. Extensive experiments have demonstrated the extraordinary performance of our approach in transferring the examples' style, appearance, and texture to new images in synthesis. Code is available at https://github.com/xugao97/AttentionDistillation.
Chinese: 本文提出了一种新颖的注意力蒸馏损失和改进的分类器引导方法,利用预训练扩散模型的自注意力特征,将参考图像的视觉特征有效迁移至生成图像,在加速合成过程的同时实现了卓越的风格与纹理迁移效果。
English: This paper introduces a novel attention distillation loss and enhanced Classifier Guidance method that leverages self-attention features from pretrained diffusion models to effectively transfer visual characteristics from reference images to generated ones, achieving superior style and texture synthesis with accelerated performance.

Authors:Mattéo Clémot, Julie Digne, Julien Tierny
Title: Topological Autoencoders++: Fast and Accurate Cycle-Aware Dimensionality Reduction
Abstract:
This paper presents a novel topology-aware dimensionality reduction approach aiming at accurately visualizing the cyclic patterns present in high dimensional data. To that end, we build on the Topological Autoencoders (TopoAE) formulation. First, we provide a novel theoretical analysis of its associated loss and show that a zero loss indeed induces identical persistence pairs (in high and low dimensions) for the $0$-dimensional persistent homology (PH$^0$) of the Rips filtration. We also provide a counter example showing that this property no longer holds for a naive extension of TopoAE to PH$^d$ for $d\ge 1$. Based on this observation, we introduce a novel generalization of TopoAE to $1$-dimensional persistent homology (PH$^1$), called TopoAE++, for the accurate generation of cycle-aware planar embeddings, addressing the above failure case. This generalization is based on the notion of cascade distortion, a new penalty term favoring an isometric embedding of the $2$-chains filling persistent $1$-cycles, hence resulting in more faithful geometrical reconstructions of the $1$-cycles in the plane. We further introduce a novel, fast algorithm for the exact computation of PH for Rips filtrations in the plane, yielding improved runtimes over previously documented topology-aware methods. Our method also achieves a better balance between the topological accuracy, as measured by the Wasserstein distance, and the visual preservation of the cycles in low dimensions. Our C++ implementation is available at https://github.com/MClemot/TopologicalAutoencodersPlusPlus.
中文: 本文提出了TopoAE++,一种改进的拓扑感知降维方法,通过将拓扑自编码器推广至一维持续性同调并引入级联失真惩罚项,能精确可视化高维数据中的循环模式,在提升计算效率的同时更好地平衡了拓扑精度与视觉保持。
English: This paper introduces TopoAE++, an enhanced topology-aware dimensionality reduction method that accurately visualizes cyclic patterns in high-dimensional data by generalizing Topological Autoencoders to 1-dimensional persistent homology with a novel cascade distortion penalty, achieving improved runtime and better balance between topological accuracy and visual preservation.

Authors:Mattéo Clémot, Julie Digne, Julien Tierny
Title: Topological Autoencoders++: Fast and Accurate Cycle-Aware Dimensionality Reduction
Abstract:
This paper presents a novel topology-aware dimensionality reduction approach aiming at accurately visualizing the cyclic patterns present in high dimensional data. To that end, we build on the Topological Autoencoders (TopoAE) formulation. First, we provide a novel theoretical analysis of its associated loss and show that a zero loss indeed induces identical persistence pairs (in high and low dimensions) for the $0$-dimensional persistent homology (PH$^0$) of the Rips filtration. We also provide a counter example showing that this property no longer holds for a naive extension of TopoAE to PH$^d$ for $d\ge 1$. Based on this observation, we introduce a novel generalization of TopoAE to $1$-dimensional persistent homology (PH$^1$), called TopoAE++, for the accurate generation of cycle-aware planar embeddings, addressing the above failure case. This generalization is based on the notion of cascade distortion, a new penalty term favoring an isometric embedding of the $2$-chains filling persistent $1$-cycles, hence resulting in more faithful geometrical reconstructions of the $1$-cycles in the plane. We further introduce a novel, fast algorithm for the exact computation of PH for Rips filtrations in the plane, yielding improved runtimes over previously documented topology-aware methods. Our method also achieves a better balance between the topological accuracy, as measured by the Wasserstein distance, and the visual preservation of the cycles in low dimensions. Our C++ implementation is available at https://github.com/MClemot/TopologicalAutoencodersPlusPlus.
中文: 本文提出了TopoAE++,一种改进的拓扑感知降维方法,通过将拓扑自编码器推广至一维持续性同调并引入级联失真惩罚项,能精确可视化高维数据中的循环模式,在提升计算效率的同时更好地平衡了拓扑精度与视觉保持。
English: This paper introduces TopoAE++, an enhanced topology-aware dimensionality reduction method that accurately visualizes cyclic patterns in high-dimensional data by generalizing Topological Autoencoders to 1-dimensional persistent homology with a novel cascade distortion penalty, achieving improved runtime and better balance between topological accuracy and visual preservation.

Authors:Zhouyu He, Peng Qiao, Rongchun Li, Yong Dou, Yusong Tan
Title: Highly Parallelized Reinforcement Learning Training with Relaxed Assignment Dependencies
Abstract:
As the demands for superior agents grow, the training complexity of Deep Reinforcement Learning (DRL) becomes higher. Thus, accelerating training of DRL has become a major research focus. Dividing the DRL training process into subtasks and using parallel computation can effectively reduce training costs. However, current DRL training systems lack sufficient parallelization due to data assignment between subtask components. This assignment issue has been ignored, but addressing it can further boost training efficiency. Therefore, we propose a high-throughput distributed RL training system called TianJi. It relaxes assignment dependencies between subtask components and enables event-driven asynchronous communication. Meanwhile, TianJi maintains clear boundaries between subtask components. To address convergence uncertainty from relaxed assignment dependencies, TianJi proposes a distributed strategy based on the balance of sample production and consumption. The strategy controls the staleness of samples to correct their quality, ensuring convergence. We conducted extensive experiments. TianJi achieves a convergence time acceleration ratio of up to 4.37 compared to related comparison systems. When scaled to eight computational nodes, TianJi shows a convergence time speedup of 1.6 and a throughput speedup of 7.13 relative to XingTian, demonstrating its capability to accelerate training and scalability. In data transmission efficiency experiments, TianJi significantly outperforms other systems, approaching hardware limits. TianJi also shows effectiveness in on-policy algorithms, achieving convergence time acceleration ratios of 4.36 and 2.95 compared to RLlib and XingTian. TianJi is accessible at https://github.com/HiPRL/TianJi.git.
中文: 提出的TianJi系统通过解除子任务间的分配依赖并采用事件驱动异步通信,将深度强化学习训练加速高达4.37倍,同时保持系统可扩展性和传输效率。
English: The proposed TianJi system accelerates Deep Reinforcement Learning training by relaxing assignment dependencies between subtasks and implementing event-driven asynchronous communication, achieving up to 4.37 times faster convergence while maintaining scalability and transmission efficiency.

Authors:Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang
Title: Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
Abstract:
The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, like canny and depth map, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images in the generation process. To mitigate the gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space, where image and text can be well-aligned to serve as a condition for external diffusion models. Based on this discovery, we propose Dream Engine, an efficient and unified framework designed for arbitrary text-image interleaved control in image generation models. Building on powerful text-to-image models like SD3.5, we replace the original text-only encoders by incorporating versatile multimodal information encoders such as QwenVL. Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark, and matching the performance of state-of-the-art text-to-image models like SD3.5 and FLUX.
Chinese: Dream Engine框架通过将多模态编码器与扩散模型结合,采用两阶段训练方法,有效解决了图像生成中文本与图像交替控制的统一性问题,并在性能上达到先进水平。
English: The Dream Engine framework addresses the lack of unified text-image interleaved control in image generation by integrating multimodal encoders like QwenVL with diffusion models, achieving competitive performance through a two-stage training approach.

Authors:Yating Yu, Congqi Cao, Yifan Zhang, Yanning Zhang
Title: Learning to Generalize without Bias for Open-Vocabulary Action Recognition
Abstract:
Leveraging the effective visual-text alignment and static generalizability from CLIP, recent video learners adopt CLIP initialization with further regularization or recombination for generalization in open-vocabulary action recognition in-context. However, due to the static bias of CLIP, such video learners tend to overfit on shortcut static features, thereby compromising their generalizability, especially to novel out-of-context actions. To address this issue, we introduce Open-MeDe, a novel Meta-optimization framework with static Debiasing for Open-vocabulary action recognition. From a fresh perspective of generalization, Open-MeDe adopts a meta-learning approach to improve known-to-open generalizing and image-to-video debiasing in a cost-effective manner. Specifically, Open-MeDe introduces a cross-batch meta-optimization scheme that explicitly encourages video learners to quickly generalize to arbitrary subsequent data via virtual evaluation, steering a smoother optimization landscape. In effect, the free of CLIP regularization during optimization implicitly mitigates the inherent static bias of the video meta-learner. We further apply self-ensemble over the optimization trajectory to obtain generic optimal parameters that can achieve robust generalization to both in-context and out-of-context novel data. Extensive evaluations show that Open-MeDe not only surpasses state-of-the-art regularization methods tailored for in-context open-vocabulary action recognition but also substantially excels in out-of-context scenarios.Code is released at https://github.com/Mia-YatingYu/Open-MeDe.
中文摘要:Open-MeDe是一种新颖的元优化框架,通过跨批次元优化方案和自集成方法,有效减轻CLIP的静态偏差,显著提升了视频学习器在上下文内外开放词汇动作识别中的泛化能力。
English Summary: Open-MeDe is a meta-optimization framework that addresses CLIP's static bias in video learners by employing cross-batch meta-optimization and self-ensemble techniques to enhance generalization for both in-context and out-of-context open-vocabulary action recognition.

Authors:Yifan Jia, Xingda Yu, Zhengyang Ji, Songning Lai, Yutao Yue
Title: Adaptive H&E-IHC information fusion staining framework based on feature extra
Abstract:
Immunohistochemistry (IHC) staining plays a significant role in the evaluation of diseases such as breast cancer. The H&E-to-IHC transformation based on generative models provides a simple and cost-effective method for obtaining IHC images. Although previous models can perform digital coloring well, they still suffer from (i) coloring only through the pixel features that are not prominent in HE, which is easy to cause information loss in the coloring process; (ii) The lack of pixel-perfect H&E-IHC groundtruth pairs poses a challenge to the classical L1 loss.To address the above challenges, we propose an adaptive information enhanced coloring framework based on feature extractors. We first propose the VMFE module to effectively extract the color information features using multi-scale feature extraction and wavelet transform convolution, while combining the shared decoder for feature fusion. The high-performance dual feature extractor of H&E-IHC is trained by contrastive learning, which can effectively perform feature alignment of HE-IHC in high latitude space. At the same time, the trained feature encoder is used to enhance the features and adaptively adjust the loss in the HE section staining process to solve the problems related to unclear and asymmetric information. We have tested on different datasets and achieved excellent performance.Our code is available at https://github.com/babyinsunshine/CEFF
中文: 本研究提出了一种自适应信息增强染色框架,通过多尺度特征提取和对比学习优化H&E到IHC的图像转换,有效解决了特征丢失和对齐问题,在多个数据集上取得了优异性能。
English: This study introduces an adaptive information-enhanced coloring framework that utilizes multi-scale feature extraction and contrastive learning to improve H&E-to-IHC image transformation by addressing feature loss and alignment issues, achieving superior performance on various datasets.

Authors:Yifan Zhang, Wenyu Du, Dongming Jin, Jie Fu, Zhi Jin
Title: Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking
Abstract:
Chain-of-thought (CoT) significantly enhances the performance of large language models (LLMs) across a wide range of tasks, and prior research shows that CoT can theoretically increase expressiveness. However, there is limited mechanistic understanding of the algorithms that Transformer+CoT can learn. Our key contributions are: (1) We evaluate the state tracking capabilities of Transformer+CoT and its variants, confirming the effectiveness of CoT. (2) Next, we identify the circuit (a subset of model components, responsible for tracking the world state), indicating that late-layer MLP neurons play a key role. We propose two metrics, compression and distinction, and show that the neuron sets for each state achieve nearly 100% accuracy, providing evidence of an implicit finite state automaton (FSA) embedded within the model. (3) Additionally, we explore three challenging settings: skipping intermediate steps, introducing data noises, and testing length generalization. Our results demonstrate that Transformer+CoT learns robust algorithms (FSAs), highlighting its resilience in challenging scenarios. Our code is available at https://github.com/IvanChangPKU/FSA.
中文: 思维链(CoT)通过激活后层MLP神经元形成隐式有限状态自动机,显著增强Transformer模型的性能,并在复杂场景中展现出强大鲁棒性。
English: Chain-of-thought (CoT) boosts Transformer model performance by enabling implicit finite state automata through late-layer MLP neurons, demonstrating robustness in challenging scenarios.

Authors:Lin Zhang, Yi Tian, XiYun Wang, Wanru Xu, Yi Jin, Yaping Huang
Title: Differential Contrastive Training for Gaze Estimation
Abstract:
The complex application scenarios have raised critical requirements for precise and generalizable gaze estimation methods. Recently, the pre-trained CLIP has achieved remarkable performance on various vision tasks, but its potentials have not been fully exploited in gaze estimation. In this paper, we propose a novel Differential Contrastive Training strategy, which boosts gaze estimation performance with the help of the CLIP. Accordingly, a Differential Contrastive Gaze Estimation network (DCGaze) composed of a Visual Appearance-aware branch and a Semantic Differential-aware branch is introduced. The Visual Appearance-aware branch is essentially a primary gaze estimation network and it incorporates an Adaptive Feature-refinement Unit (AFU) and a Double-head Gaze Regressor (DGR), which both help the primary network to extract informative and gaze-related appearance features. Moreover, the Semantic Difference-aware branch is designed on the basis of the CLIP's text encoder to reveal the semantic difference of gazes. This branch could further empower the Visual Appearance-aware branch with the capability of characterizing the gaze-related semantic information. Extensive experimental results on four challenging datasets over within and cross-domain tasks demonstrate the effectiveness of our DCGaze.The code is available at https://github.com/LinZhang-bjtu/DCGaze.
中文摘要:本文提出了一种利用CLIP的差分对比训练策略,通过结合视觉外观感知和语义差异感知的DCGaze网络,显著提升了跨域视线估计任务的性能表现。
English Summary: This paper introduces a Differential Contrastive Training strategy leveraging CLIP to enhance gaze estimation, resulting in the DCGaze network that integrates visual appearance and semantic difference awareness for improved performance across challenging datasets.

Authors:Tergel Munkhbat, Namgyu Ho, Seo Hyun Kim, Yongjin Yang, Yujin Kim, Se-Young Yun
Title: Self-Training Elicits Concise Reasoning in Large Language Models
Abstract:
Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence on their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training. Code is available at https://github.com/TergelMunkhbat/concise-reasoning
中文: 思维链推理常产生冗余标记,但通过基于自生成简洁路径的微调,模型能在保持准确性的同时平均减少30%的输出长度。
English: Chain-of-thought reasoning in LLMs often produces redundant tokens, but by fine-tuning with self-generated concise paths, models can reduce output length by 30% while maintaining accuracy.

Authors:Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, Luc Van Gool
Title: UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler
Abstract:
Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepthV2 implements a self-promptable camera module predicting a dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. UniDepthV2 improves its predecessor UniDepth model via a new edge-guided loss which enhances the localization and sharpness of edges in the metric depth outputs, a revisited, simplified and more efficient architectural design, and an additional uncertainty-level output which enables downstream tasks requiring confidence. Thorough evaluations on ten depth datasets in a zero-shot regime consistently demonstrate the superior performance and generalization of UniDepthV2. Code and models are available at https://github.com/lpiccinelli-eth/UniDepth
中文: UniDepthV2是一种新颖的单目测距深度估计模型,能够直接从单张图像预测三维点,通过改进的边缘引导损失和几何不变性实现跨领域的卓越泛化能力,无需额外信息。
English: UniDepthV2 is a novel model that enables accurate monocular metric depth estimation across domains by directly predicting 3D points from single images, featuring enhanced edge precision and geometric invariance for superior generalization without requiring additional data.

Authors:Joris J. Weeda, Saray Bakker, Gang Chen, Javier Alonso-Mora
Title: Pushing Through Clutter With Movability Awareness of Blocking Obstacles
Abstract:
Navigation Among Movable Obstacles (NAMO) poses a challenge for traditional path-planning methods when obstacles block the path, requiring push actions to reach the goal. We propose a framework that enables movability-aware planning to overcome this challenge without relying on explicit obstacle placement. Our framework integrates a global Semantic Visibility Graph and a local Model Predictive Path Integral (SVG-MPPI) approach to efficiently sample rollouts, taking into account the continuous range of obstacle movability. A physics engine is adopted to simulate the interaction result of the rollouts with the environment, and generate trajectories that minimize contact force. In qualitative and quantitative experiments, SVG-MPPI outperforms the existing paradigm that uses only binary movability for planning, achieving higher success rates with reduced cumulative contact forces. Our code is available at: https://github.com/tud-amr/SVG-MPPI
Chinese: 提出的SVG-MPPI框架通过整合全局语义图与局部路径采样及物理模拟,实现了对可移动障碍物的运动感知规划,相比二元可移动性方法以更高成功率和更小接触力表现更优。
English: The proposed SVG-MPPI framework enables movability-aware planning for Navigation Among Movable Obstacles by integrating global semantic graphs with local path sampling and physics simulation, outperforming binary movability approaches with higher success rates and reduced contact forces.

Authors:Xuzheng Yang, Junzhuo Liu, Peng Wang, Guoqing Wang, Yang Yang, Heng Tao Shen
Title: New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration
Abstract:
Referring Expression Comprehension (REC) is a foundational cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding. It serves as an essential testing ground for Multimodal Large Language Models (MLLMs). To advance this field, we introduced a new REC dataset in our previous conference paper, characterized by two key features. First, it is designed with controllable difficulty levels, requiring multi-level fine-grained reasoning across object categories, attributes, and multi-hop relationships. Second, it incorporates negative text and images generated through fine-grained editing and augmentation, explicitly testing a model's ability to reject scenarios where the target object is absent, an often overlooked yet critical challenge in existing datasets. In this extended work, we propose two new methods to tackle the challenges of fine-grained REC by combining the strengths of Specialist Models and MLLMs. The first method adaptively assigns simple cases to faster, lightweight models and reserves complex ones for powerful MLLMs, balancing accuracy and efficiency. The second method lets a specialist generate a set of possible object regions, and the MLLM selects the most plausible one using its reasoning ability. These collaborative strategies lead to significant improvements on our dataset and other challenging benchmarks. Our results show that combining specialized and general-purpose models offers a practical path toward solving complex real-world vision-language tasks. Our dataset and code are available at https://github.com/sleepyshep/FineCops-Ref.
中文: 本研究提出了一个具有可控难度和负样本的指代表达理解数据集,并设计了两种结合专家模型与多模态大语言模型的协作方法,通过优势互补在保证效率的同时显著提升了细粒度推理能力。
English: This work introduces a dataset for Referring Expression Comprehension that features controllable difficulty and negative examples, and proposes two collaborative methods combining Specialist Models and Multimodal Large Language Models to enhance fine-grained reasoning while balancing accuracy and efficiency.

Authors:Gilles Van De Vyver, Aksel Try Lenz, Erik Smistad, Sindre Hellum Olaisen, Bjørnar Grenne, Espen Holte, Håavard Dalen, Lasse Løvstakken
Title: Generative augmentations for improved cardiac ultrasound segmentation using diffusion models
Abstract:
One of the main challenges in current research on segmentation in cardiac ultrasound is the lack of large and varied labeled datasets and the differences in annotation conventions between datasets. This makes it difficult to design robust segmentation models that generalize well to external datasets. This work utilizes diffusion models to create generative augmentations that can significantly improve diversity of the dataset and thus the generalisability of segmentation models without the need for more annotated data. The augmentations are applied in addition to regular augmentations. A visual test survey showed that experts cannot clearly distinguish between real and fully generated images. Using the proposed generative augmentations, segmentation robustness was increased when training on an internal dataset and testing on an external dataset with an improvement of over 20 millimeters in Hausdorff distance. Additionally, the limits of agreement for automatic ejection fraction estimation improved by up to 20% of absolute ejection fraction value on out of distribution cases. These improvements come exclusively from the increased variation of the training data using the generative augmentations, without modifying the underlying machine learning model. The augmentation tool is available as an open source Python library at https://github.com/GillesVanDeVyver/EchoGAINS.
中文: 本研究利用扩散模型生成多样化心脏超声图像,无需额外标注即可提升分割模型的鲁棒性和泛化能力,在外部分割测试中实现了显著性能提升。
English: This study uses diffusion models to generate diverse cardiac ultrasound images, enhancing segmentation model robustness and generalization without additional annotations, achieving significant improvements in external dataset performance.

Authors:Mingjie Wu, Chenggui Yang, Huihua Wang, Chen Xue, Yibo Wang, Haoyu Wang, Yansong Wang, Can Peng, Yuqi Han, Ruoyu Li, Lijun Yun, Zaiqing Chen, Yuelong Xia
Title: WalnutData: A UAV Remote Sensing Dataset of Green Walnuts and Model Evaluation
Abstract:
The UAV technology is gradually maturing and can provide extremely powerful support for smart agriculture and precise monitoring. Currently, there is no dataset related to green walnuts in the field of agricultural computer vision. Thus, in order to promote the algorithm design in the field of agricultural computer vision, we used UAV to collect remote-sensing data from 8 walnut sample plots. Considering that green walnuts are subject to various lighting conditions and occlusion, we constructed a large-scale dataset with a higher-granularity of target features - WalnutData. This dataset contains a total of 30,240 images and 706,208 instances, and there are 4 target categories: being illuminated by frontal light and unoccluded (A1), being backlit and unoccluded (A2), being illuminated by frontal light and occluded (B1), and being backlit and occluded (B2). Subsequently, we evaluated many mainstream algorithms on WalnutData and used these evaluation results as the baseline standard. The dataset and all evaluation results can be obtained at https://github.com/1wuming/WalnutData.
Chinese: 本研究推出了WalnutData数据集,通过无人机采集了30,240张图像、706,208个青核桃实例,涵盖不同光照与遮挡条件,为农业计算机视觉算法提供了评估基准。
English: This study introduces WalnutData, a large-scale UAV-collected dataset of 30,240 images with 706,208 instances of green walnuts under varying light and occlusion conditions, establishing evaluation baselines for agricultural computer vision algorithms.

Authors:Meng Lou, Yizhou Yu
Title: OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
Abstract:
Top-down attention plays a crucial role in the human vision system, wherein the brain initially obtains a rough overview of a scene to discover salient cues (i.e., overview first), followed by a more careful finer-grained examination (i.e., look closely next). However, modern ConvNets remain confined to a pyramid structure that successively downsamples the feature map for receptive field expansion, neglecting this crucial biomimetic principle. We present OverLoCK, the first pure ConvNet backbone architecture that explicitly incorporates a top-down attention mechanism. Unlike pyramid backbone networks, our design features a branched architecture with three synergistic sub-networks: 1) a Base-Net that encodes low/mid-level features; 2) a lightweight Overview-Net that generates dynamic top-down attention through coarse global context modeling (i.e., overview first); and 3) a robust Focus-Net that performs finer-grained perception guided by top-down attention (i.e., look closely next). To fully unleash the power of top-down attention, we further propose a novel context-mixing dynamic convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even when the input resolution increases, addressing critical limitations in existing convolutions. Our OverLoCK exhibits a notable performance improvement over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2%, significantly surpassing ConvNeXt-B while using only around one-third of the FLOPs/parameters. On object detection, our OverLoCK-S clearly surpasses MogaNet-B by 1% in AP^b. On semantic segmentation, our OverLoCK-T remarkably improves UniRepLKNet-T by 1.7% in mIoU. Code is publicly available at https://github.com/LMMMEng/OverLoCK.
中文摘要:OverLoCK是首个采用分支架构模拟人类自上而下注意力的纯卷积网络,通过三个协同子网络实现"先概览后细察"的机制,在多项视觉任务中显著超越现有方法且计算效率更高。
English Summary: OverLoCK is the first pure ConvNet backbone that mimics human top-down attention through a branched architecture with three sub-networks, achieving superior performance across multiple vision tasks while being more computationally efficient than existing methods.

Authors:Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang
Title: LongRoPE2: Near-Lossless LLM Context Window Scaling
Abstract:
LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length. Code will be available at https://github.com/microsoft/LongRoPE.
中文: LongRoPE2是一种新颖方法,通过进化搜索的RoPE缩放算法和混合上下文窗口训练,在保持原始短上下文性能的同时,将预训练大语言模型的有效上下文窗口扩展至目标长度。
English: LongRoPE2 is a novel method that extends LLMs' effective context window to target lengths while maintaining original short-context performance through addressing insufficient RoPE dimension training with evolutionary search-based rescaling and mixed context window fine-tuning.

Authors:Haochen Sun, Shuwen Zhang, Lujie Niu, Lei Ren, Hao Xu, Hao Fu, Fangkun Zhao, Caixia Yuan, Xiaojie Wang
Title: Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents
Abstract:
Large Language Models (LLMs) based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-based Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks in two novel ways. First, it provides a multi-agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments with 13 popular LLMs and show that, while the LLMs exhibit a strong ability in goal interpretation, there are significant shortcomings in active collaboration and continuous adaptation, which are critical for efficiently fulfilling complex tasks. Notably, we highlight the strengths and weaknesses of LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-source benchmark. The environments, 30 open-ended tasks, and the evaluation package are publicly available at https://github.com/YusaeMeow/Collab-Overcooked.
中文: 本文提出基于Overcooked-AI开发的Collab-Overcooked多智能体基准测试,通过开放式协作任务和过程导向评估指标,揭示大语言模型虽在目标理解表现优异,但在主动协作和持续适应方面存在明显不足。
English: This paper introduces Collab-Overcooked, a novel LLM-based multi-agent benchmark built on Overcooked-AI that features enhanced collaborative tasks and process-oriented evaluation metrics, revealing LLMs' strengths in goal interpretation but deficiencies in active collaboration and adaptation.

Authors:Yejun Zhang, Shuzhe Wang, Juho Kannala
Title: A2-GNN: Angle-Annular GNN for Visual Descriptor-free Camera Relocalization
Abstract:
Visual localization involves estimating the 6-degree-of-freedom (6-DoF) camera pose within a known scene. A critical step in this process is identifying pixel-to-point correspondences between 2D query images and 3D models. Most advanced approaches currently rely on extensive visual descriptors to establish these correspondences, facing challenges in storage, privacy issues and model maintenance. Direct 2D-3D keypoint matching without visual descriptors is becoming popular as it can overcome those challenges. However, existing descriptor-free methods suffer from low accuracy or heavy computation. Addressing this gap, this paper introduces the Angle-Annular Graph Neural Network (A2-GNN), a simple approach that efficiently learns robust geometric structural representations with annular feature extraction. Specifically, this approach clusters neighbors and embeds each group's distance information and angle as supplementary information to capture local structures. Evaluation on matching and visual localization datasets demonstrates that our approach achieves state-of-the-art accuracy with low computational overhead among visual description-free methods. Our code will be released on https://github.com/YejunZhang/a2-gnn.
中文: 本文提出的A2-GNN方法通过几何结构表征实现无描述符的视觉定位,在保证高精度的同时显著降低计算开销,性能优于现有同类方法。
English: This paper introduces A2-GNN, a descriptor-free method that uses geometric structural representations for efficient and accurate visual localization, achieving top performance with low computational costs.

Authors:Xuyang Wei, Chunlin Tian, Li Li
Title: AsymLoRA: Harmonizing Data Conflicts and Commonalities in MLLMs
Abstract:
Effective instruction fine-tuning on diverse image-text datasets is crucial for developing a versatile Multimodal Large Language Model (MLLM), where dataset composition dictates the model's adaptability across multimodal tasks. However, complex datasets often contain inherent conflicts -- stemming from modality-specific optimization objectives -- and latent commonalities that enable cross-task transfer, which most existing approaches handle separately. To bridge this gap, we introduce AsymLoRA, a parameter-efficient tuning framework that unifies knowledge modularization and cross-modal coordination via asymmetric LoRA: task-specific low-rank projections (matrix B) that preserve distinct adaptation pathways for conflicting objectives, and a shared projection (matrix A) that consolidates cross-modal commonalities. Extensive evaluations demonstrate that AsymLoRA consistently surpasses both vanilla LoRA, which captures only commonalities, and LoRA-MoE, which focuses solely on conflicts, achieving superior model performance and system efficiency across diverse benchmarks.\href{Code}{https://github.com/Clin0212/HydraLoRA/blob/main/MLLM-HydraLoRA/README.md}.
中文摘要:AsymLoRA框架通过非对称低秩自适应技术,在解决多模态任务冲突目标的同时有效利用跨模态共性,其综合性能与系统效率均优于现有方法。
English Summary: The AsymLoRA framework enhances multimodal large language models by using asymmetric low-rank adaptation to simultaneously manage conflicting objectives and leverage cross-modal commonalities, outperforming existing methods in both performance and efficiency.

Authors:Guannan Lai, Yujie Li, Xiangkun Wang, Junbo Zhang, Tianrui Li, Xin Yang
Title: Order-Robust Class Incremental Learning: Graph-Driven Dynamic Similarity Grouping
Abstract:
Class Incremental Learning (CIL) aims to enable models to learn new classes sequentially while retaining knowledge of previous ones. Although current methods have alleviated catastrophic forgetting (CF), recent studies highlight that the performance of CIL models is highly sensitive to the order of class arrival, particularly when sequentially introduced classes exhibit high inter-class similarity. To address this critical yet understudied challenge of class order sensitivity, we first extend existing CIL frameworks through theoretical analysis, proving that grouping classes with lower pairwise similarity during incremental phases significantly improves model robustness to order variations. Building on this insight, we propose Graph-Driven Dynamic Similarity Grouping (GDDSG), a novel method that employs graph coloring algorithms to dynamically partition classes into similarity-constrained groups. Each group trains an isolated CIL sub-model and constructs meta-features for class group identification. Experimental results demonstrate that our method effectively addresses the issue of class order sensitivity while achieving optimal performance in both model accuracy and anti-forgetting capability. Our code is available at https://github.com/AIGNLAI/GDDSG.
中文: 本文提出图驱动的动态相似性分组方法(GDDSG),通过图着色算法在增量学习中动态划分相似性约束的类别组,有效解决了类别顺序敏感性问题,同时保持了高精度和抗遗忘能力。
English: This paper introduces Graph-Driven Dynamic Similarity Grouping (GDDSG), a novel method that uses graph coloring to dynamically group classes by similarity during incremental learning, effectively mitigating class order sensitivity while maintaining high accuracy and anti-forgetting performance.

Authors:Fan Yang, Dongsheng Luo, Wenrui Chen, Jiacheng Lin, Junjie Cai, Kailun Yang, Zhiyong Li, Yaonan Wang
Title: Multi-Keypoint Affordance Representation for Functional Dexterous Grasping
Abstract:
Functional dexterous grasping requires precise hand-object interaction, going beyond simple gripping. Existing affordance-based methods primarily predict coarse interaction regions and cannot directly constrain the grasping posture, leading to a disconnection between visual perception and manipulation. To address this issue, we propose a multi-keypoint affordance representation for functional dexterous grasping, which directly encodes task-driven grasp configurations by localizing functional contact points. Our method introduces Contact-guided Multi-Keypoint Affordance (CMKA), leveraging human grasping experience images for weak supervision combined with Large Vision Models for fine affordance feature extraction, achieving generalization while avoiding manual keypoint annotations. Additionally, we present a Keypoint-based Grasp matrix Transformation (KGT) method, ensuring spatial consistency between hand keypoints and object contact points, thus providing a direct link between visual perception and dexterous grasping actions. Experiments on public real-world FAH datasets, IsaacGym simulation, and challenging robotic tasks demonstrate that our method significantly improves affordance localization accuracy, grasp consistency, and generalization to unseen tools and tasks, bridging the gap between visual affordance learning and dexterous robotic manipulation. The source code and demo videos are publicly available at https://github.com/PopeyePxx/MKA.
中文摘要:本文提出一种多关键点可供性表示方法,通过功能接触点定位直接编码抓取配置,结合人类抓取经验与视觉模型特征提取,有效连接视觉感知与灵巧操作,显著提升了抓取精度与泛化能力。
English Summary: This paper introduces a multi-keypoint affordance representation method that bridges visual perception and dexterous manipulation by directly encoding grasp configurations through functional contact point localization, achieving improved accuracy and generalization in robotic grasping tasks.

Authors:Huazheng Wang, Yongcheng Jing, Haifeng Sun, Yingjie Wang, Jingyu Wang, Jianxin Liao, Dacheng Tao
Title: Erasing Without Remembering: Implicit Knowledge Forgetting in Large Language Models
Abstract:
In this paper, we investigate knowledge forgetting in large language models with a focus on its generalisation--ensuring that models forget not only specific training samples but also related implicit knowledge. To this end, we begin by identifying a broader unlearning scope that includes both target data and logically associated samples, including rephrased, subject-replaced, one-hop reasoned, and relation-reversed data. To rigorously evaluate generalisation, we introduce UGBench, the first comprehensive benchmark specifically designed to assess the unlearning of in-scope implicit knowledge covering 13 state-of-the-art methods across three datasets. UGBench reveals that unlearned models can still recall paraphrased answers and retain target facts in intermediate layers. This motivates us to take a preliminary step toward more generalised implicit knowledge forgetting by proposing PerMU, a novel probability perturbation-based unlearning paradigm. PerMU simulates adversarial unlearning samples to eliminate fact-related tokens from the logit distribution, collectively reducing the probabilities of all answer-associated tokens. Experiments are conducted on a diverse range of datasets, including TOFU, Harry Potter, ZsRE, WMDP, and MUSE, using models ranging from 1.3B to 13B in scale. The results demonstrate that PerMU delivers up to a 50.40% improvement in unlearning vanilla target data while maintaining a 40.73% boost in forgetting implicit knowledge. Our code can be found in https://github.com/MaybeLizzy/UGBench.
中文摘要:本文提出PerMU这一基于概率扰动的新型遗忘方法,通过模拟对抗性遗忘样本来消除显性目标数据及相关隐性知识,从而显著提升大语言模型的广义知识遗忘能力。
English Summary: This paper introduces PerMU, a novel probability perturbation-based unlearning method designed to enhance generalized knowledge forgetting in large language models by eliminating both explicit target data and related implicit knowledge through adversarial sample simulation.

Authors:Huazheng Wang, Yongcheng Jing, Haifeng Sun, Yingjie Wang, Jingyu Wang, Jianxin Liao, Dacheng Tao
Title: Erasing Without Remembering: Implicit Knowledge Forgetting in Large Language Models
Abstract:
In this paper, we investigate knowledge forgetting in large language models with a focus on its generalisation, ensuring that models forget not only specific training samples but also related implicit knowledge. To this end, we begin by identifying a broader unlearning scope that includes both target data and logically associated samples, including rephrased, subject-replaced, relation-reversed, and one-hop reasoned data. We then conduct a rigorous evaluation of 15 state-of-the-art methods across three datasets, revealing that unlearned models still recall paraphrased answers and retain target facts in their intermediate layers. This motivates us to take a preliminary step toward more generalised implicit knowledge forgetting by proposing PerMU, a novel probability perturbation-based unlearning paradigm. PerMU simulates adversarial unlearning samples to eliminate fact-related tokens from the logit distribution, collectively reducing the probabilities of all answer-associated tokens. Experiments are conducted on a diverse range of datasets, including TOFU, Harry Potter, ZsRE, WMDP, and MUSE, using models ranging from 1.3B to 13B in scale. The results demonstrate that PerMU delivers up to a 50.40% improvement in unlearning vanilla target data while maintaining a 40.73% boost in forgetting implicit knowledge. Our code can be found in https://github.com/MaybeLizzy/PERMU.
中文摘要:本文提出PerMU这一基于概率扰动的新型遗忘方法,通过模拟对抗性遗忘样本来消除显性目标数据及相关隐性知识,从而显著提升大语言模型的广义知识遗忘能力。
English Summary: This paper introduces PerMU, a novel probability perturbation-based unlearning method designed to enhance generalized knowledge forgetting in large language models by eliminating both explicit target data and related implicit knowledge through adversarial sample simulation.

Authors:Quanxing Zha, Xin Liu, Shu-Juan Peng, Yiu-ming Cheung, Xing Xu, Nannan Wang
Title: ReCon: Enhancing True Correspondence Discrimination through Relation Consistency for Robust Noisy Correspondence Learning
Abstract:
Can we accurately identify the true correspondences from multimodal datasets containing mismatched data pairs? Existing methods primarily emphasize the similarity matching between the representations of objects across modalities, potentially neglecting the crucial relation consistency within modalities that are particularly important for distinguishing the true and false correspondences. Such an omission often runs the risk of misidentifying negatives as positives, thus leading to unanticipated performance degradation. To address this problem, we propose a general Relation Consistency learning framework, namely ReCon, to accurately discriminate the true correspondences among the multimodal data and thus effectively mitigate the adverse impact caused by mismatches. Specifically, ReCon leverages a novel relation consistency learning to ensure the dual-alignment, respectively of, the cross-modal relation consistency between different modalities and the intra-modal relation consistency within modalities. Thanks to such dual constrains on relations, ReCon significantly enhances its effectiveness for true correspondence discrimination and therefore reliably filters out the mismatched pairs to mitigate the risks of wrong supervisions. Extensive experiments on three widely-used benchmark datasets, including Flickr30K, MS-COCO, and Conceptual Captions, are conducted to demonstrate the effectiveness and superiority of ReCon compared with other SOTAs. The code is available at: https://github.com/qxzha/ReCon.
中文摘要:现有方法因忽视模态内关系一致性而难以区分多模态数据中的真实对应关系,但提出的ReCon框架通过强制跨模态和模态内关系一致性的双重对齐,能准确识别真实匹配并过滤误配对。
English Summary: Existing multimodal matching methods often fail to distinguish true correspondences due to neglecting relation consistency within modalities, but the proposed ReCon framework addresses this by enforcing dual-alignment of cross-modal and intra-modal relation consistency to accurately identify true matches and filter mismatches.

Authors:Xinghao Wang, Feng Liu, Rui Su, Zhihui Wang, Lihua Fang, Lianqing Zhou, Lei Bai, Wanli Ouyang
Title: SeisMoLLM: Advancing Seismic Monitoring via Cross-modal Transfer with Pre-trained Large Language Model
Abstract:
Recent advances in deep learning have revolutionized seismic monitoring, yet developing a foundation model that performs well across multiple complex tasks remains challenging, particularly when dealing with degraded signals or data scarcity. This work presents SeisMoLLM, the first foundation model that utilizes cross-modal transfer for seismic monitoring, to unleash the power of large-scale pre-training from a large language model without requiring direct pre-training on seismic datasets. Through elaborate waveform tokenization and fine-tuning of pre-trained GPT-2 model, SeisMoLLM achieves state-of-the-art performance on the DiTing and STEAD datasets across five critical tasks: back-azimuth estimation, epicentral distance estimation, magnitude estimation, phase picking, and first-motion polarity classification. It attains 36 best results out of 43 task metrics and 12 top scores out of 16 few-shot generalization metrics, with many relative improvements ranging from 10% to 50%. In addition to its superior performance, SeisMoLLM maintains efficiency comparable to or even better than lightweight models in both training and inference. These findings establish SeisMoLLM as a promising foundation model for practical seismic monitoring and highlight cross-modal transfer as an exciting new direction for earthquake studies, showcasing the potential of advanced deep learning techniques to propel seismology research forward.
Chinese: SeisMoLLM作为首个利用预训练GPT-2进行跨模态迁移的地震监测基础模型,在五项关键任务中实现最优性能,无需直接地震数据预训练即展现出显著的效果提升和运行效率。
English: SeisMoLLM is a pioneering foundation model that leverages cross-modal transfer from a pre-trained GPT-2 to achieve state-of-the-art performance across five key seismic monitoring tasks, demonstrating significant improvements in accuracy and efficiency without requiring direct seismic data pre-training.

Authors:Yuan-Chih Yang, Hung-Hsuan Chen
Title: Dynamic DropConnect: Enhancing Neural Network Robustness through Adaptive Edge Dropping Strategies
Abstract:
Dropout and DropConnect are well-known techniques that apply a consistent drop rate to randomly deactivate neurons or edges in a neural network layer during training. This paper introduces a novel methodology that assigns dynamic drop rates to each edge within a layer, uniquely tailoring the dropping process without incorporating additional learning parameters. We perform experiments on synthetic and openly available datasets to validate the effectiveness of our approach. The results demonstrate that our method outperforms Dropout, DropConnect, and Standout, a classic mechanism known for its adaptive dropout capabilities. Furthermore, our approach improves the robustness and generalization of neural network training without increasing computational complexity. The complete implementation of our methodology is publicly accessible for research and replication purposes at https://github.com/ericabd888/Adjusting-the-drop-probability-in-DropConnect-based-on-the-magnitude-of-the-gradient/.
中文摘要:本文提出了一种新方法,可在神经网络层中为每条边动态调整丢弃率,相比Dropout和DropConnect等现有技术,该方法在不增加计算复杂度的前提下显著提升了模型的鲁棒性和泛化能力。
English Summary: This paper introduces a novel method that dynamically adjusts drop rates for each edge in neural network layers, outperforming existing techniques like Dropout and DropConnect in robustness and generalization without added computational cost.

Authors:Xiang Geng, Zhejian Lai, Jiajun Chen, Hao Yang, Shujian Huang
Title: Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation
Abstract:
Quality Estimation (QE) models evaluate the quality of machine translations without reference translations, serving as the reward models for the translation task. Due to the data scarcity, synthetic data generation has emerged as a promising solution. However, synthetic QE data often suffers from distribution shift, which can manifest as discrepancies between pseudo and real translations, or in pseudo labels that do not align with human preferences. To tackle this issue, we introduce DCSQE, a novel framework for alleviating distribution shift in synthetic QE data. To reduce the difference between pseudo and real translations, we employ the constrained beam search algorithm and enhance translation diversity through the use of distinct generation models. DCSQE uses references, i.e., translation supervision signals, to guide both the generation and annotation processes, enhancing the quality of token-level labels. DCSQE further identifies the shortest phrase covering consecutive error tokens, mimicking human annotation behavior, to assign the final phrase-level labels. Specially, we underscore that the translation model can not annotate translations of itself accurately. Extensive experiments demonstrate that DCSQE outperforms SOTA baselines like CometKiwi in both supervised and unsupervised settings. Further analysis offers insights into synthetic data generation that could benefit reward models for other tasks. The code is available at https://github.com/NJUNLP/njuqe.
中文: DCSQE是一种新颖框架,通过采用约束束搜索、增强翻译多样性以及利用参考译文指导生成和标注过程,有效缓解合成质量估计数据中的分布偏移问题,从而提升词级和短语级标签的质量。
English: DCSQE is a novel framework designed to mitigate distribution shift in synthetic Quality Estimation data by employing constrained beam search, enhancing translation diversity, and using references to guide generation and annotation, ultimately improving token- and phrase-level label quality.

Authors:Yung-Peng Hsu, Hung-Hsuan Chen
Title: Flexible Bivariate Beta Mixture Model: A Probabilistic Approach for Clustering Complex Data Structures
Abstract:
Clustering is essential in data analysis and machine learning, but traditional algorithms like $k$-means and Gaussian Mixture Models (GMM) often fail with nonconvex clusters. To address the challenge, we introduce the Flexible Bivariate Beta Mixture Model (FBBMM), which utilizes the flexibility of the bivariate beta distribution to handle diverse and irregular cluster shapes. Using the Expectation Maximization (EM) algorithm and Sequential Least Squares Programming (SLSQP) optimizer for parameter estimation, we validate FBBMM on synthetic and real-world datasets, demonstrating its superior performance in clustering complex data structures, offering a robust solution for big data analytics across various domains. We release the experimental code at https://github.com/yung-peng/MBMM-and-FBBMM.
中文摘要:灵活双变量贝塔混合模型(FBBMM)通过利用双变量贝塔分布的灵活性,有效处理非凸聚类问题,在合成和真实数据集上均展现出优于传统聚类算法的性能。
English Summary: The Flexible Bivariate Beta Mixture Model (FBBMM) overcomes limitations of traditional clustering methods by handling irregular cluster shapes through bivariate beta distributions and demonstrates superior performance on complex datasets.

Authors:Berken Utku Demirel, Christian Holz
Title: Shifting the Paradigm: A Diffeomorphism Between Time Series Data Manifolds for Achieving Shift-Invariancy in Deep Learning
Abstract:
Deep learning models lack shift invariance, making them sensitive to input shifts that cause changes in output. While recent techniques seek to address this for images, our findings show that these approaches fail to provide shift-invariance in time series, where the data generation mechanism is more challenging due to the interaction of low and high frequencies. Worse, they also decrease performance across several tasks. In this paper, we propose a novel differentiable bijective function that maps samples from their high-dimensional data manifold to another manifold of the same dimension, without any dimensional reduction. Our approach guarantees that samples -- when subjected to random shifts -- are mapped to a unique point in the manifold while preserving all task-relevant information without loss. We theoretically and empirically demonstrate that the proposed transformation guarantees shift-invariance in deep learning models without imposing any limits to the shift. Our experiments on six time series tasks with state-of-the-art methods show that our approach consistently improves the performance while enabling models to achieve complete shift-invariance without modifying or imposing restrictions on the model's topology. The source code is available on \href{https://github.com/eth-siplab/Shifting-the-Paradigm}{GitHub}.
Chinese Summary: 本文提出一种新颖的可微分双射函数,通过将样本映射到同维流形,使深度学习模型在不损失性能或改变架构的前提下,实现对时间序列数据的完全平移不变性。
English Summary: This paper introduces a novel differentiable bijective function that ensures deep learning models achieve complete shift-invariance for time series data without performance loss or architectural constraints.

Authors:Marco Pleines, Daniel Addis, David Rubinstein, Frank Zimmer, Mike Preuss, Peter Whidden
Title: Pokemon Red via Reinforcement Learning
Abstract:
Pokémon Red, a classic Game Boy JRPG, presents significant challenges as a testbed for agents, including multi-tasking, long horizons of tens of thousands of steps, hard exploration, and a vast array of potential policies. We introduce a simplistic environment and a Deep Reinforcement Learning (DRL) training methodology, demonstrating a baseline agent that completes an initial segment of the game up to completing Cerulean City. Our experiments include various ablations that reveal vulnerabilities in reward shaping, where agents exploit specific reward signals. We also discuss limitations and argue that games like Pokémon hold strong potential for future research on Large Language Model agents, hierarchical training algorithms, and advanced exploration methods. Source Code: https://github.com/MarcoMeter/neroRL/tree/poke_red
中文: 该研究为《宝可梦红》设计了一个简化环境与深度强化学习方法,展示了智能体在游戏初期的进展,揭示了奖励塑造的脆弱性,并强调了此类游戏在未来人工智能研究中的潜力。
English: The study presents a simplistic environment and Deep Reinforcement Learning method for Pokémon Red, demonstrating a baseline agent's initial game progress while revealing reward shaping vulnerabilities and highlighting the game's potential for future AI research.

Authors:Zhenyu Liu, Yunxin Li, Baotian Hu, Wenhan Luo, Yaowei Wang, Min Zhang
Title: Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents
Abstract:
To improve Multimodal Large Language Models' (MLLMs) ability to process images and complex instructions, researchers predominantly curate large-scale visual instruction tuning datasets, which are either sourced from existing vision tasks or synthetically generated using LLMs and image descriptions. However, they often suffer from critical flaws, including misaligned instruction-image pairs and low-quality images. Such issues hinder training efficiency and limit performance improvements, as models waste resources on noisy or irrelevant data with minimal benefit to overall capability. To address this issue, we propose a \textbf{Vi}sual-Centric \textbf{S}election approach via \textbf{A}gents Collaboration (ViSA), which centers on image quality assessment and image-instruction relevance evaluation. Specifically, our approach consists of 1) an image information quantification method via visual agents collaboration to select images with rich visual information, and 2) a visual-centric instruction quality assessment method to select high-quality instruction data related to high-quality images. Finally, we reorganize 80K instruction data from large open-source datasets. Extensive experiments demonstrate that ViSA outperforms or is comparable to current state-of-the-art models on seven benchmarks, using only 2.5\% of the original data, highlighting the efficiency of our data selection approach. Moreover, we conduct ablation studies to validate the effectiveness of each component of our method. The code is available at https://github.com/HITsz-TMG/ViSA.
中文摘要:研究者提出ViSA方法,通过视觉代理协作筛选高质量图像和相关指令来优化多模态大语言模型,仅用2.5%的数据即在七个基准测试中达到最优性能。
English Summary: Researchers propose ViSA, a visual-centric data selection method using agent collaboration to enhance MLLMs by filtering high-quality images and relevant instructions, achieving state-of-the-art performance with only 2.5% of data across seven benchmarks.

Authors:Nikolay Blagoev, Lydia Yiyu Chen, Oğuzhan Ersoy
Title: SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks
Abstract:
Data and pipeline parallelism are ubiquitous for training of Large Language Models (LLM) on distributed nodes. Driven by the need for cost-effective training, recent work explores efficient communication arrangement for end to end training. Motivated by LLM's resistance to layer skipping and layer reordering, in this paper, we explore stage (several consecutive layers) skipping in pipeline training, and challenge the conventional practice of sequential pipeline execution. We derive convergence and throughput constraints (guidelines) for pipelining with skipping and swapping pipeline stages. Based on these constraints, we propose SkipPipe, the first partial pipeline framework to reduce the end-to-end training time for LLMs while preserving the convergence. The core of SkipPipe is a path scheduling algorithm that optimizes the paths for individual microbatches and reduces idle time (due to microbatch collisions) on the distributed nodes, complying with the given stage skipping ratio. We extensively evaluate SkipPipe on LLaMa models from 500M to 8B parameters on up to 20 nodes. Our results show that SkipPipe reduces training iteration time by up to $55\%$ compared to full pipeline. Our partial pipeline training also improves resistance to layer omission during inference, experiencing a drop in perplexity of only $7\%$ when running only half the model. Our code is available at https://github.com/gensyn-ai/skippipe.
Chinese: SkipPipe提出了一种部分流水线框架,通过启用阶段跳过和路径调度来优化大型语言模型的训练效率,在保持收敛性的同时将迭代时间减少高达55%。
English: SkipPipe introduces a partial pipeline framework that optimizes training efficiency for Large Language Models by enabling stage skipping and path scheduling, reducing iteration time by up to 55% while maintaining convergence.

Authors:Yuhao Li, Mirana Claire Angel, Salman Khan, Yu Zhu, Jinqiu Sun, Yanning Zhang, Fahad Shahbaz Khan
Title: C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation
Abstract:
Trajectory-based motion control has emerged as an intuitive and efficient approach for controllable video generation. However, the existing trajectory-based approaches are usually limited to only generating the motion trajectory of the controlled object and ignoring the dynamic interactions between the controlled object and its surroundings. To address this limitation, we propose a Chain-of-Thought-based motion controller for controllable video generation, named C-Drag. Instead of directly generating the motion of some objects, our C-Drag first performs object perception and then reasons the dynamic interactions between different objects according to the given motion control of the objects. Specifically, our method includes an object perception module and a Chain-of-Thought-based motion reasoning module. The object perception module employs visual language models to capture the position and category information of various objects within the image. The Chain-of-Thought-based motion reasoning module takes this information as input and conducts a stage-wise reasoning process to generate motion trajectories for each of the affected objects, which are subsequently fed to the diffusion model for video synthesis. Furthermore, we introduce a new video object interaction (VOI) dataset to evaluate the generation quality of motion controlled video generation methods. Our VOI dataset contains three typical types of interactions and provides the motion trajectories of objects that can be used for accurate performance evaluation. Experimental results show that C-Drag achieves promising performance across multiple metrics, excelling in object motion control. Our benchmark, codes, and models will be available at https://github.com/WesLee88524/C-Drag-Official-Repo.
中文: 提出的C-Drag方法通过结合物体感知和思维链推理来建模物体间的动态交互,改进了可控视频生成技术,其性能优于仅关注单个物体运动轨迹的现有方法。
English: The proposed C-Drag method enhances controllable video generation by incorporating object perception and Chain-of-Thought reasoning to model dynamic interactions between objects, outperforming existing trajectory-based approaches that focus solely on individual object motion.

Authors:Nan An, Long Ma, Guangchao Han, Xin Fan, RIsheng Liu
Title: Striving for Faster and Better: A One-Layer Architecture with Auto Re-parameterization for Low-Light Image Enhancement
Abstract:
Deep learning-based low-light image enhancers have made significant progress in recent years, with a trend towards achieving satisfactory visual quality while gradually reducing the number of parameters and improving computational efficiency. In this work, we aim to delving into the limits of image enhancers both from visual quality and computational efficiency, while striving for both better performance and faster processing. To be concrete, by rethinking the task demands, we build an explicit connection, i.e., visual quality and computational efficiency are corresponding to model learning and structure design, respectively. Around this connection, we enlarge parameter space by introducing the re-parameterization for ample model learning of a pre-defined minimalist network (e.g., just one layer), to avoid falling into a local solution. To strengthen the structural representation, we define a hierarchical search scheme for discovering a task-oriented re-parameterized structure, which also provides powerful support for efficiency. Ultimately, this achieves efficient low-light image enhancement using only a single convolutional layer, while maintaining excellent visual quality. Experimental results show our sensible superiority both in quality and efficiency against recently-proposed methods. Especially, our running time on various platforms (e.g., CPU, GPU, NPU, DSP) consistently moves beyond the existing fastest scheme. The source code will be released at https://github.com/vis-opt-group/AR-LLIE.
中文: 本研究通过结合重参数化和分层搜索方案,提出了一种高效的低光图像增强方法,在多个平台上均实现了优于现有方法的视觉质量和计算速度。
English: This study introduces an efficient low-light image enhancement method that achieves superior visual quality and computational speed by combining re-parameterization with a hierarchical search scheme, outperforming existing approaches across multiple platforms.

Authors:Chunyang Cheng, Tianyang Xu, Zhenhua Feng, Xiaojun Wu, ZhangyongTang, Hui Li, Zeyang Zhang, Sara Atito, Muhammad Awais, Josef Kittler
Title: One Model for ALL: Low-Level Task Interaction Is a Key to Task-Agnostic Image Fusion
Abstract:
Advanced image fusion methods mostly prioritise high-level missions, where task interaction struggles with semantic gaps, requiring complex bridging mechanisms. In contrast, we propose to leverage low-level vision tasks from digital photography fusion, allowing for effective feature interaction through pixel-level supervision. This new paradigm provides strong guidance for unsupervised multimodal fusion without relying on abstract semantics, enhancing task-shared feature learning for broader applicability. Owning to the hybrid image features and enhanced universal representations, the proposed GIFNet supports diverse fusion tasks, achieving high performance across both seen and unseen scenarios with a single model. Uniquely, experimental results reveal that our framework also supports single-modality enhancement, offering superior flexibility for practical applications. Our code will be available at https://github.com/AWCXV/GIFNet.
中文: 提出的GIFNet利用低层视觉任务和像素级监督,实现了无需抽象语义的无监督多模态融合,在多种任务中表现优异,并能支持单模态增强,具有广泛适用性。
English: The proposed GIFNet leverages low-level vision tasks with pixel-level supervision to enable effective unsupervised multimodal fusion, achieving high performance across diverse tasks and supporting single-modality enhancement without relying on abstract semantics.

Authors:Xiaofan Li, Xin Tan, Zhuo Chen, Zhizhong Zhang, Ruixin Zhang, Rizen Guo, Guannan Jiang, Yulong Chen, Yanyun Qu, Lizhuang Ma, Yuan Xie
Title: One-for-More: Continual Diffusion Model for Anomaly Detection
Abstract:
With the rise of generative models, there is a growing interest in unifying all tasks within a generative framework. Anomaly detection methods also fall into this scope and utilize diffusion models to generate or reconstruct normal samples when given arbitrary anomaly images. However, our study found that the diffusion model suffers from severe ``faithfulness hallucination'' and ``catastrophic forgetting'', which can't meet the unpredictable pattern increments. To mitigate the above problems, we propose a continual diffusion model that uses gradient projection to achieve stable continual learning. Gradient projection deploys a regularization on the model updating by modifying the gradient towards the direction protecting the learned knowledge. But as a double-edged sword, it also requires huge memory costs brought by the Markov process. Hence, we propose an iterative singular value decomposition method based on the transitive property of linear representation, which consumes tiny memory and incurs almost no performance loss. Finally, considering the risk of ``over-fitting'' to normal images of the diffusion model, we propose an anomaly-masked network to enhance the condition mechanism of the diffusion model. For continual anomaly detection, ours achieves first place in 17/18 settings on MVTec and VisA. Code is available at https://github.com/FuNz-0/One-for-More
中文: 本研究针对扩散模型在异常检测中的问题,提出了采用梯度投影的持续扩散模型和异常掩码网络,在持续异常检测基准中取得了领先性能。
English: This study addresses issues in diffusion-based anomaly detection by proposing a continual diffusion model with gradient projection and an anomaly-masked network, achieving top performance in most continual anomaly detection benchmarks.

Authors:Long Xu, Kaixin Chai, Boyuan An, Jiaxiang Gan, Shuhang Ji, Zhenyu Hou, Qianhao Wang, Yuan Zhou, Xiaoying Li, Junxiao Lin, Zhichao Han, Chao Xu, Yanjun Cao, Fei Gao
Title: Tracailer: An Efficient Trajectory Planner for Tractor-Trailer Robots in Unstructured Environments
Abstract:
The tractor-trailer robot consists of a drivable tractor and one or more non-drivable trailers connected via hitches. Compared to typical car-like robots, the addition of trailers provides greater transportation capability. However, this also complicates motion planning due to the robot's complex kinematics, high-dimensional state space, and deformable structure. To efficiently plan safe, time-optimal trajectories that adhere to the kinematic constraints of the robot and address the challenges posed by its unique features, this paper introduces a lightweight, compact, and high-order smooth trajectory representation for tractor-trailer robots. Based on it, we design an efficiently solvable spatial-temporal trajectory optimization problem. To deal with deformable structures, which leads to difficulties in collision avoidance, we fully leverage the collisionfree regions of the environment, directly applying deformations to trajectories in continuous space. This approach not requires constructing safe regions from the environment using convex approximations through collision-free seed points before each optimization, avoiding the loss of the solution space, thus reducing the dependency of the optimization on initial values. Moreover, a multi-terminal fast path search algorithm is proposed to generate the initial values for optimization. Extensive simulation experiments demonstrate that our approach achieves severalfold improvements in efficiency compared to existing algorithms, while also ensuring lower curvature and trajectory duration. Real-world experiments involving the transportation, loading and unloading of goods in both indoor and outdoor scenarios further validate the effectiveness of our method. The source code is accessible at https://github.com/Tracailer/Tracailer.
中文: 本文针对拖拉机-拖车机器人提出了一种轻量级高阶平滑轨迹优化方法,通过利用环境中的无碰撞区域和连续空间变形技术,高效处理了复杂运动学和可变形结构带来的挑战,在仿真和实际实验中实现了效率的显著提升和更平滑的轨迹规划。
English: This paper introduces a lightweight, high-order smooth trajectory optimization method for tractor-trailer robots that efficiently handles complex kinematics and deformable structures by leveraging collision-free regions and continuous space deformation, achieving significant efficiency gains and smoother trajectories in simulations and real-world experiments.

Authors:Vidhi Lalchand, Anna-Christina Eilers
Title: Shared Stochastic Gaussian Process Latent Variable Models: A Multi-modal Generative Model for Quasar Spectra
Abstract:
This work proposes a scalable probabilistic latent variable model based on Gaussian processes (Lawrence, 2004) in the context of multiple observation spaces. We focus on an application in astrophysics where data sets typically contain both observed spectral features and scientific properties of astrophysical objects such as galaxies or exoplanets. In our application, we study the spectra of very luminous galaxies known as quasars, along with their properties, such as the mass of their central supermassive black hole, accretion rate, and luminosity-resulting in multiple observation spaces. A single data point is then characterized by different classes of observations, each with different likelihoods. Our proposed model extends the baseline stochastic variational Gaussian process latent variable model (GPLVM) introduced by Lalchand et al. (2022) to this setting, proposing a seamless generative model where the quasar spectra and scientific labels can be generated simultaneously using a shared latent space as input to different sets of Gaussian process decoders, one for each observation space. Additionally, this framework enables training in a missing data setting where a large number of dimensions per data point may be unknown or unobserved. We demonstrate high-fidelity reconstructions of the spectra and scientific labels during test-time inference and briefly discuss the scientific interpretations of the results, along with the significance of such a generative model.
本研究提出了一种基于高斯过程的可扩展概率潜变量模型,通过共享潜空间处理多个观测空间,能够同时生成类星体光谱及其科学属性,并在训练中有效应对数据缺失问题。
This study introduces a scalable probabilistic latent variable model using Gaussian processes to handle multiple observation spaces, enabling simultaneous generation of quasar spectra and scientific properties through shared latent representations and accommodating missing data during training.

Authors:Zixuan Weng, Xiaolong Jin, Jinyuan Jia, Xiangyu Zhang
Title: Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
Abstract:
Ensuring AI safety is crucial as large language models become increasingly integrated into real-world applications. A key challenge is jailbreak, where adversarial prompts bypass built-in safeguards to elicit harmful disallowed outputs. Inspired by psychological foot-in-the-door principles, we introduce FITD,a novel multi-turn jailbreak method that leverages the phenomenon where minor initial commitments lower resistance to more significant or more unethical transgressions. Our approach progressively escalates the malicious intent of user queries through intermediate bridge prompts and aligns the model's response by itself to induce toxic responses. Extensive experimental results on two jailbreak benchmarks demonstrate that FITD achieves an average attack success rate of 94% across seven widely used models, outperforming existing state-of-the-art methods. Additionally, we provide an in-depth analysis of LLM self-corruption, highlighting vulnerabilities in current alignment strategies and emphasizing the risks inherent in multi-turn interactions. The code is available at https://github.com/Jinxiaolong1129/Foot-in-the-door-Jailbreak.
中文摘要:FITD是一种受心理学原理启发的新型多轮越狱方法,通过逐步升级恶意查询来绕过AI安全防护,实现了94%的攻击成功率,揭示了当前对齐策略中的漏洞。
English Summary: FITD is a novel multi-turn jailbreak method inspired by psychological principles that progressively escalates malicious queries to bypass AI safeguards, achieving a 94% attack success rate and exposing vulnerabilities in current alignment strategies.

Authors:Jiacheng Ye, Zhenyu Wu, Jiahui Gao, Zhiyong Wu, Xin Jiang, Zhenguo Li, Lingpeng Kong
Title: Implicit Search via Discrete Diffusion: A Study on Chess
Abstract:
In the post-AlphaGo era, there has been a renewed interest in search techniques such as Monte Carlo Tree Search (MCTS), particularly in their application to Large Language Models (LLMs). This renewed attention is driven by the recognition that current next-token prediction models often lack the ability for long-term planning. Is it possible to instill search-like abilities within the models to enhance their planning abilities without relying on explicit search? We propose DiffuSearch , a model that does \textit{implicit search} by looking into the future world via discrete diffusion modeling. We instantiate DiffuSearch on a classical board game, Chess, where explicit search is known to be essential. Through extensive controlled experiments, we show DiffuSearch outperforms both the searchless and explicit search-enhanced policies. Specifically, DiffuSearch outperforms the one-step policy by 19.2% and the MCTS-enhanced policy by 14% on action accuracy. Furthermore, DiffuSearch demonstrates a notable 30% enhancement in puzzle-solving abilities compared to explicit search-based policies, along with a significant 540 Elo increase in game-playing strength assessment. These results indicate that implicit search via discrete diffusion is a viable alternative to explicit search over a one-step policy. All codes are publicly available at \href{https://github.com/HKUNLP/DiffuSearch}{https://github.com/HKUNLP/DiffuSearch}.
中文: 提出的DiffuSearch模型通过离散扩散实现隐式搜索,在象棋实验中显著提升了大型语言模型的规划能力,其表现优于无搜索和显式搜索方法。
English: The proposed DiffuSearch model employs discrete diffusion to perform implicit search, significantly enhancing planning capabilities in LLMs and outperforming both searchless and explicit search methods in chess experiments.

Authors:Aayush Dhakal, Srikumar Sastry, Subash Khanal, Adeel Ahmad, Eric Xing, Nathan Jacobs
Title: RANGE: Retrieval Augmented Neural Fields for Multi-Resolution Geo-Embeddings
Abstract:
The choice of representation for geographic location significantly impacts the accuracy of models for a broad range of geospatial tasks, including fine-grained species classification, population density estimation, and biome classification. Recent works like SatCLIP and GeoCLIP learn such representations by contrastively aligning geolocation with co-located images. While these methods work exceptionally well, in this paper, we posit that the current training strategies fail to fully capture the important visual features. We provide an information-theoretic perspective on why the resulting embeddings from these methods discard crucial visual information that is important for many downstream tasks. To solve this problem, we propose a novel retrieval-augmented strategy called RANGE. We build our method on the intuition that the visual features of a location can be estimated by combining the visual features from multiple similar-looking locations. We evaluate our method across a wide variety of tasks. Our results show that RANGE outperforms the existing state-of-the-art models with significant margins in most tasks. We show gains of up to 13.1% on classification tasks and 0.145 $R^2$ on regression tasks. All our code and models will be made available at: https://github.com/mvrl/RANGE.
中文: 本文提出RANGE方法,通过整合相似地理位置的多重视觉特征来改进地理空间表示学习,在多项任务中显著超越现有最优模型。
English: The paper introduces RANGE, a retrieval-augmented strategy that enhances geospatial representation learning by combining visual features from similar locations, achieving significant performance gains over existing methods across various tasks.

Authors:Hugo Lyons Keenan, Sarah Erfani, Christopher Leckie
Title: HALO: Robust Out-of-Distribution Detection via Joint Optimisation
Abstract:
Effective out-of-distribution (OOD) detection is crucial for the safe deployment of machine learning models in real-world scenarios. However, recent work has shown that OOD detection methods are vulnerable to adversarial attacks, potentially leading to critical failures in high-stakes applications. This discovery has motivated work on robust OOD detection methods that are capable of maintaining performance under various attack settings. Prior approaches have made progress on this problem but face a number of limitations: often only exhibiting robustness to attacks on OOD data or failing to maintain strong clean performance. In this work, we adapt an existing robust classification framework, TRADES, extending it to the problem of robust OOD detection and discovering a novel objective function. Recognising the critical importance of a strong clean/robust trade-off for OOD detection, we introduce an additional loss term which boosts classification and detection performance. Our approach, called HALO (Helper-based AdversariaL OOD detection), surpasses existing methods and achieves state-of-the-art performance across a number of datasets and attack settings. Extensive experiments demonstrate an average AUROC improvement of 3.15 in clean settings and 7.07 under adversarial attacks when compared to the next best method. Furthermore, HALO exhibits resistance to transferred attacks, offers tuneable performance through hyperparameter selection, and is compatible with existing OOD detection frameworks out-of-the-box, leaving open the possibility of future performance gains. Code is available at: https://github.com/hugo0076/HALO
中文: HALO方法通过引入新型损失函数扩展了TRADES框架,在多个数据集上实现了最先进的鲁棒分布外检测性能,在正常和对抗攻击场景下均展现出显著提升。
English: The HALO method extends the TRADES framework to robust out-of-distribution detection by introducing a novel loss term, achieving state-of-the-art performance with significant improvements in both clean and adversarial settings across multiple datasets.

Authors:Xingyu Qiu, Mengying Yang, Xinghua Ma, Fanding Li, Dong Liang, Gongning Luo, Wei Wang, Kuanquan Wang, Shuo Li
Title: Finding Local Diffusion Schrödinger Bridge using Kolmogorov-Arnold Network
Abstract:
In image generation, Schrödinger Bridge (SB)-based methods theoretically enhance the efficiency and quality compared to the diffusion models by finding the least costly path between two distributions. However, they are computationally expensive and time-consuming when applied to complex image data. The reason is that they focus on fitting globally optimal paths in high-dimensional spaces, directly generating images as next step on the path using complex networks through self-supervised training, which typically results in a gap with the global optimum. Meanwhile, most diffusion models are in the same path subspace generated by weights $f_A(t)$ and $f_B(t)$, as they follow the paradigm ($x_t = f_A(t)x_{Img} + f_B(t)ε$). To address the limitations of SB-based methods, this paper proposes for the first time to find local Diffusion Schrödinger Bridges (LDSB) in the diffusion path subspace, which strengthens the connection between the SB problem and diffusion models. Specifically, our method optimizes the diffusion paths using Kolmogorov-Arnold Network (KAN), which has the advantage of resistance to forgetting and continuous output. The experiment shows that our LDSB significantly improves the quality and efficiency of image generation using the same pre-trained denoising network and the KAN for optimising is only less than 0.1MB. The FID metric is reduced by more than 15\%, especially with a reduction of 48.50\% when NFE of DDIM is $5$ for the CelebA dataset. Code is available at https://github.com/PerceptionComputingLab/LDSB.
中文: 本文提出局部扩散薛定谔桥方法,在扩散路径子空间内利用科尔莫戈罗夫-阿诺德网络优化路径,以极小的计算开销显著提升图像生成质量与效率。
English: This paper introduces Local Diffusion Schrödinger Bridges (LDSB), a method that optimizes diffusion paths within a constrained subspace using Kolmogorov-Arnold Networks to significantly enhance image generation quality and efficiency while minimizing computational costs.

Authors:Jinhao Pan, Chahat Raj, Ziyu Yao, Ziwei Zhu
Title: What's Not Said Still Hurts: A Description-Based Evaluation Framework for Measuring Social Bias in LLMs
Abstract:
Large Language Models (LLMs) often exhibit social biases inherited from their training data. While existing benchmarks evaluate bias by term-based mode through direct term associations between demographic terms and bias terms, LLMs have become increasingly adept at avoiding biased responses, leading to seemingly low levels of bias. However, biases persist in subtler, contextually hidden forms that traditional benchmarks fail to capture. We introduce the Description-based Bias Benchmark (DBB), a novel dataset designed to assess bias at the semantic level that bias concepts are hidden within naturalistic, subtly framed contexts in real-world scenarios rather than superficial terms. We analyze six state-of-the-art LLMs, revealing that while models reduce bias in response at the term level, they continue to reinforce biases in nuanced settings. Data, code, and results are available at https://github.com/JP-25/Description-based-Bias-Benchmark.
Chinese: 描述性偏见基准(DBB)是一种新颖的数据集,旨在通过自然语境中隐藏的语义层面评估大型语言模型的偏见,发现尽管模型在表层术语上减少了偏见,但在微妙情境中仍持续强化偏见。
English: The Description-based Bias Benchmark (DBB) is a new dataset that uncovers subtle, contextually embedded biases in Large Language Models, which traditional term-based evaluations miss, revealing that models still reinforce biases in nuanced scenarios despite appearing less biased superficially.

Authors:Hannah Cyberey, Yangfeng Ji, David Evans
Title: Unsupervised Concept Vector Extraction for Bias Control in LLMs
Abstract:
Large language models (LLMs) are known to perpetuate stereotypes and exhibit biases. Various strategies have been proposed to mitigate these biases, but most work studies biases as a black-box problem without considering how concepts are represented within the model. We adapt techniques from representation engineering to study how the concept of "gender" is represented within LLMs. We introduce a new method that extracts concept representations via probability weighting without labeled data and efficiently selects a steering vector for measuring and manipulating the model's representation. We develop a projection-based method that enables precise steering of model predictions and demonstrate its effectiveness in mitigating gender bias in LLMs and show that it also generalizes to racial bias. Our code is available at: https://github.com/hannahxchen/gender-bias-steering
中文: 研究人员基于表征工程开发了一种投影方法,无需标注数据即可精确测量和操控大语言模型中的性别与种族偏见,有效缓解了刻板印象问题。
English: Researchers developed a projection-based method using representation engineering to precisely measure and manipulate gender and racial bias in large language models, effectively mitigating stereotypes without requiring labeled data.

Authors:Xinran Li, Xiaolu Wang, Chenjia Bai, Jun Zhang
Title: Exponential Topology-enabled Scalable Communication in Multi-agent Reinforcement Learning
Abstract:
In cooperative multi-agent reinforcement learning (MARL), well-designed communication protocols can effectively facilitate consensus among agents, thereby enhancing task performance. Moreover, in large-scale multi-agent systems commonly found in real-world applications, effective communication plays an even more critical role due to the escalated challenge of partial observability compared to smaller-scale setups. In this work, we endeavor to develop a scalable communication protocol for MARL. Unlike previous methods that focus on selecting optimal pairwise communication links-a task that becomes increasingly complex as the number of agents grows-we adopt a global perspective on communication topology design. Specifically, we propose utilizing the exponential topology to enable rapid information dissemination among agents by leveraging its small-diameter and small-size properties. This approach leads to a scalable communication protocol, named ExpoComm. To fully unlock the potential of exponential graphs as communication topologies, we employ memory-based message processors and auxiliary tasks to ground messages, ensuring that they reflect global information and benefit decision-making. Extensive experiments on large-scale cooperative benchmarks, including MAgent and Infrastructure Management Planning, demonstrate the superior performance and robust zero-shot transferability of ExpoComm compared to existing communication strategies. The code is publicly available at https://github.com/LXXXXR/ExpoComm.
中文摘要:本研究提出ExpoComm协议,通过采用指数图拓扑结构实现多智能体间高效信息传播,结合记忆消息处理器提升决策能力,在大规模协作任务中展现出优越性能和零样本迁移能力。
English Summary: The study introduces ExpoComm, a scalable communication protocol for multi-agent reinforcement learning that uses exponential graph topology to enhance information sharing and decision-making among agents, demonstrating superior performance in large-scale cooperative tasks.

Authors:Qijie Xu, Defang Chen, Jiawei Chen, Siwei Lyu, Can Wang
Title: Recent Advances on Generalizable Diffusion-generated Image Detection
Abstract:
The rise of diffusion models has significantly improved the fidelity and diversity of generated images. With numerous benefits, these advancements also introduce new risks. Diffusion models can be exploited to create high-quality Deepfake images, which poses challenges for image authenticity verification. In recent years, research on generalizable diffusion-generated image detection has grown rapidly. However, a comprehensive review of this topic is still lacking. To bridge this gap, we present a systematic survey of recent advances and classify them into two main categories: (1) data-driven detection and (2) feature-driven detection. Existing detection methods are further classified into six fine-grained categories based on their underlying principles. Finally, we identify several open challenges and envision some future directions, with the hope of inspiring more research work on this important topic. Reviewed works in this survey can be found at https://github.com/zju-pi/Awesome-Diffusion-generated-Image-Detection.
中文摘要:本综述系统梳理了扩散生成图像检测的最新进展,将其分为数据驱动和特征驱动两大类方法,旨在应对深度伪造对图像真实性带来的挑战,并指出了该领域未来研究的关键方向。
English Summary: This survey systematically reviews and classifies recent advances in detecting diffusion-generated deepfake images into data-driven and feature-driven methods, addressing the growing risks they pose to image authenticity while highlighting open challenges and future research directions.

Authors:Jianning Chi, Zelan Li, Geng Lin, MingYang Sun, Xiaosheng Yu
Title: Weakly Supervised Segmentation Framework for Thyroid Nodule Based on High-confidence Labels and High-rationality Losses
Abstract:
Weakly supervised segmentation methods can delineate thyroid nodules in ultrasound images efficiently using training data with coarse labels, but suffer from: 1) low-confidence pseudo-labels that follow topological priors, introducing significant label noise, and 2) low-rationality loss functions that rigidly compare segmentation with labels, ignoring discriminative information for nodules with diverse and complex shapes. To solve these issues, we clarify the objective and references for weakly supervised ultrasound image segmentation, presenting a framework with high-confidence pseudo-labels to represent topological and anatomical information and high-rationality losses to capture multi-level discriminative features. Specifically, we fuse geometric transformations of four-point annotations and MedSAM model results prompted by specific annotations to generate high-confidence box, foreground, and background labels. Our high-rationality learning strategy includes: 1) Alignment loss measuring spatial consistency between segmentation and box label, and topological continuity within the foreground label, guiding the network to perceive nodule location; 2) Contrastive loss pulling features from labeled foreground regions while pushing features from labeled foreground and background regions, guiding the network to learn nodule and background feature distribution; 3) Prototype correlation loss measuring consistency between correlation maps derived by comparing features with foreground and background prototypes, refining uncertain regions to accurate nodule edges. Experimental results show that our method achieves state-of-the-art performance on the TN3K and DDTI datasets. The code is available at https://github.com/bluehenglee/MLI-MSC.
中文: 该框架通过融合四点标注的几何变换与MedSAM模型结果生成高置信度伪标签,并采用空间对齐、对比学习和原型相关等多层次损失策略,有效解决了弱监督超声图像分割中的标注噪声和形状复杂性问题,在TN3K和DDTI数据集上实现了最优性能。
English: The proposed framework addresses limitations in weakly supervised thyroid nodule segmentation by generating high-confidence pseudo-labels from fused geometric transformations and MedSAM outputs, while employing a multi-loss strategy to enhance spatial, topological, and feature discrimination for state-of-the-art results on benchmark datasets.

Authors:Chen-Chen Zong, Sheng-Jun Huang
Title: Rethinking Epistemic and Aleatoric Uncertainty for Active Open-Set Annotation: An Energy-Based Approach
Abstract:
Active learning (AL), which iteratively queries the most informative examples from a large pool of unlabeled candidates for model training, faces significant challenges in the presence of open-set classes. Existing methods either prioritize query examples likely to belong to known classes, indicating low epistemic uncertainty (EU), or focus on querying those with highly uncertain predictions, reflecting high aleatoric uncertainty (AU). However, they both yield suboptimal performance, as low EU corresponds to limited useful information, and closed-set AU metrics for unknown class examples are less meaningful. In this paper, we propose an Energy-based Active Open-set Annotation (EAOA) framework, which effectively integrates EU and AU to achieve superior performance. EAOA features a $(C+1)$-class detector and a target classifier, incorporating an energy-based EU measure and a margin-based energy loss designed for the detector, alongside an energy-based AU measure for the target classifier. Another crucial component is the target-driven adaptive sampling strategy. It first forms a smaller candidate set with low EU scores to ensure closed-set properties, making AU metrics meaningful. Subsequently, examples with high AU scores are queried to form the final query set, with the candidate set size adjusted adaptively. Extensive experiments show that EAOA achieves state-of-the-art performance while maintaining high query precision and low training overhead. The code is available at https://github.com/chenchenzong/EAOA.
主动学习在开放集类别中面临挑战,而提出的EAOA框架通过自适应采样策略有效整合认知和偶然不确定性,实现了最优性能。
Active learning struggles with open-set classes, but the proposed EAOA framework effectively combines epistemic and aleatoric uncertainty with adaptive sampling to achieve state-of-the-art performance.

Authors:Xiongfei Su, Tianyi Zhu, Lina Liu, Zheng Chen, Yulun Zhang, Siyuan Li, Juntian Ye, Feihu Xu, Xin Yuan
Title: Dual-branch Graph Feature Learning for NLOS Imaging
Abstract:
The domain of non-line-of-sight (NLOS) imaging is advancing rapidly, offering the capability to reveal occluded scenes that are not directly visible. However, contemporary NLOS systems face several significant challenges: (1) The computational and storage requirements are profound due to the inherent three-dimensional grid data structure, which restricts practical application. (2) The simultaneous reconstruction of albedo and depth information requires a delicate balance using hyperparameters in the loss function, rendering the concurrent reconstruction of texture and depth information difficult. This paper introduces the innovative methodology, \xnet, which integrates an albedo-focused reconstruction branch dedicated to albedo information recovery and a depth-focused reconstruction branch that extracts geometrical structure, to overcome these obstacles. The dual-branch framework segregates content delivery to the respective reconstructions, thereby enhancing the quality of the retrieved data. To our knowledge, we are the first to employ the GNN as a fundamental component to transform dense NLOS grid data into sparse structural features for efficient reconstruction. Comprehensive experiments demonstrate that our method attains the highest level of performance among existing methods across synthetic and real data. https://github.com/Nicholassu/DG-NLOS.
中文: 创新的\xnet方法通过采用双分支框架分别重建反照率和深度信息,利用图神经网络技术高效处理数据,克服了非视距成像中的关键挑战,实现了卓越性能。
English: The innovative \xnet method overcomes key challenges in non-line-of-sight imaging by employing a dual-branch framework that separately reconstructs albedo and depth information, achieving superior performance through efficient data processing with GNN technology.

Authors:Kanglei Zhou, Zikai Hao, Liyuan Wang, Xiaohui Liang
Title: Adaptive Score Alignment Learning for Continual Perceptual Quality Assessment of 360-Degree Videos in Virtual Reality
Abstract:
Virtual Reality Video Quality Assessment (VR-VQA) aims to evaluate the perceptual quality of 360-degree videos, which is crucial for ensuring a distortion-free user experience. Traditional VR-VQA methods trained on static datasets with limited distortion diversity struggle to balance correlation and precision. This becomes particularly critical when generalizing to diverse VR content and continually adapting to dynamic and evolving video distribution variations. To address these challenges, we propose a novel approach for assessing the perceptual quality of VR videos, Adaptive Score Alignment Learning (ASAL). ASAL integrates correlation loss with error loss to enhance alignment with human subjective ratings and precision in predicting perceptual quality. In particular, ASAL can naturally adapt to continually changing distributions through a feature space smoothing process that enhances generalization to unseen content. To further improve continual adaptation to dynamic VR environments, we extend ASAL with adaptive memory replay as a novel Continul Learning (CL) framework. Unlike traditional CL models, ASAL utilizes key frame extraction and feature adaptation to address the unique challenges of non-stationary variations with both the computation and storage restrictions of VR devices. We establish a comprehensive benchmark for VR-VQA and its CL counterpart, introducing new data splits and evaluation metrics. Our experiments demonstrate that ASAL outperforms recent strong baseline models, achieving overall correlation gains of up to 4.78\% in the static joint training setting and 12.19\% in the dynamic CL setting on various datasets. This validates the effectiveness of ASAL in addressing the inherent challenges of VR-VQA.Our code is available at https://github.com/ZhouKanglei/ASAL_CVQA.
中文: 提出的自适应分数对齐学习(ASAL)方法通过整合相关性与误差损失,有效提升VR视频质量评估与人类感知的一致性及对动态内容的适应性,在静态与持续学习场景中均实现了显著性能提升。
English: The proposed Adaptive Score Alignment Learning (ASAL) method enhances VR video quality assessment by integrating correlation and error losses to better align with human perception and adapt to dynamic content variations, achieving significant performance gains in both static and continual learning settings.

Authors:Hoonhee Cho, Jae-young Kang, Youngho Kim, Kuk-Jin Yoon
Title: Ev-3DOD: Pushing the Temporal Boundaries of 3D Object Detection with Event Cameras
Abstract:
Detecting 3D objects in point clouds plays a crucial role in autonomous driving systems. Recently, advanced multi-modal methods incorporating camera information have achieved notable performance. For a safe and effective autonomous driving system, algorithms that excel not only in accuracy but also in speed and low latency are essential. However, existing algorithms fail to meet these requirements due to the latency and bandwidth limitations of fixed frame rate sensors, e.g., LiDAR and camera. To address this limitation, we introduce asynchronous event cameras into 3D object detection for the first time. We leverage their high temporal resolution and low bandwidth to enable high-speed 3D object detection. Our method enables detection even during inter-frame intervals when synchronized data is unavailable, by retrieving previous 3D information through the event camera. Furthermore, we introduce the first event-based 3D object detection dataset, DSEC-3DOD, which includes ground-truth 3D bounding boxes at 100 FPS, establishing the first benchmark for event-based 3D detectors. The code and dataset are available at https://github.com/mickeykang16/Ev3DOD.
中文: 本研究首次将异步事件相机引入自动驾驶的3D物体检测,利用其高时间分辨率实现帧间高速检测,并发布DSEC-3DOD数据集构建了首个事件驱动的3D检测基准。
English: This study introduces the first asynchronous event camera-based 3D object detection method for autonomous driving, enabling high-speed detection during inter-frame intervals and releasing the DSEC-3DOD dataset to establish an event-based 3D detection benchmark.

Authors:Hugues Turbé, Mina Bjelogrlic, Gianmarco Mengaldo, Christian Lovis
Title: Tell me why: Visual foundation models as self-explainable classifiers
Abstract:
Visual foundation models (VFMs) have become increasingly popular due to their state-of-the-art performance. However, interpretability remains crucial for critical applications. In this sense, self-explainable models (SEM) aim to provide interpretable classifiers that decompose predictions into a weighted sum of interpretable concepts. Despite their promise, recent studies have shown that these explanations often lack faithfulness. In this work, we combine VFMs with a novel prototypical architecture and specialized training objectives. By training only a lightweight head (approximately 1M parameters) on top of frozen VFMs, our approach (ProtoFM) offers an efficient and interpretable solution. Evaluations demonstrate that our approach achieves competitive classification performance while outperforming existing models across a range of interpretability metrics derived from the literature. Code is available at https://github.com/hturbe/proto-fm.
Chinese Summary: 本文提出的ProtoFM方法将视觉基础模型与原型架构及专门训练目标相结合,仅需在冻结模型上训练轻量级头部即可实现高效可解释的分类,在保持竞争力的分类性能同时显著超越了现有模型的可解释性指标。
English Summary: This paper introduces ProtoFM, an efficient and interpretable method that combines visual foundation models with a prototypical architecture and specialized training objectives, achieving competitive classification performance while surpassing existing models in interpretability metrics.

Authors:Achille Nazaret, David Blei
Title: Extremely Greedy Equivalence Search
Abstract:
The goal of causal discovery is to learn a directed acyclic graph from data. One of the most well-known methods for this problem is Greedy Equivalence Search (GES). GES searches for the graph by incrementally and greedily adding or removing edges to maximize a model selection criterion. It has strong theoretical guarantees on infinite data but can fail in practice on finite data. In this paper, we first identify some of the causes of GES's failure, finding that it can get blocked in local optima, especially in denser graphs. We then propose eXtremely Greedy Equivalent Search (XGES), which involves a new heuristic to improve the search strategy of GES while retaining its theoretical guarantees. In particular, XGES favors deleting edges early in the search over inserting edges, which reduces the possibility of the search ending in local optima. A further contribution of this work is an efficient algorithmic formulation of XGES (and GES). We benchmark XGES on simulated datasets with known ground truth. We find that XGES consistently outperforms GES in recovering the correct graphs, and it is 10 times faster. XGES implementations in Python and C++ are available at https://github.com/ANazaret/XGES.
Chinese: 本文提出XGES方法,通过优先删除边的启发式策略改进GES的搜索过程,在保持理论保证的同时有效避免局部最优,显著提升了因果图发现的准确性和计算效率。
English: This paper introduces XGES, an enhanced version of GES that uses a heuristic favoring edge deletion to avoid local optima, improving both accuracy and speed in causal discovery on finite data.

Authors:Yuxin Liu, M. Amin Rahimian
Title: Privacy-Aware Sequential Learning
Abstract:
In settings like vaccination registries, individuals act after observing others, and the resulting public records can expose private information. We study privacy-preserving sequential learning, where agents add endogenous noise to their reported actions to conceal private signals. Efficient social learning relies on information flow, seemingly in conflict with privacy. Surprisingly, with continuous signals and a fixed privacy budget $(ε)$, the optimal randomization strategy balances privacy and accuracy, accelerating learning to $Θ_ε(\log n)$, faster than the nonprivate $Θ(\sqrt{\log n})$ rate. In the nonprivate baseline, the expected time to the first correct action and the number of incorrect actions diverge; under privacy with sufficiently small $ε$, both are finite. Privacy helps because, under the false state, agents more often receive signals contradicting the majority; randomization then asymmetrically amplifies the log-likelihood ratio, enhancing aggregation. In heterogeneous populations, an order-optimal $Θ(\sqrt{n})$ rate is achievable when a subset of agents have low privacy budgets. With binary signals, however, privacy reduces informativeness and impairs learning relative to the nonprivate baseline, though the dependence on $ε$ is nonmonotone. Our results show how privacy reshapes information dynamics and inform the design of platforms and policies.
中文: 在连续信号和固定隐私预算下,隐私保护的序贯学习通过平衡隐私与准确性,意外地将社会学习加速至对数速率;而在二元信号下,尽管对隐私参数的依赖非单调,但隐私通常会削弱学习效果。
English: Privacy-preserving sequential learning with continuous signals and a fixed privacy budget can surprisingly accelerate social learning to a logarithmic rate by balancing privacy and accuracy, while with binary signals it generally impairs learning despite nonmonotonic dependence on the privacy parameter.

Authors:Yucheng Zhang, Beatrice Bevilacqua, Mikhail Galkin, Bruno Ribeiro
Title: TRIX: A More Expressive Model for Zero-shot Domain Transfer in Knowledge Graphs
Abstract:
Fully inductive knowledge graph models can be trained on multiple domains and subsequently perform zero-shot knowledge graph completion (KGC) in new unseen domains. This is an important capability towards the goal of having foundation models for knowledge graphs. In this work, we introduce a more expressive and capable fully inductive model, dubbed TRIX, which not only yields strictly more expressive triplet embeddings (head entity, relation, tail entity) compared to state-of-the-art methods, but also introduces a new capability: directly handling both entity and relation prediction tasks in inductive settings. Empirically, we show that TRIX outperforms the state-of-the-art fully inductive models in zero-shot entity and relation predictions in new domains, and outperforms large-context LLMs in out-of-domain predictions. The source code is available at https://github.com/yuchengz99/TRIX.
Chinese: TRIX是一种完全归纳的知识图谱模型,在表达能力和性能上超越现有最优方法,在新领域的零样本实体和关系预测中表现卓越,并在跨领域预测任务中优于大型上下文语言模型。
English: TRIX is a fully inductive knowledge graph model that surpasses state-of-the-art methods in expressiveness and capability, excelling in zero-shot entity and relation predictions across new domains and outperforming large-context LLMs in out-of-domain tasks.

Authors:Kunato Nishina, Yusuke Matsui
Title: SVGEditBench V2: A Benchmark for Instruction-based SVG Editing
Abstract:
Vector format has been popular for representing icons and sketches. It has also been famous for design purposes. Regarding image editing, research on vector graphics editing rarely exists in contrast with the raster counterpart. We considered the reason to be the lack of datasets and benchmarks. Thus, we propose SVGEditBench V2, a benchmark dataset for instruction-based SVG editing. SVGEditBench V2 comprises triplets of an original image, a ground truth image, and the editing prompt. We built the dataset by first extracting image pairs from various SVG emoji datasets. Then, we had GPT-4o to create the prompt. We found that triplets gained by this simple pipeline contain varying sorts of editing tasks. Additionally, we performed the editing tasks with existing LLMs and investigated how those current methods can perform SVG editing. Although there were some successful cases, we found that there is a massive room for improvement.
中文: 摘要介绍了SVGEditBench V2,这是一个用于基于指令的SVG编辑的新基准数据集,旨在解决矢量图形编辑中数据集和基准的缺乏问题,并评估了当前LLMs在此类任务上的表现,发现仍有巨大的改进空间。
English: The abstract introduces SVGEditBench V2, a new benchmark dataset for instruction-based SVG editing, created to address the scarcity of datasets and benchmarks in vector graphics editing, and evaluates current LLMs' performance on these tasks, revealing significant room for improvement.

Authors:Dayu Yang, Tianyang Liu, Daoan Zhang, Antoine Simoulin, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Xin Qian, Grey Yang, Jiebo Luo, Julian McAuley
Title: Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs
Abstract:
In large language models (LLMs), code and reasoning reinforce each other: code offers an abstract, modular, and logic-driven structure that supports reasoning, while reasoning translates high-level goals into smaller, executable steps that drive more advanced code intelligence. In this study, we examine how code serves as a structured medium for enhancing reasoning: it provides verifiable execution paths, enforces logical decomposition, and enables runtime validation. We also explore how improvements in reasoning have transformed code intelligence from basic completion to advanced capabilities, enabling models to address complex software engineering tasks through planning and debugging. Finally, we identify key challenges and propose future research directions to strengthen this synergy, ultimately improving LLM's performance in both areas.
中文摘要:代码与推理在大语言模型中相互促进,代码提供结构化逻辑框架,推理通过规划和调试实现高级代码智能,未来研究将致力于强化这种协同作用以提升模型性能。
English Summary: Code and reasoning mutually enhance each other in large language models, with code providing structured logical frameworks and reasoning enabling complex code intelligence through planning and debugging, while future research aims to strengthen this synergy.

Authors:Danae Sánchez Villegas, Ingo Ziegler, Desmond Elliott
Title: ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models
Abstract:
Reasoning over sequences of images remains a challenge for multimodal large language models (MLLMs). While recent models incorporate multi-image data during pre-training, they still struggle to recognize sequential structures, often treating images independently. This work introduces ImageChain, a framework that enhances MLLMs with sequential reasoning capabilities over image data by modeling visual sequences as a multi-turn conversation. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Our method optimizes for the task of next-scene description, where the model generates a context-aware description of an upcoming scene based on preceding visual and textual cues. We demonstrate that our approach improves performance on the next-scene description task -- achieving an average improvement from 3.7% to 19% in SimRate, a metric that quantifies semantic similarity to human-annotated ground truths. Moreover, ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics. Extensive experiments validate that instruction-tuning in a multimodal, multi-turn conversation design is key to bridging the gap between static image understanding and temporally-aware reasoning.
中文: ImageChain通过将图像序列建模为多轮对话,增强了多模态大语言模型的顺序推理能力,在下一场景描述等任务中显著提升性能并实现强大的跨领域泛化。
English: ImageChain enhances multimodal large language models by modeling image sequences as multi-turn conversations, significantly improving sequential reasoning and achieving robust performance in tasks like next-scene description.

Authors:Oğuzhan Ersoy, Jari Kolehmainen, Gabriel Passamani Andrade
Title: HDEE: Heterogeneous Domain Expert Ensemble
Abstract:
Training dense LLMs requires enormous amounts of data and centralized compute, which introduces fundamental bottlenecks and ever-growing costs for large models. Several studies aim to reduce this dependency on centralization by reducing the communication overhead of training dense models. Taking this idea of reducing communication overhead to a natural extreme, by training embarrassingly parallelizable ensembles of small independent experts, has been shown to outperform large dense models trained in traditional centralized settings. However, existing studies do not take into account underlying differences amongst data domains and treat them as monolithic, regardless of their underlying complexity, size, or distribution. In this paper, we explore the effects of introducing heterogeneity to these ensembles of domain expert models. Specifically, by allowing models within the ensemble to vary in size--as well as the number of training steps taken depending on the training data's domain--we study the effect heterogeneity has on these ensembles when evaluated against domains included in, and excluded from, the training set. We use the same compute budget to train heterogeneous ensembles and homogeneous baselines for comparison. We show that the heterogeneous ensembles achieve the lowest perplexity scores in $20$ out of the $21$ data domains used in the evaluation. Our code is available at https://github.com/gensyn-ai/hdee.
中文: 研究表明,在相同计算预算下,根据数据领域调整模型规模和训练步长的异构专家集成模型,在大多数评估领域中比同构模型表现更优,困惑度更低。
English: This study demonstrates that heterogeneous ensembles of domain experts, which vary in model size and training steps based on data domain, outperform homogeneous models by achieving lower perplexity in most evaluated domains under the same computational budget.

Authors:Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, Juanzi Li
Title: Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
Abstract:
Reward models (RMs) are crucial for the training and inference-time scaling up of large language models (LLMs). However, existing reward models primarily focus on human preferences, neglecting verifiable correctness signals which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals: factuality and instruction following, to provide more reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference time best-of-n searches on real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We further construct training preference pairs using RewardAgent and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared to conventional reward models. Our codes are publicly released to facilitate further research (https://github.com/THU-KEG/Agentic-Reward-Modeling).
中文: 本文提出代理奖励建模方法,将人类偏好奖励与可验证的正确性信号(如事实性和指令遵循)相结合,为大型语言模型提供更可靠的奖励系统,实验证明其在各项基准测试和实际任务中显著优于传统奖励模型。
English: This paper introduces agentic reward modeling, which integrates human preference rewards with verifiable correctness signals like factuality and instruction following to create more reliable reward systems for large language models, demonstrating superior performance over traditional methods in experiments and downstream tasks.

Authors:Adam Celarek, George Kopanas, George Drettakis, Michael Wimmer, Bernhard Kerbl
Title: Does 3D Gaussian Splatting Need Accurate Volumetric Rendering?
Abstract:
Since its introduction, 3D Gaussian Splatting (3DGS) has become an important reference method for learning 3D representations of a captured scene, allowing real-time novel-view synthesis with high visual quality and fast training times. Neural Radiance Fields (NeRFs), which preceded 3DGS, are based on a principled ray-marching approach for volumetric rendering. In contrast, while sharing a similar image formation model with NeRF, 3DGS uses a hybrid rendering solution that builds on the strengths of volume rendering and primitive rasterization. A crucial benefit of 3DGS is its performance, achieved through a set of approximations, in many cases with respect to volumetric rendering theory. A naturally arising question is whether replacing these approximations with more principled volumetric rendering solutions can improve the quality of 3DGS. In this paper, we present an in-depth analysis of the various approximations and assumptions used by the original 3DGS solution. We demonstrate that, while more accurate volumetric rendering can help for low numbers of primitives, the power of efficient optimization and the large number of Gaussians allows 3DGS to outperform volumetric rendering despite its approximations.
中文: 3D高斯泼溅通过高效优化和大量高斯函数,在实时新视角合成中展现出优越性能,尽管存在近似处理,仍超越了更精确的体积渲染方法。
English: 3D Gaussian Splatting achieves superior performance in real-time novel-view synthesis through efficient optimization and numerous Gaussians, outperforming more accurate volumetric rendering despite its approximations.

Authors:Guoqing Chao, Kaixin Xu, Xijiong Xie, Yongyong Chen
Title: Global Graph Propagation with Hierarchical Information Transfer for Incomplete Contrastive Multi-view Clustering
Abstract:
Incomplete multi-view clustering has become one of the important research problems due to the extensive missing multi-view data in the real world. Although the existing methods have made great progress, there are still some problems: 1) most methods cannot effectively mine the information hidden in the missing data; 2) most methods typically divide representation learning and clustering into two separate stages, but this may affect the clustering performance as the clustering results directly depend on the learned representation. To address these problems, we propose a novel incomplete multi-view clustering method with hierarchical information transfer. Firstly, we design the view-specific Graph Convolutional Networks (GCN) to obtain the representation encoding the graph structure, which is then fused into the consensus representation. Secondly, considering that one layer of GCN transfers one-order neighbor node information, the global graph propagation with the consensus representation is proposed to handle the missing data and learn deep representation. Finally, we design a weight-sharing pseudo-classifier with contrastive learning to obtain an end-to-end framework that combines view-specific representation learning, global graph propagation with hierarchical information transfer, and contrastive clustering for joint optimization. Extensive experiments conducted on several commonly-used datasets demonstrate the effectiveness and superiority of our method in comparison with other state-of-the-art approaches. The code is available at https://github.com/KelvinXuu/GHICMC.
中文: 本文提出了一种新颖的基于层次信息传递的不完整多视图聚类方法,通过图卷积网络、全局图传播和对比学习构建端到端框架,在有效处理缺失数据的同时实现表示学习与聚类的联合优化。
English: This paper introduces a novel incomplete multi-view clustering method using hierarchical information transfer, which integrates graph convolutional networks, global graph propagation, and contrastive learning into an end-to-end framework to jointly optimize representation learning and clustering while effectively handling missing data.

Authors:Zhenyi Zhu, Yuchen Huang, Liu Liu
Title: PhysicsSolver: Transformer-Enhanced Physics-Informed Neural Networks for Forward and Forecasting Problems in Partial Differential Equations
Abstract:
Time-dependent partial differential equations are a significant class of equations that describe the evolution of various physical phenomena over time. One of the open problems in scientific computing is predicting the behaviour of the solution outside the given temporal region. Most traditional numerical methods are applied to a given time-space region and can only accurately approximate the solution of the given region. To address this problem, many deep learning-based methods, basically data-driven and data-free approaches, have been developed to solve these problems. However, most data-driven methods require a large amount of data, which consumes significant computational resources and fails to utilize all the necessary information embedded underlying the partial differential equations (PDEs). Moreover, data-free approaches such as Physics-Informed Neural Networks (PINNs) may not be that ideal in practice, as traditional PINNs, which primarily rely on multilayer perceptrons (MLPs) and convolutional neural networks (CNNs), tend to overlook the crucial temporal dependencies inherent in real-world physical systems. We propose a method denoted as \textbf{PhysicsSolver} that merges the strengths of two approaches: data-free methods that can learn the intrinsic properties of physical systems without using data, and data-driven methods, which are effective at making predictions. Extensive numerical experiments have demonstrated the efficiency and robustness of our proposed method. We provide the code at \href{https://github.com/PhysicsSolver/PhysicsSolver}{https://github.com/PhysicsSolver}.
中文: 该摘要提出PhysicsSolver方法,融合无数据和数据驱动两种方法的优势,解决了传统数值方法在预测时间相关偏微分方程时域外解的局限性,并通过大量数值实验验证了其高效性和鲁棒性。
English: This abstract introduces PhysicsSolver, a hybrid method combining data-free and data-driven approaches to overcome limitations in predicting solutions of time-dependent PDEs beyond given temporal regions, demonstrating superior efficiency and robustness in numerical experiments.

Authors:Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
Title: CritiQ: Mining Data Quality Criteria from Human Preferences
Abstract:
Language model heavily depends on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, training classifiers, or careful prompt engineering, which require significant expert experience and human annotation effort while introduce biases. We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only ~30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base that extracts quality criteria from previous work to boost CritiQ Flow. Compared to perplexity- and classifier- based methods, verbal criteria are more interpretable and possess reusable value. After deriving the criteria, we train the CritiQ Scorer to give quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.1 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We analyze how criteria evolve and the effectiveness of majority voting.
中文:CritiQ是一种新颖的数据选择方法,仅需少量人工标注即可自动从人类偏好中提取质量标准并高效筛选数据,在代码、数学和逻辑领域优于现有方法,同时提升模型性能。
English: CritiQ is a novel data selection method that automatically derives quality criteria from minimal human feedback and efficiently selects high-quality data, outperforming existing approaches in code, math, and logic domains while improving model performance.

Authors:Nadya Abdel Madjid, Murad Mebrahtu, Abdulrahman Ahmad, Abdelmoamen Nasser, Bilal Hassan, Naoufel Werghi, Jorge Dias, Majid Khonji
Title: EMT: A Visual Multi-Task Benchmark Dataset for Autonomous Driving
Abstract:
This paper introduces the Emirates Multi-Task (EMT) dataset, designed to support multi-task benchmarking within a unified framework. It comprises over 30,000 frames from a dash-camera perspective and 570,000 annotated bounding boxes, covering approximately 150 kilometers of driving routes that reflect the distinctive road topology, congestion patterns, and driving behavior of Gulf region traffic. The dataset supports three primary tasks: tracking, trajectory forecasting, and intention prediction. Each benchmark is accompanied by corresponding evaluations: (1) multi-agent tracking experiments addressing multi-class scenarios and occlusion handling; (2) trajectory forecasting evaluation using deep sequential and interaction-aware models; and (3) intention prediction experiments based on observed trajectories. The dataset is publicly available at https://avlab.io/emt-dataset, with pre-processing scripts and evaluation models at https://github.com/AV-Lab/emt-dataset.
中文: EMT数据集包含3万多帧行车记录仪图像和57万个标注框,专为海湾地区交通设计,支持追踪、轨迹预测和意图识别三大任务,并提供公开数据和评估工具。
English: The EMT dataset provides over 30,000 dash-camera frames and 570,000 annotations for multi-task benchmarking in Gulf region traffic, supporting tracking, trajectory forecasting, and intention prediction with public access to data and tools.

Authors:Zhiqiang Wang, Haoyu Wang, Lu Hao
Title: Poster: Long PHP webshell files detection based on sliding window attention
Abstract:
Webshell is a type of backdoor, and web applications are widely exposed to webshell injection attacks. Therefore, it is important to study webshell detection techniques. In this study, we propose a webshell detection method. We first convert PHP source code to opcodes and then extract Opcode Double-Tuples (ODTs). Next, we combine CodeBert and FastText models for feature representation and classification. To address the challenge that deep learning methods have difficulty detecting long webshell files, we introduce a sliding window attention mechanism. This approach effectively captures malicious behavior within long files. Experimental results show that our method reaches high accuracy in webshell detection, solving the problem of traditional methods that struggle to address new webshell variants and anti-detection techniques.
中文: 本研究提出一种Webshell检测方法,通过将PHP代码转换为操作码并提取操作码双元组,结合CodeBert和FastText模型及滑动窗口注意力机制,有效识别长文件中的恶意行为,实验表明该方法对新变种和反检测技术具有高检测精度。
English: This study introduces a webshell detection method that converts PHP code to opcodes, extracts ODTs, and integrates CodeBert with FastText using a sliding window attention mechanism to effectively identify malicious behavior in long files, achieving high accuracy against new variants and anti-detection techniques.

Authors:Li Ju, Xingyi Yang, Qi Li, Xinchao Wang
Title: GraphBridge: Towards Arbitrary Transfer Learning in GNNs
Abstract:
Graph neural networks (GNNs) are conventionally trained on a per-domain, per-task basis. It creates a significant barrier in transferring the acquired knowledge to different, heterogeneous data setups. This paper introduces GraphBridge, a novel framework to enable knowledge transfer across disparate tasks and domains in GNNs, circumventing the need for modifications to task configurations or graph structures. Specifically, GraphBridge allows for the augmentation of any pre-trained GNN with prediction heads and a bridging network that connects the input to the output layer. This architecture not only preserves the intrinsic knowledge of the original model but also supports outputs of arbitrary dimensions. To mitigate the negative transfer problem, GraphBridge merges the source model with a concurrently trained model, thereby reducing the source bias when applied to the target domain. Our method is thoroughly evaluated across diverse transfer learning scenarios, including Graph2Graph, Node2Node, Graph2Node, and graph2point-cloud. Empirical validation, conducted over 16 datasets representative of these scenarios, confirms the framework's capacity for task- and domain-agnostic transfer learning within graph-like data, marking a significant advancement in the field of GNNs. Code is available at https://github.com/jujulili888/GraphBridge.
中文: GraphBridge提出了一种新颖框架,能够在图神经网络中实现跨任务和跨领域的知识迁移,无需修改图结构,通过模型融合有效缓解负迁移问题,并在多种场景下展现出卓越性能。
English: GraphBridge introduces a novel framework enabling knowledge transfer across disparate tasks and domains in graph neural networks without requiring structural modifications, effectively mitigating negative transfer through model merging and demonstrating robust performance across diverse scenarios.

Authors:Haoxin Cai, Shenghai Yuan, Xinyi Li, Junfeng Guo, Jianqi Liu
Title: BEV-LIO(LC): BEV Image Assisted LiDAR-Inertial Odometry with Loop Closure
Abstract:
This work introduces BEV-LIO(LC), a novel LiDAR-Inertial Odometry (LIO) framework that combines Bird's Eye View (BEV) image representations of LiDAR data with geometry-based point cloud registration and incorporates loop closure (LC) through BEV image features. By normalizing point density, we project LiDAR point clouds into BEV images, thereby enabling efficient feature extraction and matching. A lightweight convolutional neural network (CNN) based feature extractor is employed to extract distinctive local and global descriptors from the BEV images. Local descriptors are used to match BEV images with FAST keypoints for reprojection error construction, while global descriptors facilitate loop closure detection. Reprojection error minimization is then integrated with point-to-plane registration within an iterated Extended Kalman Filter (iEKF). In the back-end, global descriptors are used to create a KD-tree-indexed keyframe database for accurate loop closure detection. When a loop closure is detected, Random Sample Consensus (RANSAC) computes a coarse transform from BEV image matching, which serves as the initial estimate for Iterative Closest Point (ICP). The refined transform is subsequently incorporated into a factor graph along with odometry factors, improving the global consistency of localization. Extensive experiments conducted in various scenarios with different LiDAR types demonstrate that BEV-LIO(LC) outperforms state-of-the-art methods, achieving competitive localization accuracy. Our code and video can be found at https://github.com/HxCa1/BEV-LIO-LC.
Chinese: BEV-LIO(LC)是一种新颖的激光雷达惯性里程计框架,通过将BEV图像表示与点云配准及闭环检测相结合,并采用iEKF优化和因子图优化,在各种场景中实现了优于现有方法的定位精度。
English: BEV-LIO(LC) is a novel LiDAR-inertial odometry framework that integrates BEV image representations with point cloud registration and loop closure detection, achieving superior localization accuracy across diverse scenarios through a combination of iEKF optimization and factor graph refinement.

Authors:Jiazheng Li, Yuxiang Zhou, Junru Lu, Gladys Tyen, Lin Gui, Cesare Aloisi, Yulan He
Title: Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time
Abstract:
Although preference optimization methods have improved reasoning performance in Large Language Models (LLMs), they often lack transparency regarding why one reasoning outcome is preferred over another. This limitation is especially critical in Automated Student Answer Scoring (ASAS), where explainability is essential to justify assessment outcomes. Verbal reinforcement learning offers the potential to generate explicit reflection, but it tends to produce superficial critiques that can harm assessment performance. Existing LLMs also struggle to reliably detect subtle reasoning errors in ASAS tasks. Moreover, manually identifying intermediate reasoning errors is expensive and difficult to scale. To address these challenges, we introduce a contrastive reflection synthesis pipeline that generates precise verbal feedback by identifying discrepancies in structure reasoning graph paths. Leveraging these synthetic reflection data, we propose DARS, a Dual-model Reflective Scoring framework featuring a dedicated Critic model trained for effective reflection. DARS achieves strong performance and consistently outperforms existing ASAS baselines across all evaluation metrics. Extensive experiments further provide novel insights into the value of reflection data, framework design, and the scaling behavior of DARS. We release the DARS code at https://github.com/lijiazheng99/DARS.
中文摘要:我们提出的双模型框架通过对比反思合成和语言强化学习,采用专门化的推理器与批判器分工协作,相比传统单模型方法显著提升了推理准确性、透明度及整体性能,验证了“两人智慧胜一人”的优势。
English Summary: Our proposed dual-model framework, featuring specialized Reasoner and Critic models, significantly improves reasoning accuracy, transparency, and performance over traditional single-model approaches by enhancing reflection quality through contrastive synthesis and verbal reinforcement learning.

Authors:Zhouyu Jiang, Mengshu Sun, Zhiqiang Zhang, Lei Liang
Title: Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in Large Language Models (LLMs) but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection due to its implementation simplicity, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce \textbf{Bi'an}, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on Bi'anBench show our 14B model outperforms baseline models with over five times larger parameter scales and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at https://github.com/OpenSPG/KAG.
Chinese: Bi'an框架通过提供双语基准数据集和轻量级评判模型,解决了RAG幻觉检测中的局限性,其140亿参数模型在性能上超越了规模更大的基线模型,并能与顶尖闭源大语言模型相媲美。
English: The Bi'an framework addresses limitations in RAG hallucination detection by providing a bilingual benchmark dataset and lightweight judge models, with its 14B parameter model outperforming larger baselines and competing with top closed-source LLMs.

Authors:Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K. Jain, Virginia Aglietti, Disha Jindal, Peter Chen, Nishanth Dikkala, Gladys Tyen, Xin Liu, Uri Shalit, Silvia Chiappa, Kate Olszewska, Yi Tay, Vinh Q. Tran, Quoc V. Le, Orhan Firat
Title: BIG-Bench Extra Hard
Abstract:
Large language models (LLMs) are increasingly deployed in everyday applications, demanding robust general reasoning capabilities and diverse reasoning skillset. However, current LLM reasoning benchmarks predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allowed for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench, and its harder version BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. We evaluate various models on BBEH and observe a (harmonic) average accuracy of 9.8\% for the best general-purpose model and 44.8\% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: https://github.com/google-deepmind/bbeh.
Chinese: 针对现有推理基准如BIG-Bench Hard的性能饱和问题,新推出的BIG-Bench Extra Hard(BBEH)基准通过引入更具挑战性的任务,揭示了当前大语言模型在通用推理能力上的显著差距,凸显了持续改进的必要性。
English: To address the saturation of existing reasoning benchmarks like BIG-Bench Hard, the new BIG-Bench Extra Hard (BBEH) benchmark introduces more challenging tasks, revealing significant performance gaps and underscoring the ongoing need for improved general reasoning in large language models.

Authors:Mohammad Moulaeifard, Peter H. Charlton, Nils Strodthoff
Title: Generalizable deep learning for photoplethysmography-based blood pressure estimation -- A Benchmarking Study
Abstract:
Photoplethysmography (PPG)-based blood pressure (BP) estimation represents a promising alternative to cuff-based BP measurements. Recently, an increasing number of deep learning models have been proposed to infer BP from the raw PPG waveform. However, these models have been predominantly evaluated on in-distribution test sets, which immediately raises the question of the generalizability of these models to external datasets. To investigate this question, we trained five deep learning models on the recently released PulseDB dataset, provided in-distribution benchmarking results on this dataset, and then assessed out-of-distribution performance on several external datasets. The best model (XResNet1d101) achieved in-distribution MAEs of 9.4 and 6.0 mmHg for systolic and diastolic BP respectively on PulseDB (with subject-specific calibration), and 14.0 and 8.5 mmHg respectively without calibration. Equivalent MAEs on external test datasets without calibration ranged from 15.0 to 25.1 mmHg (SBP) and 7.0 to 10.4 mmHg (DBP). Our results indicate that the performance is strongly influenced by the differences in BP distributions between datasets. We investigated a simple way of improving performance through sample-based domain adaptation and put forward recommendations for training models with good generalization properties. With this work, we hope to educate more researchers for the importance and challenges of out-of-distribution generalization.
Chinese: 本研究评估了基于光电容积脉搏波描记法的深度学习模型在血压估计中的泛化能力,发现模型在外部数据集上性能显著下降,并提出了领域自适应方法以提高模型的鲁棒性。
English: This study evaluates the generalizability of deep learning models for photoplethysmography-based blood pressure estimation, revealing significant performance drops on external datasets and proposing domain adaptation methods to enhance model robustness.

Authors:Kaiwen Yan, Hongcheng Guo, Xuanqing Shi, Shaosheng Cao, Donglin Di, Zhoujun Li
Title: CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation
Abstract:
With the rapid advancement of Large Language Models (LLMs), the demand for robust instruction-following capabilities in code generation tasks has grown significantly. Code generation not only facilitates faster prototyping and automated testing, but also augments developer efficiency through improved maintainability and reusability of code. In this paper, we introduce CodeIF, the first benchmark specifically designed to assess the abilities of LLMs to adhere to task-oriented instructions within diverse code generation scenarios. CodeIF encompasses a broad range of tasks, including function synthesis, error debugging, algorithmic refactoring, and code explanation, thereby providing a comprehensive suite to evaluate model performance across varying complexity levels and programming domains. We conduct extensive experiments with LLMs, analyzing their strengths and limitations in meeting the demands of these tasks. The experimental results offer valuable insights into how well current models align with human instructions, as well as the extent to which they can generate consistent, maintainable, and contextually relevant code. Our findings not only underscore the critical role that instruction-following LLMs can play in modern software development, but also illuminate pathways for future research aimed at enhancing their adaptability, reliability, and overall effectiveness in automated code generation. CodeIF data and code are publicly available: https://github.com/lin-rany/codeIF
中文: 本文介绍了首个专门评估大语言模型在多样化代码生成场景中遵循指令能力的基准CodeIF,通过广泛实验揭示了模型在任务执行中的优势与不足,并强调了其在现代软件开发中的关键作用及未来研究方向。
English: This paper introduces CodeIF, the first benchmark designed to evaluate how well Large Language Models follow instructions in code generation tasks across diverse scenarios like function synthesis and debugging, revealing their strengths and limitations while highlighting their potential role in software development.

Authors:Henry Peng Zou, Zhengyao Gu, Yue Zhou, Yankai Chen, Weizhi Zhang, Liancheng Fang, Yibo Wang, Yangning Li, Kay Liu, Philip S. Yu
Title: TestNUC: Enhancing Test-Time Computing Approaches and Scaling through Neighboring Unlabeled Data Consistency
Abstract:
Test-time computing approaches, which leverage additional computational resources during inference, have been proven effective in enhancing large language model performance. This work introduces a novel, linearly scaling approach, TestNUC, that improves test-time predictions by leveraging the local consistency of neighboring unlabeled data-it classifies an input instance by considering not only the model's prediction on that instance but also on neighboring unlabeled instances. We evaluate TestNUC across eight diverse datasets, spanning intent classification, topic mining, domain discovery, and emotion detection, demonstrating its consistent superiority over baseline methods such as standard prompting and self-consistency. Furthermore, TestNUC can be seamlessly integrated with existing test-time computing approaches, substantially boosting their performance. Our analysis reveals that TestNUC scales effectively with increasing amounts of unlabeled data and performs robustly across different embedding models, making it practical for real-world applications. Our code is available at https://github.com/HenryPengZou/TestNUC.
中文: TestNUC是一种创新的测试时计算方法,通过利用相邻未标记数据的局部一致性来提高预测精度,在多个数据集上表现优于基线方法,并能与现有方法无缝集成。
English: TestNUC is a novel test-time computing method that enhances prediction accuracy by leveraging the local consistency of neighboring unlabeled data, demonstrating consistent superiority across diverse datasets and seamless integration with existing approaches.

Authors:Xuan Ding, Rui Sun, Yunjian Zhang, Xiu Yan, Yueqi Zhou, Kaihao Huang, Suzhong Fu, Angelica I Aviles-Rivero, Chuanlong Xie, Yao Zhu
Title: A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs
Abstract:
Compared to width-wise pruning, depth-wise pruning can significantly accelerate inference in resource-constrained scenarios. However, treating the entire Transformer layer as the minimum pruning unit may degrade model performance by indiscriminately discarding the entire information of the layer. This paper reveals the ``Patch-like'' feature relationship between layers in large language models by analyzing the correlation of the outputs of different layers in the reproducing kernel Hilbert space. Building on this observation, we propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a pre-defined similarity threshold, thereby simplifying the model structure while maintaining its performance. Extensive experiments on LLMs with various architectures and different parameter scales show that our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, in the experiment with 35% pruning on the Vicuna-7B model, our method achieved a 1.654% improvement in average performance on zero-shot tasks compared to the existing method. Moreover, we further reveal the potential of combining depth pruning with width pruning to enhance the pruning effect. Our codes are available at https://github.com/920927/SLM-a-sliding-layer-merging-method.
中文: 深度剪枝虽能加速推理,但移除整个层会损害性能,因此本文提出滑动层合并方法,基于相似度动态选择并融合连续层,在零样本任务和剪枝后重训练恢复上优于现有技术,如在Vicuna-7B模型上实现1.654%的平均性能提升。
English: Depth-wise pruning accelerates inference but risks performance loss by removing entire layers, so this paper introduces a sliding layer merging method that selectively fuses consecutive layers based on similarity, outperforming existing techniques in zero-shot tasks and retraining recovery, as shown in experiments including a 1.654% improvement on Vicuna-7B.

Authors:Junlong Ren, Hao Wu, Hui Xiong, Hao Wang
Title: SCA3D: Enhancing Cross-modal 3D Retrieval via 3D Shape and Caption Paired Data Augmentation
Abstract:
The cross-modal 3D retrieval task aims to achieve mutual matching between text descriptions and 3D shapes. This has the potential to enhance the interaction between natural language and the 3D environment, especially within the realms of robotics and embodied artificial intelligence (AI) applications. However, the scarcity and expensiveness of 3D data constrain the performance of existing cross-modal 3D retrieval methods. These methods heavily rely on features derived from the limited number of 3D shapes, resulting in poor generalization ability across diverse scenarios. To address this challenge, we introduce SCA3D, a novel 3D shape and caption online data augmentation method for cross-modal 3D retrieval. Our approach uses the LLaVA model to create a component library, captioning each segmented part of every 3D shape within the dataset. Notably, it facilitates the generation of extensive new 3D-text pairs containing new semantic features. We employ both inter and intra distances to align various components into a new 3D shape, ensuring that the components do not overlap and are closely fitted. Further, text templates are utilized to process the captions of each component and generate new text descriptions. Besides, we use unimodal encoders to extract embeddings for 3D shapes and texts based on the enriched dataset. We then calculate fine-grained cross-modal similarity using Earth Mover's Distance (EMD) and enhance cross-modal matching with contrastive learning, enabling bidirectional retrieval between texts and 3D shapes. Extensive experiments show our SCA3D outperforms previous works on the Text2Shape dataset, raising the Shape-to-Text RR@1 score from 20.03 to 27.22 and the Text-to-Shape RR@1 score from 13.12 to 16.67. Codes can be found in https://github.com/3DAgentWorld/SCA3D.
中文: SCA3D提出了一种新颖的在线数据增强方法,通过组件分割和描述生成丰富的3D-文本对,利用对比学习和Earth Mover's Distance对齐显著提升了跨模态3D检索性能。
English: SCA3D introduces a novel online data augmentation method that generates enriched 3D-text pairs using component segmentation and captioning, significantly improving cross-modal 3D retrieval performance through contrastive learning and Earth Mover's Distance alignment.

Authors:Ziyuan Luo, Anderson Rocha, Boxin Shi, Qing Guo, Haoliang Li, Renjie Wan
Title: The NeRF Signature: Codebook-Aided Watermarking for Neural Radiance Fields
Abstract:
Neural Radiance Fields (NeRF) have been gaining attention as a significant form of 3D content representation. With the proliferation of NeRF-based creations, the need for copyright protection has emerged as a critical issue. Although some approaches have been proposed to embed digital watermarks into NeRF, they often neglect essential model-level considerations and incur substantial time overheads, resulting in reduced imperceptibility and robustness, along with user inconvenience. In this paper, we extend the previous criteria for image watermarking to the model level and propose NeRF Signature, a novel watermarking method for NeRF. We employ a Codebook-aided Signature Embedding (CSE) that does not alter the model structure, thereby maintaining imperceptibility and enhancing robustness at the model level. Furthermore, after optimization, any desired signatures can be embedded through the CSE, and no fine-tuning is required when NeRF owners want to use new binary signatures. Then, we introduce a joint pose-patch encryption watermarking strategy to hide signatures into patches rendered from a specific viewpoint for higher robustness. In addition, we explore a Complexity-Aware Key Selection (CAKS) scheme to embed signatures in high visual complexity patches to enhance imperceptibility. The experimental results demonstrate that our method outperforms other baseline methods in terms of imperceptibility and robustness. The source code is available at: https://github.com/luo-ziyuan/NeRF_Signature.
中文: 本文提出NeRF Signature方法,通过码本辅助签名嵌入和联合位姿-补丁加密策略,在不改变模型结构的情况下为神经辐射场嵌入难以察觉且鲁棒的数字水印,有效提升版权保护的隐蔽性和稳健性。
English: This paper introduces NeRF Signature, a novel watermarking method that embeds imperceptible and robust digital signatures into Neural Radiance Fields without altering the model structure, using a codebook-aided approach and joint pose-patch encryption strategy.

Authors:Michelle Kappl
Title: Are All Spanish Doctors Male? Evaluating Gender Bias in German Machine Translation
Abstract:
We present WinoMTDE, a new gender bias evaluation test set designed to assess occupational stereotyping and underrepresentation in German machine translation (MT) systems. Building on the automatic evaluation method introduced by arXiv:1906.00591v1, we extend the approach to German, a language with grammatical gender. The WinoMTDE dataset comprises 288 German sentences that are balanced in regard to gender, as well as stereotype, which was annotated using German labor statistics. We conduct a large-scale evaluation of five widely used MT systems and a large language model. Our results reveal persistent bias in most models, with the LLM outperforming traditional systems. The dataset and evaluation code are publicly available under https://github.com/michellekappl/mt_gender_german.
中文摘要:WinoMTDE是一个用于评估德语机器翻译系统中职业性别偏见的数据集,研究发现大多数模型存在持续偏见,其中大型语言模型表现最优。
English Summary: WinoMTDE is a German gender bias evaluation dataset designed to test occupational stereotyping in machine translation systems, revealing persistent biases in most models with large language models performing best.

Authors:Siwei Wu, Yizhi Li, Xingwei Qu, Rishi Ravikumar, Yucheng Li, Tyler Loakman, Shanghaoran Quan, Xiaoyong Wei, Riza Batista-Navarro, Chenghua Lin
Title: LongEval: A Comprehensive Analysis of Long-Text Generation Through a Plan-based Paradigm
Abstract:
Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, yet their ability to generate long-form content remains poorly understood and evaluated. Our analysis reveals that current LLMs struggle with length requirements and information density in long-text generation, with performance deteriorating as text length increases. To quantitively locate such a performance degradation and provide further insights on model development, we present LongEval, a benchmark that evaluates long-text generation through both direct and plan-based generation paradigms, inspired by cognitive and linguistic writing models. The comprehensive experiments in this work reveal interesting findings such as that while model size correlates with generation ability, the small-scale model (e.g., LongWriter), well-trained on long texts, has comparable performance. All code and datasets are released in https://github.com/Wusiwei0410/LongEval.
中文: 大型语言模型在生成长文本方面存在困难,为此我们开发了LongEval基准,通过不同生成方法评估性能,发现经过长文本训练的小型模型也能达到大型模型的水平。
English: Large Language Models struggle with long-text generation, leading to the creation of LongEval, a benchmark that evaluates performance across different generation methods and reveals that smaller models trained on long texts can match larger ones.

Authors:Qingyao Tian, Huai Liao, Xinyan Huang, Bingyu Yang, Dongdong Lei, Sebastien Ourselin, Hongbin Liu
Title: EndoMamba: An Efficient Foundation Model for Endoscopic Videos via Hierarchical Pre-training
Abstract:
Endoscopic video-based tasks, such as visual navigation and surgical phase recognition, play a crucial role in minimally invasive surgeries by providing real-time assistance. While recent video foundation models have shown promise, their applications are hindered by (1) computational inefficiencies and (2) suboptimal performance caused by limited data for pre-training in endoscopy. To address these issues, we present EndoMamba, a foundation model designed for real-time inference while learning generalized spatiotemporal representations. First, to mitigate computational inefficiencies, we propose the EndoMamba backbone, optimized for real-time inference. Inspired by recent advancements in state space models, EndoMamba integrates Bidirectional Mamba blocks for spatial modeling within individual frames and vanilla Mamba blocks for past-to-present reasoning across the temporal domain. This design enables both strong spatiotemporal modeling and efficient inference in online video streams. Second, we propose a self-supervised hierarchical pre-training diagram to enhance EndoMamba's representation learning using endoscopic videos and incorporating general video domain knowledge. Specifically, our approach combines masked reconstruction with auxiliary supervision, leveraging low-level reconstruction to capture spatial-temporal structures and high-level alignment to transfer broader knowledge from a pretrained general-video domain foundation model. Extensive experiments on four downstream tasks--classification, segmentation, surgical phase recognition, and localization--demonstrate that EndoMamba outperforms existing foundation models and task-specific methods while maintaining real-time inference speed. The source code is available at https://github.com/TianCuteQY/EndoMamba.
中文: EndoMamba是一种计算高效的基础模型,通过结合双向和普通Mamba模块进行时空建模,并采用分层预训练方法,在保持实时推理的同时,显著提升了内窥镜视频分析在分类、分割、手术阶段识别和定位等多种任务中的性能表现。
English: EndoMamba is a computationally efficient foundation model that overcomes limitations in endoscopic video analysis by integrating bidirectional and vanilla Mamba blocks for spatiotemporal modeling and employing hierarchical pre-training to achieve superior performance across multiple surgical tasks while maintaining real-time inference.

Authors:Fraser Birks, Thomas D Swinburne, James R Kermode
Title: Efficient and Accurate Spatial Mixing of Machine Learned Interatomic Potentials for Materials Science
Abstract:
Machine-learned interatomic potentials offer near first-principles accuracy but are computationally expensive, limiting their application in large-scale molecular dynamics simulations. Inspired by quantum mechanics/molecular mechanics methods, we present ML-MIX, an efficient and flexible LAMMPS package for accelerating simulations by spatially mixing interatomic potentials of different complexities. Through constrained linear fitting, we show it is possible to generate a 'cheap' approximate model which closely matches an 'expensive' reference in relevant regions of configuration space. We demonstrate the capability of ML-MIX through case-studies in Si, Fe, and W-He systems, achieving up to an 11x speedup on 8,000 atom systems without sacrificing accuracy on static and dynamic quantities, including calculation of minimum energy paths and dynamical simulations of defect diffusion. For larger domain sizes, we show that the achievable speedup of ML-MIX simulations is limited only by the relative speed of the cheap potential over the expensive potential. The ease of use and flexible nature of this method will extend the practical reach of MLIPs throughout computational materials science, enabling parsimonious application to large spatial and temporal domains.
中文摘要:ML-MIX是一个LAMMPS软件包,通过混合不同复杂度的原子间势能来加速分子动力学模拟,在硅和铁等材料系统中实现了高达11倍的加速,同时保持计算精度。
English Summary: ML-MIX is a LAMMPS package that accelerates molecular dynamics simulations by mixing interatomic potentials of varying complexities, achieving up to 11x speedup while maintaining accuracy in materials systems like Si and Fe.

Authors:Yiheng Yang, Yujie Wang, Chi Ma, Lei Yu, Emmanuele Chersoni, Chu-Ren Huang
Title: Sparse Brains are Also Adaptive Brains: Cognitive-Load-Aware Dynamic Activation for LLMs
Abstract:
Dense large language models(LLMs) face critical efficiency bottlenecks as they rigidly activate all parameters regardless of input complexity. While existing sparsity methods(static pruning or dynamic activation) address this partially, they either lack adaptivity to contextual or model structural demands or incur prohibitive computational overhead. Inspired by human brain's dual-process mechanisms - predictive coding (N400) for backbone sparsity and structural reanalysis (P600) for complex context - we propose CLADA, a \textit{\textbf{C}ognitive-\textbf{L}oad-\textbf{A}ware \textbf{D}ynamic \textbf{A}ctivation} framework that synergizes statistical sparsity with semantic adaptability. Our key insight is that LLM activations exhibit two complementary patterns: 1) \textit{Global statistical sparsity} driven by sequence-level prefix information, and 2) \textit{Local semantic adaptability} modulated by cognitive load metrics(e.g., surprisal and entropy). CLADA employs a hierarchical thresholding strategy: a baseline from offline error-controlled optimization ensures 40\%+ sparsity, dynamically adjusted by real-time cognitive signals. Evaluations across six mainstream LLMs and nine benchmarks demonstrate that CLADA achieves \textbf{~20\% average speedup with <2\% accuracy drop}, outperforming Griffin (5\%+ degradation) and TT (negligible speedup). Crucially, we establish the first formal connection between neurolinguistic event-related potential (ERP) components and LLM efficiency mechanisms through multi-level regression analysis ($R^2=0.17$ for sparsity-adaptation synergy). Requiring no retraining or architectural changes, CLADA offers a deployable solution for resource-aware LLM inference while advancing biologically-inspired AI design. Our code is available at \href{https://github.com/Oldify/CLADA}{CLADA}.
中文摘要:CLADA是一种认知负载感知的动态激活框架,通过结合统计稀疏性与语义适应性来提升大语言模型效率,在实现约20%加速的同时保持精度损失低于2%,并建立了与大脑神经语言机制的首次正式关联。
English Summary: CLADA is a cognitive-load-aware dynamic activation framework that enhances LLM efficiency by combining statistical sparsity with semantic adaptability, achieving ~20% speedup with minimal accuracy loss while establishing a neurolinguistic connection to brain mechanisms.

Authors:Ujjwal Singh, Aditi Sharma, Nikhil Gupta, Deepakshi, Vivek Kumar Jha
Title: IndicEval-XL: Bridging Linguistic Diversity in Code Generation Across Indic Languages
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation from natural language prompts, revolutionizing software development workflows. As we advance towards agent-based development paradigms, these models form the cornerstone of next-generation software development lifecycles. However, current benchmarks for evaluating multilingual code generation capabilities are predominantly English-centric, limiting their applicability across the global developer community. To address this limitation, we present IndicEval-XL, a comprehensive benchmark for code generation that incorporates 6 major Indic languages, collectively spoken by approximately 14\% of the world's population. Our benchmark bridges these languages with 12 programming languages, creating a robust evaluation framework. This work is particularly significant given India's representation of one-eighth of the global population and the crucial role Indic languages play in Indian society. IndicEval-XL represents a significant step toward expanding the linguistic diversity in code generation systems and evaluation frameworks. By developing resources that support multiple languages, we aim to make AI-powered development tools more inclusive and accessible to developers of various linguistic backgrounds. To facilitate further research and development in this direction, we make our dataset and evaluation benchmark publicly available at https://github.com/telekom/IndicEval-XL
中文:IndicEval-XL 是一个涵盖六种主要印度语言与十二种编程语言的综合性多语言代码生成基准,通过公开数据集解决了当前以英语为中心的限制,推动了人工智能开发工具的包容性发展。
English: IndicEval-XL is a comprehensive multilingual code generation benchmark that integrates six major Indic languages with twelve programming languages, addressing the current English-centric limitations and promoting inclusivity in AI development tools by making the dataset publicly available.

Authors:Hao Liang, Meiyi Qiang, Yuying Li, Zefeng He, Yongzhen Guo, Zhengzhou Zhu, Wentao Zhang, Bin Cui
Title: MathClean: A Benchmark for Synthetic Mathematical Data Cleaning
Abstract:
With the rapid development of large language models (LLMs), the quality of training data has become crucial. Among the various types of training data, mathematical data plays a key role in enabling LLMs to acquire strong reasoning abilities. While high-quality open-source data is important, it is often insufficient for pre-training, necessitating the addition of synthetic math problems. However, synthetic math questions and answers can introduce inaccuracies, which may degrade both the training data and web data. Therefore, an effective method for cleaning synthetic math data is essential. In this paper, we propose the MathClean benchmark to evaluate the effectiveness of math data cleaning models. The MathClean benchmark consists of 2,000 correct questions and 2,000 erroneous questions with additional 2,000 correct and erroneous answers sourced from augmented data based on GSM8K and MATH. Moreover, we also annotate error types for each question or answer, since it can assess whether models can correctly identify the error categories for future improvements. Finally, we present comprehensive evaluations using state-of-the-art (SOTA) models. Our results demonstrate that even strong models like GPT-o1 and DeepSeek-R1 perform poorly on this benchmark, highlighting the utility of MathClean. Our code and data is available at https://github.com/YuYingLi0/MathClean.
中文: MathClean基准被提出用于评估数学数据清洗模型,包含4000个标注的问题和答案以解决合成训练数据中的错误,测试表明即使是GPT-o1和DeepSeek-R1等先进模型在此任务上也表现不佳。
English: The MathClean benchmark is introduced to evaluate math data cleaning models, comprising 4,000 annotated questions and answers to address inaccuracies in synthetic training data, with tests showing even advanced models like GPT-o1 and DeepSeek-R1 struggle on this task.

Authors:Jiebin Yan, Ziwen Tan, Yuming Fang, Jiale Rao, Yifan Zuo
Title: Max360IQ: Blind Omnidirectional Image Quality Assessment with Multi-axis Attention
Abstract:
Omnidirectional image, also called 360-degree image, is able to capture the entire 360-degree scene, thereby providing more realistic immersive feelings for users than general 2D image and stereoscopic image. Meanwhile, this feature brings great challenges to measuring the perceptual quality of omnidirectional images, which is closely related to users' quality of experience, especially when the omnidirectional images suffer from non-uniform distortion. In this paper, we propose a novel and effective blind omnidirectional image quality assessment (BOIQA) model with multi-axis attention (Max360IQ), which can proficiently measure not only the quality of uniformly distorted omnidirectional images but also the quality of non-uniformly distorted omnidirectional images. Specifically, the proposed Max360IQ is mainly composed of a backbone with stacked multi-axis attention modules for capturing both global and local spatial interactions of extracted viewports, a multi-scale feature integration (MSFI) module to fuse multi-scale features and a quality regression module with deep semantic guidance for predicting the quality of omnidirectional images. Experimental results demonstrate that the proposed Max360IQ outperforms the state-of-the-art Assessor360 by 3.6\% in terms of SRCC on the JUFE database with non-uniform distortion, and gains improvement of 0.4\% and 0.8\% in terms of SRCC on the OIQA and CVIQ databases, respectively. The source code is available at https://github.com/WenJuing/Max360IQ.
Chinese: 本文提出了一种名为Max360IQ的新型盲视全景图像质量评估模型,通过多轴注意力模块和多尺度特征融合技术,能够有效评估均匀与非均匀失真的360度图像质量,并在多个数据库中展现出优于现有方法的性能表现。
English: This paper introduces Max360IQ, a novel blind omnidirectional image quality assessment model that effectively evaluates both uniformly and non-uniformly distorted 360-degree images by utilizing multi-axis attention modules and multi-scale feature integration, demonstrating superior performance over existing methods.

Authors:Hui Feng, Yuntzu Yin, Emiliano Reynares, Jay Nanavati
Title: OntologyRAG: Better and Faster Biomedical Code Mapping with Retrieval-Augmented Generation (RAG) Leveraging Ontology Knowledge Graphs and Large Language Models
Abstract:
Biomedical ontologies, which comprehensively define concepts and relations for biomedical entities, are crucial for structuring and formalizing domain-specific information representations. Biomedical code mapping identifies similarity or equivalence between concepts from different ontologies. Obtaining high-quality mapping usually relies on automatic generation of unrefined mapping with ontology domain fine-tuned language models (LMs), followed by manual selections or corrections by coding experts who have extensive domain expertise and familiarity with ontology schemas. The LMs usually provide unrefined code mapping suggestions as a list of candidates without reasoning or supporting evidence, hence coding experts still need to verify each suggested candidate against ontology sources to pick the best matches. This is also a recurring task as ontology sources are updated regularly to incorporate new research findings. Consequently, the need of regular LM retraining and manual refinement make code mapping time-consuming and labour intensive. In this work, we created OntologyRAG, an ontology-enhanced retrieval-augmented generation (RAG) method that leverages the inductive biases from ontological knowledge graphs for in-context-learning (ICL) in large language models (LLMs). Our solution grounds LLMs to knowledge graphs with unrefined mappings between ontologies and processes questions by generating an interpretable set of results that include prediction rational with mapping proximity assessment. Our solution doesn't require re-training LMs, as all ontology updates could be reflected by updating the knowledge graphs with a standard process. Evaluation results on a self-curated gold dataset show promises of using our method to enable coding experts to achieve better and faster code mapping. The code is available at https://github.com/iqvianlp/ontologyRAG.
中文摘要:OntologyRAG提出了一种基于知识图谱的检索增强生成方法,通过结合本体知识提升大语言模型在生物医学代码映射中的表现,无需重复训练即可提供带推理依据的可解释映射结果。
English Summary: OntologyRAG introduces a retrieval-augmented generation method that uses ontological knowledge graphs to enhance large language models for biomedical code mapping, eliminating the need for retraining and providing interpretable results with reasoning.

Authors:Shijun Zhang, Hongkai Zhao, Yimin Zhong, Haomin Zhou
Title: Fourier Multi-Component and Multi-Layer Neural Networks: Unlocking High-Frequency Potential
Abstract:
The architecture of a neural network and the selection of its activation function are both fundamental to its performance. Equally vital is ensuring these two elements are well-matched, as their alignment is key to achieving effective representation and learning. In this paper, we introduce the Fourier Multi-Component and Multi-Layer Neural Network (FMMNN), a novel model that creates a strong synergy between them. We demonstrate that FMMNNs are highly effective and flexible in modeling high-frequency components. Our theoretical results demonstrate that FMMNNs have exponential expressive power for function approximation. We also analyze the optimization landscape of FMMNNs and find it to be much more favorable than that of standard fully connected neural networks, especially when dealing with high-frequency features. In addition, we propose a scaled random initialization method for the first layer's weights in FMMNNs, which significantly speeds up training and enhances overall performance. Extensive numerical experiments support our theoretical insights, showing that FMMNNs consistently outperform traditional approaches in accuracy and efficiency across various tasks.
中文: FMMNN模型通过架构与激活函数的协同作用,实现了指数级表达能力和更优的优化特性,在多种任务中均以更高精度和效率超越传统网络方法。
English: The FMMNN model synergizes architecture and activation functions to achieve exponential expressive power and favorable optimization, outperforming traditional networks in accuracy and efficiency across diverse tasks.

Authors:Shuyi Liu, Simiao Cui, Haoran Bu, Yuming Shang, Xi Zhang
Title: JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across various applications, highlighting the urgent need for comprehensive safety evaluations. In particular, the enhanced Chinese language proficiency of LLMs, combined with the unique characteristics and complexity of Chinese expressions, has driven the emergence of Chinese-specific benchmarks for safety assessment. However, these benchmarks generally fall short in effectively exposing LLM safety vulnerabilities. To address the gap, we introduce JailBench, the first comprehensive Chinese benchmark for evaluating deep-seated vulnerabilities in LLMs, featuring a refined hierarchical safety taxonomy tailored to the Chinese context. To improve generation efficiency, we employ a novel Automatic Jailbreak Prompt Engineer (AJPE) framework for JailBench construction, which incorporates jailbreak techniques to enhance assessing effectiveness and leverages LLMs to automatically scale up the dataset through context-learning. The proposed JailBench is extensively evaluated over 13 mainstream LLMs and achieves the highest attack success rate against ChatGPT compared to existing Chinese benchmarks, underscoring its efficacy in identifying latent vulnerabilities in LLMs, as well as illustrating the substantial room for improvement in the security and trustworthiness of LLMs within the Chinese context. Our benchmark is publicly available at https://github.com/STAIR-BUPT/JailBench.
中文摘要:JailBench是首个针对大语言模型深层漏洞的中文综合评测基准,采用自动越狱提示工程框架,相比现有基准在ChatGPT上实现了更高的攻击成功率。
English Summary: JailBench is introduced as the first comprehensive Chinese benchmark designed to evaluate deep-seated vulnerabilities in large language models, employing an automatic jailbreak prompt engineering framework to achieve higher attack success rates than existing benchmarks.

Authors:Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng
Title: TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation
Abstract:
Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at https://github.com/bigai-nlco/TokenSwift.
Chinese: TOKENSWIFT 是一种创新框架,通过消除频繁的模型重载、优化动态键值管理和减少重复生成,解决了超长序列生成中的关键瓶颈,在保持输出质量的同时,在不同规模和架构的模型上实现了超过3倍的加速效果。
English: TOKENSWIFT is a novel framework that addresses key bottlenecks in ultra-long sequence generation by eliminating frequent model reloading, optimizing KV management, and reducing repetitive generation, achieving over 3x speedup across various model scales and architectures while preserving output quality.

Authors:Jacob Dunefsky, Arman Cohan
Title: One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
Abstract:
Steering vectors (SVs) have emerged as a promising approach for interpreting and controlling LLMs, but current methods typically require large contrastive datasets that are often impractical to construct and may capture spurious correlations. We propose directly optimizing SVs through gradient descent on a single training example, and systematically investigate how these SVs generalize. We consider several SV optimization techniques and find that the resulting SVs effectively mediate safety-relevant behaviors in multiple models. Indeed, in experiments on an alignment-faking model, we are able to optimize one-shot SVs that induce harmful behavior on benign examples and whose negations suppress harmful behavior on malign examples. And in experiments on refusal suppression, we demonstrate that one-shot optimized SVs can transfer across inputs, yielding a Harmbench attack success rate of 96.9%. Furthermore, we extend work on "emergent misalignment" and show that SVs optimized to induce a model to write vulnerable code cause the model to respond harmfully on unrelated open-ended prompts. Finally, we use one-shot SV optimization to investigate how an instruction-tuned LLM recovers from outputting false information, and find that this ability is independent of the model's explicit verbalization that the information was false. Overall, our findings suggest that optimizing SVs on a single example can mediate a wide array of misaligned behaviors in LLMs. Code can be found at https://github.com/jacobdunefsky/one-shot-steering-repro and https://github.com/jacobdunefsky/one-shot-steering-misalignment.
中文: 本研究提出通过单一样本的梯度下降优化导向向量,证明其能有效调控大型语言模型中的多种未对齐行为,包括安全操纵和拒绝抑制等。
English: This study introduces a method to optimize steering vectors using just one training example via gradient descent, demonstrating their effectiveness in controlling various misaligned behaviors in large language models, including safety manipulation and refusal suppression.

Authors:Jungin Kim, Shinwoo Park, Yo-Sub Han
Title: Marking Code Without Breaking It: Code Watermarking for Detecting LLM-Generated Code
Abstract:
Code watermarking identifies AI-generated code by embedding patterns into the code during generation. Effective watermarking requires meeting two key conditions: the watermark should be reliably detectable, and the code should retain its original functionality. However, existing methods often modify tokens that are critical for program logic, such as keywords in conditional expressions or operators in arithmetic computations. These modifications can cause syntax errors or functional failures, limiting the practical use of watermarking. We present STONE, a method that preserves functional integrity by selectively inserting watermarks only into non-syntax tokens. By excluding tokens essential for code execution, STONE minimizes the risk of functional degradation. In addition, we introduce CWEM, a comprehensive evaluation metric that evaluates watermarking techniques based on correctness, detectability, and naturalness. While correctness and detectability have been widely used, naturalness remains underexplored despite its importance. Unnatural patterns can reveal the presence of a watermark, making it easier for adversaries to remove. We evaluate STONE using CWEM and compare its performance with the state-of-the-art approach. The results show that STONE achieves an average improvement of 7.69% in CWEM across Python, C++, and Java. Our code is available in https://github.com/inistory/STONE-watermarking/.
中文: STONE提出了一种代码水印方法,仅在非语法标记中嵌入可检测模式以保持功能完整性,并通过CWEM评估指标在正确性、可检测性和自然性方面进行测试,结果显示其性能比现有方法平均提高了7.69%。
English: STONE introduces a code watermarking method that embeds detectable patterns only in non-syntax tokens to preserve functionality, and it is evaluated using the CWEM metric which considers correctness, detectability, and naturalness, showing a 7.69% average improvement over existing approaches.

Authors:Zichuan Fu, Wentao Song, Yejing Wang, Xian Wu, Yefeng Zheng, Yingying Zhang, Derong Xu, Xuetao Wei, Tong Xu, Xiangyu Zhao
Title: Sliding Window Attention Training for Efficient Large Language Models
Abstract:
Recent advances in transformer-based Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their quadratic computational complexity concerning sequence length remains a significant bottleneck for processing long documents. As a result, many efforts like sparse attention and state space models have been proposed to improve the efficiency of LLMs over long sequences. Though effective, these approaches compromise the performance or introduce structural complexity. This calls for a simple yet efficient model that preserves the fundamental Transformer architecture. To this end, we introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of softmax operation. Then, we replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention. Experiments demonstrate that SWAT achieves SOTA performance compared with state-of-the-art linear recurrent architectures on eight benchmarks. Code is available at https://github.com/Fzkuji/swat-attention.
Chinese Summary: 本文提出SWAT模型,通过用sigmoid替换softmax并结合平衡位置嵌入,有效提升了Transformer处理长文本的能力,在多个基准测试中实现了最优性能。
English Summary: The paper introduces SWAT, a model that enhances long-context processing in Transformers by replacing softmax with sigmoid and using balanced position embeddings, achieving state-of-the-art efficiency and performance.

Authors:Yifan Hu, Yuante Li, Peiyuan Liu, Yuxia Zhu, Naiqi Li, Tao Dai, Shu-tao Xia, Dawei Cheng, Changjun Jiang
Title: FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting
Abstract:
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making, capturing valuable historical information that can be leveraged for profitable investment strategies. Not surprisingly, this area has attracted considerable attention from researchers, who have proposed a wide range of methods based on various backbones. However, the evaluation of the area often exhibits three systemic limitations: 1. Failure to account for the full spectrum of stock movement patterns observed in dynamic financial markets. (Diversity Gap), 2. The absence of unified assessment protocols undermines the validity of cross-study performance comparisons. (Standardization Deficit), and 3. Neglect of critical market structure factors, resulting in inflated performance metrics that lack practical applicability. (Real-World Mismatch). Addressing these limitations, we propose FinTSB, a comprehensive and practical benchmark for financial time series forecasting (FinTSF). To increase the variety, we categorize movement patterns into four specific parts, tokenize and pre-process the data, and assess the data quality based on some sequence characteristics. To eliminate biases due to different evaluation settings, we standardize the metrics across three dimensions and build a user-friendly, lightweight pipeline incorporating methods from various backbones. To accurately simulate real-world trading scenarios and facilitate practical implementation, we extensively model various regulatory constraints, including transaction fees, among others. Finally, we conduct extensive experiments on FinTSB, highlighting key insights to guide model selection under varying market conditions. Overall, FinTSB provides researchers with a novel and comprehensive platform for improving and evaluating FinTSF methods. The code is available at https://github.com/TongjiFinLab/FinTSBenchmark.
中文摘要:针对金融时间序列分析存在的多样性不足、标准化缺失和实际应用性差等问题,FinTSB基准通过模式分类、统一指标和监管约束建模,为改进和评估预测方法提供了全面平台。
English Summary: Financial time series analysis faces challenges in diversity, standardization, and real-world applicability, which the proposed FinTSB benchmark addresses through pattern categorization, unified metrics, and regulatory modeling to enhance forecasting methods.

Authors:Dung V. Nguyen, Minh H. Nguyen, Luc Q. Nguyen, Rachel S. Y. Teo, Tan M. Nguyen, Linh Duy Tran
Title: CAMEx: Curvature-aware Merging of Experts
Abstract:
Existing methods for merging experts during model training and fine-tuning predominantly rely on Euclidean geometry, which assumes a flat parameter space. This assumption can limit the model's generalization ability, especially during the pre-training phase, where the parameter manifold might exhibit more complex curvature. Curvature-aware merging methods typically require additional information and computational resources to approximate the Fisher Information Matrix, adding memory overhead. In this paper, we introduce CAMEx (Curvature-Aware Merging of Experts), a novel expert merging protocol that incorporates natural gradients to account for the non-Euclidean curvature of the parameter manifold. By leveraging natural gradients, CAMEx adapts more effectively to the structure of the parameter space, improving alignment between model updates and the manifold's geometry. This approach enhances both pre-training and fine-tuning, resulting in better optimization trajectories and improved generalization without the substantial memory overhead typically associated with curvature-aware methods. Our contributions are threefold: (1) CAMEx significantly outperforms traditional Euclidean-based expert merging techniques across various natural language processing tasks, leading to enhanced performance during pre-training and fine-tuning; (2) we introduce a dynamic merging architecture that optimizes resource utilization, achieving high performance while reducing computational costs, facilitating efficient scaling of large language models; and (3) we provide both theoretical and empirical evidence to demonstrate the efficiency of our proposed method. The code is publicly available at: https://github.com/kpup1710/CAMEx.
中文: 本文提出CAMEx,一种利用自然梯度适应参数流形曲率的专家合并方法,在提升模型泛化能力和优化效果的同时,避免了传统曲率感知方法的高内存开销。
English: This paper introduces CAMEx, a curvature-aware expert merging protocol that uses natural gradients to better align with the parameter manifold's geometry, enhancing model generalization and optimization without the high memory costs of traditional methods.

Authors:Shuliang Liu, Xinze Li, Zhenghao Liu, Yukun Yan, Cheng Yang, Zheni Zeng, Zhiyuan Liu, Maosong Sun, Ge Yu
Title: Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models
Abstract:
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models provide the potential to produce high-quality judgments, but they are highly sensitive to evaluation prompts, leading to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, utilize the judge-consistency to evaluate these judgments and select the accepted and rejected judgments for DPO training. Our experiments show that ConsJudge can effectively provide more accurate judgments for optimizing RAG models across various RAG models and datasets. Further analysis reveals that judgments generated by ConsJudge have a high agreement with the superior LLM. All codes are available at https://github.com/OpenBMB/ConsJudge.
中文: 本文提出Judge-Consistency方法,通过生成多维度判断并利用一致性筛选,有效提升大语言模型对RAG模型的评估准确性,在不同模型和数据集上均能优化RAG性能。
English: This paper introduces the Judge-Consistency (ConsJudge) method to enhance LLMs' evaluation accuracy for RAG models by generating diverse judgments and using consistency for selection, effectively improving RAG optimization across various models and datasets.

Authors:Chenyang Zhao, Kun Wang, Janet H. Hsiao, Antoni B. Chan
Title: Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP
Abstract:
Significant progress has been achieved on the improvement and downstream usages of the Contrastive Language-Image Pre-training (CLIP) vision-language model, while less attention is paid to the interpretation of CLIP. We propose a Gradient-based visual and textual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for specific input image-text pair. By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results. Different from the previous Transformer interpretation methods that focus on the utilization of self-attention maps, which are typically extremely sparse in CLIP, we produce high-quality visual explanations by applying channel and spatial weights on token features. Qualitative and quantitative evaluations verify the effectiveness and superiority of Grad-ECLIP compared with the state-of-the-art methods. Furthermore, a series of analysis are conducted based on our visual and textual explanation results, from which we explore the working mechanism of image-text matching, the strengths and limitations in attribution identification of CLIP, and the relationship between the concreteness/abstractness of a word and its usage in CLIP. Finally, based on the ability of explanation map that indicates text-specific saliency region of input image, we also propose an application with Grad-ECLIP, which is adopted to boost the fine-grained alignment in the CLIP fine-tuning. The code of Grad-ECLIP is available here: https://github.com/Cyang-Zhao/Grad-Eclip.
中文: 本文提出Grad-ECLIP,这是一种基于梯度的解释方法,通过分析中间特征为CLIP的图文匹配结果提供视觉和文本解释,经评估证明其优越性,并能在微调过程中实现更精细的对齐优化。
English: This paper introduces Grad-ECLIP, a gradient-based method that provides visual and textual explanations for CLIP's image-text matching results by analyzing intermediate features, demonstrating superior effectiveness through evaluations and enabling improved fine-grained alignment during fine-tuning.

Authors:Ruifeng Tan, Weixiang Hong, Jiayue Tang, Xibin Lu, Ruijun Ma, Xiang Zheng, Jia Li, Jiaqiang Huang, Tong-Yi Zhang
Title: BatteryLife: A Comprehensive Dataset and Benchmark for Battery Life Prediction
Abstract:
Battery Life Prediction (BLP), which relies on time series data produced by battery degradation tests, is crucial for battery utilization, optimization, and production. Despite impressive advancements, this research area faces three key challenges. Firstly, the limited size of existing datasets impedes insights into modern battery life data. Secondly, most datasets are restricted to small-capacity lithium-ion batteries tested under a narrow range of diversity in labs, raising concerns about the generalizability of findings. Thirdly, inconsistent and limited benchmarks across studies obscure the effectiveness of baselines and leave it unclear if models popular in other time series fields are effective for BLP. To address these challenges, we propose BatteryLife, a comprehensive dataset and benchmark for BLP. BatteryLife integrates 16 datasets, offering a 2.5 times sample size compared to the previous largest dataset, and provides the most diverse battery life resource with batteries from 8 formats, 59 chemical systems, 9 operating temperatures, and 421 charge/discharge protocols, including both laboratory and industrial tests. Notably, BatteryLife is the first to release battery life datasets of zinc-ion batteries, sodium-ion batteries, and industry-tested large-capacity lithium-ion batteries. With the comprehensive dataset, we revisit the effectiveness of baselines popular in this and other time series fields. Furthermore, we propose CyclePatch, a plug-in technique that can be employed in various neural networks. Extensive benchmarking of 18 methods reveals that models popular in other time series fields can be unsuitable for BLP, and CyclePatch consistently improves model performance establishing state-of-the-art benchmarks. Moreover, BatteryLife evaluates model performance across aging conditions and domains. BatteryLife is available at https://github.com/Ruifeng-Tan/BatteryLife.
中文摘要:电池寿命预测面临数据集有限、测试条件单一和基准不一致的挑战,而提出的BatteryLife数据集和CyclePatch技术有效解决了这些问题,显著提升了模型性能并建立了新的性能基准。
English Summary: Battery Life Prediction faces challenges of limited datasets, narrow testing conditions, and inconsistent benchmarks, which are addressed by the proposed BatteryLife dataset and CyclePatch technique that significantly enhance model performance and establish new standards.

Authors:Zhiyuan Peng, Xin Yin, Rui Qian, Peiqin Lin, Yongkang Liu, Hao Zhang, Chenhao Ying, Yuan Luo
Title: SolEval: Benchmarking Large Language Models for Repository-level Solidity Code Generation
Abstract:
Large language models (LLMs) have transformed code generation. However, most existing approaches focus on mainstream languages such as Python and Java, neglecting the Solidity language, the predominant programming language for Ethereum smart contracts. Due to the lack of adequate benchmarks for Solidity, LLMs' ability to generate secure, cost-effective smart contracts remains unexplored. To fill this gap, we construct SolEval, the first repository-level benchmark designed for Solidity smart contract generation, to evaluate the performance of LLMs on Solidity. SolEval consists of 1,507 samples from 28 different repositories, covering 6 popular domains, providing LLMs with a comprehensive evaluation benchmark. Unlike the existing Solidity benchmark, SolEval not only includes complex function calls but also reflects the real-world complexity of the Ethereum ecosystem by incorporating Gas@k and Vul@k. We evaluate 16 LLMs on SolEval, and our results show that the best-performing LLM achieves only 26.29% Pass@10, highlighting substantial room for improvement in Solidity code generation by LLMs. Additionally, we conduct supervised fine-tuning (SFT) on Qwen-7B using SolEval, resulting in a significant performance improvement, with Pass@5 increasing from 16.67% to 58.33%, demonstrating the effectiveness of fine-tuning LLMs on our benchmark. We release our data and code at https://github.com/pzy2000/SolEval.
中文: 大型语言模型在代码生成方面取得进展,但缺乏对以太坊智能合约主要语言Solidity的评估,为此我们构建了首个仓库级基准SolEval,评估结果显示LLMs性能有限且存在较大提升空间,而微调能显著提高其表现。
English: Large language models (LLMs) have advanced code generation but lack evaluation for Solidity, the primary language for Ethereum smart contracts, prompting the creation of SolEval, the first repository-level benchmark that assesses LLMs' performance and reveals significant room for improvement, with fine-tuning showing notable gains.

Authors:Yuxiang Wang, Xinnan Dai, Wenqi Fan, Yao Ma
Title: Exploring Graph Tasks with Pure LLMs: A Comprehensive Benchmark and Investigation
Abstract:
Graph-structured data has become increasingly prevalent across various domains, raising the demand for effective models to handle graph tasks like node classification and link prediction. Traditional graph learning models like Graph Neural Networks (GNNs) have made significant strides, but their capabilities in handling graph data remain limited in certain contexts. In recent years, large language models (LLMs) have emerged as promising candidates for graph tasks, yet most studies focus primarily on performance benchmarks and fail to address their broader potential, including their ability to handle limited data, their transferability across tasks, and their robustness. In this work, we provide a comprehensive exploration of LLMs applied to graph tasks. We evaluate the performance of pure LLMs, including those without parameter optimization and those fine-tuned with instructions, across various scenarios. Our analysis goes beyond accuracy, assessing LLM ability to perform in few-shot/zero-shot settings, transfer across domains, understand graph structures, and demonstrate robustness in challenging scenarios. We conduct extensive experiments with 16 graph learning models alongside 6 LLMs (e.g., Llama3B, GPT-4o, Qwen-plus), comparing their performance on datasets like Cora, PubMed, ArXiv, and Products. Our findings show that LLMs, particularly those with instruction tuning, outperform traditional models in few-shot settings, exhibit strong domain transferability, and demonstrate excellent generalization and robustness. This work offers valuable insights into the capabilities of LLMs for graph learning, highlighting their advantages and potential for real-world applications, and paving the way for future research in this area. Codes and datasets are released in https://github.com/myflashbarry/LLM-benchmarking.
中文: 本研究全面评估大语言模型在图任务中的表现,发现经过指令微调的模型在少样本学习、跨领域迁移能力和鲁棒性方面显著优于传统图学习方法。
English: This study comprehensively evaluates large language models (LLMs) for graph tasks, demonstrating that instruction-tuned LLMs excel in few-shot learning, cross-domain transferability, and robustness compared to traditional graph models.

Authors:Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, Yanghua Xiao
Title: Reward Shaping to Mitigate Reward Hacking in RLHF
Abstract:
Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to \emph{reward hacking}, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. Although reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests two key design principles: (1) the RL reward should be bounded, and (2) the RL reward benefits from rapid initial growth followed by gradual convergence. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. We evaluated PAR on two base models, Gemma2-2B, and Llama3-8B, using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR's superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate of at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and maintains robustness against reward hacking even after two full epochs of training. The code is available at https://github.com/PorUna-byte/PAR, and the Work done during the internship at StepFun by Jiayi Fu.
Chinese: 本研究提出了一种新颖的奖励塑造方法PAR,通过利用奖励模型中的潜在偏好信号来有效缓解人类反馈强化学习中的奖励破解问题,在评估中展现出卓越的性能和鲁棒性。
English: This study introduces Preference As Reward (PAR), a novel reward shaping method that effectively mitigates reward hacking in reinforcement learning from human feedback by leveraging latent preferences, demonstrating superior performance and robustness in evaluations.

Authors:Chenlu Ju, Jiaxin Liu, Shobhit Sinha, Hao Xue, Flora Salim
Title: TrajLLM: A Modular LLM-Enhanced Agent-Based Framework for Realistic Human Trajectory Simulation
Abstract:
This work leverages Large Language Models (LLMs) to simulate human mobility, addressing challenges like high costs and privacy concerns in traditional models. Our hierarchical framework integrates persona generation, activity selection, and destination prediction, using real-world demographic and psychological data to create realistic movement patterns. Both physical models and language models are employed to explore and demonstrate different methodologies for human mobility simulation. By structuring data with summarization and weighted density metrics, the system ensures scalable memory management while retaining actionable insights. Preliminary results indicate that LLM-driven simulations align with observed real-world patterns, offering scalable, interpretable insights for social problems such as urban planning, traffic management, and public health. The framework's ability to dynamically generate personas and activities enables it to provide adaptable and realistic daily routines. This study demonstrates the transformative potential of LLMs in advancing mobility modeling for societal and urban applications. The source code and interactive demo for our framework are available at https://github.com/cju0/TrajLLM.
本研究利用大型语言模型通过分层框架模拟人类移动行为,生成逼真的活动轨迹,为城市规划和公共卫生提供可扩展的解决方案,同时解决了传统方法在隐私保护和成本方面的难题。
This study uses Large Language Models to simulate human mobility through a hierarchical framework that generates realistic movement patterns, offering scalable solutions for urban planning and public health while addressing privacy and cost issues of traditional methods.

Authors:Siqi Guo, Ilgee Hong, Vicente Balmaseda, Changlong Yu, Liang Qiu, Xin Liu, Haoming Jiang, Tuo Zhao, Tianbao Yang
Title: Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data
Abstract:
Supervised fine-tuning (SFT) has become a crucial step for aligning pretrained large language models (LLMs) using supervised datasets of input-output pairs. However, despite being supervised, SFT is inherently limited by its generative training objective. To address its limitations, the existing common strategy is to follow SFT with a separate phase of preference optimization (PO), which relies on either human-labeled preference data or a strong reward model to guide the learning process. In this paper, we address the limitations of SFT by exploring one of the most successful techniques in conventional supervised learning: discriminative learning. We introduce Discriminative Fine-Tuning (DFT), an improved variant of SFT, which mitigates the burden of collecting human-labeled preference data or training strong reward models. Unlike SFT that employs a generative approach and overlooks negative data, DFT adopts a discriminative paradigm that increases the probability of positive answers while suppressing potentially negative ones, aiming for data prediction instead of token prediction. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT's effectiveness, achieving performance better than SFT and comparable to if not better than SFT$\rightarrow$PO. The code can be found at https://github.com/Optimization-AI/DFT.
Chinese: 本文提出了一种改进的监督微调方法——判别式微调(DFT),它采用判别式学习范式增强正面回答的概率并抑制负面回答,无需依赖偏好数据或奖励模型即可实现优于传统方法或与之相当的性能。
English: This paper introduces Discriminative Fine-Tuning (DFT), an enhanced version of supervised fine-tuning that uses discriminative learning to increase the likelihood of positive responses while reducing negative ones, achieving performance comparable to or better than traditional methods without requiring preference data or reward models.

Authors:Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He
Title: Chain of Draft: Thinking Faster by Writing Less
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance in solving complex reasoning tasks through mechanisms like Chain-of-Thought (CoT) prompting, which emphasizes verbose, step-by-step reasoning. However, humans typically employ a more efficient strategy: drafting concise intermediate thoughts that capture only essential information. In this work, we propose Chain of Draft (CoD), a novel paradigm inspired by human cognitive processes, where LLMs generate minimalistic yet informative intermediate reasoning outputs while solving tasks. By reducing verbosity and focusing on critical insights, CoD matches or surpasses CoT in accuracy while using as little as only 7.6% of the tokens, significantly reducing cost and latency across various reasoning tasks. Our code and data are available at https://github.com/sileix/chain-of-draft.
中文总结:提出的思维草稿链(CoD)方法让大语言模型生成极简的中间推理步骤,仅用7.6%的词汇量即可达到或超越思维链的准确率,显著提升效率。
English Summary: The proposed Chain of Draft (CoD) method enables LLMs to produce minimal intermediate reasoning steps, achieving comparable or superior accuracy to Chain-of-Thought while using only 7.6% of tokens for greater efficiency.

Authors:Anton Lavrouk, Tarek Naous, Alan Ritter, Wei Xu
Title: What are Foundation Models Cooking in the Post-Soviet World?
Abstract:
The culture of the Post-Soviet states is complex, shaped by a turbulent history that continues to influence current events. In this study, we investigate the Post-Soviet cultural food knowledge of foundation models by constructing BORSch, a multimodal dataset encompassing 1147 and 823 dishes in the Russian and Ukrainian languages, centered around the Post-Soviet region. We demonstrate that leading models struggle to correctly identify the origins of dishes from Post-Soviet nations in both text-only and multimodal Question Answering (QA), instead over-predicting countries linked to the language the question is asked in. Through analysis of pretraining data, we show that these results can be explained by misleading dish-origin co-occurrences, along with linguistic phenomena such as Russian-Ukrainian code mixing. Finally, to move beyond QA-based assessments, we test models' abilities to produce accurate visual descriptions of dishes. The weak correlation between this task and QA suggests that QA alone may be insufficient as an evaluation of cultural understanding. To foster further research, we will make BORSch publicly available at https://github.com/alavrouk/BORSch.
中文: 本研究通过引入BORSch后苏联菜肴多模态数据集,揭示了基础模型因语言偏见常误判菜肴起源,并证明仅靠问答不足以评估文化理解能力。
English: This study introduces BORSch, a multimodal dataset of Post-Soviet dishes, revealing that foundation models often misattribute dish origins due to linguistic biases and demonstrates that question-answering alone is inadequate for evaluating cultural understanding.

Authors:Zhewei Kang, Xuandong Zhao, Dawn Song
Title: Scalable Best-of-N Selection for Large Language Models via Self-Certainty
Abstract:
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size $N$, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities. The code is available at https://github.com/backprop07/Self-Certainty
中文: 本文提出“自确定性”这一新指标,利用大语言模型内部概率分布来高效评估回答质量,无需外部奖励模型,实验证明其在推理任务中比现有方法具有更好的可扩展性和性能表现。
English: The paper introduces "self-certainty," a novel metric that uses LLMs' internal probability distributions to efficiently evaluate response quality without external rewards, demonstrating improved scalability and performance across reasoning tasks compared to existing methods.

Authors:Zike Yuan, Ming Liu, Hui Wang, Bing Qin
Title: MA-GTS: A Multi-Agent Framework for Solving Complex Graph Problems in Real-World Applications
Abstract:
Graph-theoretic problems arise in real-world applications like logistics, communication networks, and traffic optimization. These problems are often complex, noisy, and irregular, posing challenges for traditional algorithms. Large language models (LLMs) offer potential solutions but face challenges, including limited accuracy and input length constraints. To address these challenges, we propose MA-GTS (Multi-Agent Graph Theory Solver), a multi-agent framework that decomposes these complex problems through agent collaboration. MA-GTS maps the implicitly expressed text-based graph data into clear, structured graph representations and dynamically selects the most suitable algorithm based on problem constraints and graph structure scale. This approach ensures that the solution process remains efficient and the resulting reasoning path is interpretable. We validate MA-GTS using the G-REAL dataset, a real-world-inspired graph theory dataset we created. Experimental results show that MA-GTS outperforms state-of-the-art approaches in terms of efficiency, accuracy, and scalability, with strong results across multiple benchmarks (G-REAL 94.2%, GraCoRe 96.9%, NLGraph 98.4%).MA-GTS is open-sourced at https://github.com/ZIKEYUAN/MA-GTS.git.
中文:我们提出MA-GTS多智能体框架,通过将文本数据转化为结构化图并动态选择算法,有效解决复杂图论问题,在多个基准测试中实现了卓越的效率、准确性和可扩展性。
English: We propose MA-GTS, a multi-agent framework that effectively solves complex graph problems by transforming text data into structured graphs and dynamically selecting algorithms, achieving superior efficiency, accuracy, and scalability across benchmarks.

Authors:PIN AI Team, Bill Sun, Gavin Guo, Regan Peng, Boliang Zhang, Shouqiao Wang, Laura Florescu, Xi Wang, Davide Crapis, Ben Wu
Title: GOD model: Privacy Preserved AI School for Personal Assistant
Abstract:
Personal AI assistants (e.g., Apple Intelligence, Meta AI) offer proactive recommendations that simplify everyday tasks, but their reliance on sensitive user data raises concerns about privacy and trust. To address these challenges, we introduce the Guardian of Data (GOD), a secure, privacy-preserving framework for training and evaluating AI assistants directly on-device. Unlike traditional benchmarks, the GOD model measures how well assistants can anticipate user needs-such as suggesting gifts-while protecting user data and autonomy. Functioning like an AI school, it addresses the cold start problem by simulating user queries and employing a curriculum-based approach to refine the performance of each assistant. Running within a Trusted Execution Environment (TEE), it safeguards user data while applying reinforcement and imitation learning to refine AI recommendations. A token-based incentive system encourages users to share data securely, creating a data flywheel that drives continuous improvement. Specifically, users mine with their data, and the mining rate is determined by GOD's evaluation of how well their AI assistant understands them across categories such as shopping, social interactions, productivity, trading, and Web3. By integrating privacy, personalization, and trust, the GOD model provides a scalable, responsible path for advancing personal AI assistants. For community collaboration, part of the framework is open-sourced at https://github.com/PIN-AI/God-Model.
中文: GOD框架提出了一种安全的设备端训练与评估系统,通过可信执行环境、激励策略和模拟学习,在实现智能助手主动推荐功能的同时保障用户隐私,为负责任的人工智能发展提供可扩展路径。
English: The GOD framework introduces a secure, on-device training and evaluation system for personal AI assistants that balances proactive recommendations with privacy protection through a TEE, incentive mechanisms, and simulated learning to advance responsible AI development.

Authors:Yao Su, Keqi Han, Mingjie Zeng, Lichao Sun, Liang Zhan, Carl Yang, Lifang He, Xiangnan Kong
Title: End-to-End Deep Learning for Structural Brain Imaging: A Unified Framework
Abstract:
Brain imaging analysis is fundamental in neuroscience, providing valuable insights into brain structure and function. Traditional workflows follow a sequential pipeline-brain extraction, registration, segmentation, parcellation, network generation, and classification-treating each step as an independent task. These methods rely heavily on task-specific training data and expert intervention to correct intermediate errors, making them particularly burdensome for high-dimensional neuroimaging data, where annotations and quality control are costly and time-consuming. We introduce UniBrain, a unified end-to-end framework that integrates all processing steps into a single optimization process, allowing tasks to interact and refine each other. Unlike traditional approaches that require extensive task-specific annotations, UniBrain operates with minimal supervision, leveraging only low-cost labels (i.e., classification and extraction) and a single labeled atlas. By jointly optimizing extraction, registration, segmentation, parcellation, network generation, and classification, UniBrain enhances both accuracy and computational efficiency while significantly reducing annotation effort. Experimental results demonstrate its superiority over existing methods across multiple tasks, offering a more scalable and reliable solution for neuroimaging analysis. Our code and data can be found at https://github.com/Anonymous7852/UniBrain
中文: UniBrain是一种统一端到端框架,将神经影像处理步骤整合为单一优化流程,在减少标注需求的同时显著提高了分析的准确性和效率。
English: UniBrain is a unified end-to-end framework that integrates all neuroimaging processing steps into a single optimization process, enhancing accuracy and efficiency while minimizing annotation requirements.

Authors:Sefik Serengil, Alper Ozpinar
Title: CipherFace: A Fully Homomorphic Encryption-Driven Framework for Secure Cloud-Based Facial Recognition
Abstract:
Facial recognition systems rely on embeddings to represent facial images and determine identity by verifying if the distance between embeddings is below a pre-tuned threshold. While embeddings are not reversible to original images, they still contain sensitive information, making their security critical. Traditional encryption methods like AES are limited in securely utilizing cloud computational power for distance calculations. Homomorphic Encryption, allowing calculations on encrypted data, offers a robust alternative. This paper introduces CipherFace, a homomorphic encryption-driven framework for secure cloud-based facial recognition, which we have open-sourced at http://github.com/serengil/cipherface. By leveraging FHE, CipherFace ensures the privacy of embeddings while utilizing the cloud for efficient distance computation. Furthermore, we propose a novel encrypted distance computation method for both Euclidean and Cosine distances, addressing key challenges in performing secure similarity calculations on encrypted data. We also conducted experiments with different facial recognition models, various embedding sizes, and cryptosystem configurations, demonstrating the scalability and effectiveness of CipherFace in real-world applications.
Chinese: 本文提出了CipherFace,一个基于同态加密的开源框架,通过保护敏感嵌入特征并高效计算加密的欧几里得与余弦距离,实现安全的云端人脸识别,适用于实际应用场景。
English: This paper introduces CipherFace, an open-source homomorphic encryption framework that enables secure cloud-based facial recognition by protecting sensitive embeddings while efficiently computing encrypted Euclidean and Cosine distances for real-world applications.

Authors:Ivoline Ngong, Swanand Kadhe, Hao Wang, Keerthiram Murugesan, Justin D. Weisz, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy
Title: Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents
Abstract:
Conversational agents are increasingly woven into individuals' personal lives, yet users often underestimate the privacy risks associated with them. The moment users share information with these agents-such as large language models (LLMs)-their private information becomes vulnerable to exposure. In this paper, we characterize the notion of contextual privacy for user interactions with LLM-based Conversational Agents (LCAs). It aims to minimize privacy risks by ensuring that users (sender) disclose only information that is both relevant and necessary for achieving their intended goals when interacting with LCAs (untrusted receivers). Through a formative design user study, we observe how even "privacy-conscious" users inadvertently reveal sensitive information through indirect disclosures. Based on insights from this study, we propose a locally deployable framework that operates between users and LCAs, identifying and reformulating out-of-context information in user prompts. Our evaluation using examples from ShareGPT shows that lightweight models can effectively implement this framework, achieving strong gains in contextual privacy while preserving the user's intended interaction goals. Notably, about 76% of participants in our human evaluation preferred the reformulated prompts over the original ones, validating the usability and effectiveness of contextual privacy in our proposed framework. We opensource the code at https://github.com/IBM/contextual-privacy-LLM.
中文: 本文提出了面向大语言模型对话代理的情境隐私概念,通过一个本地部署框架有效重构用户提示以最小化非必要信息泄露,在保持交互目标的同时获得76%参与者对隐私增强版本的选择偏好。
English: This paper introduces contextual privacy for LLM-based conversational agents, proposing a local framework that effectively reformulates user prompts to minimize unnecessary information disclosure while preserving interaction goals, with 76% of participants preferring the privacy-enhanced versions.

Authors:Yukun Chen, Shuo Shao, Enhao Huang, Yiming Li, Pin-Yu Chen, Zhan Qin, Kui Ren
Title: REFINE: Inversion-Free Backdoor Defense via Model Reprogramming
Abstract:
Backdoor attacks on deep neural networks (DNNs) have emerged as a significant security threat, allowing adversaries to implant hidden malicious behaviors during the model training phase. Pre-processing-based defense, which is one of the most important defense paradigms, typically focuses on input transformations or backdoor trigger inversion (BTI) to deactivate or eliminate embedded backdoor triggers during the inference process. However, these methods suffer from inherent limitations: transformation-based defenses often fail to balance model utility and defense performance, while BTI-based defenses struggle to accurately reconstruct trigger patterns without prior knowledge. In this paper, we propose REFINE, an inversion-free backdoor defense method based on model reprogramming. REFINE consists of two key components: \textbf{(1)} an input transformation module that disrupts both benign and backdoor patterns, generating new benign features; and \textbf{(2)} an output remapping module that redefines the model's output domain to guide the input transformations effectively. By further integrating supervised contrastive loss, REFINE enhances the defense capabilities while maintaining model utility. Extensive experiments on various benchmark datasets demonstrate the effectiveness of our REFINE and its resistance to potential adaptive attacks.
中文摘要:REFINE是一种基于模型重编程的无逆向后门防御方法,通过输入转换和输出重映射模块协同工作,在无需触发模式重构的情况下有效消除后门威胁,同时保持模型性能。
English Summary: REFINE is a novel backdoor defense method that uses model reprogramming with input transformation and output remapping to neutralize embedded triggers without inversion, effectively balancing security and model performance.

Authors:Aman Goel, Xian Carrie Wu, Zhe Wang, Dmitriy Bespalov, Yanjun Qi
Title: TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice
Abstract:
Jailbreaking large-language models (LLMs) involves testing their robustness against adversarial prompts and evaluating their ability to withstand prompt attacks that could elicit unauthorized or malicious responses. In this paper, we present TurboFuzzLLM, a mutation-based fuzzing technique for efficiently finding a collection of effective jailbreaking templates that, when combined with harmful questions, can lead a target LLM to produce harmful responses through black-box access via user prompts. We describe the limitations of directly applying existing template-based attacking techniques in practice, and present functional and efficiency-focused upgrades we added to mutation-based fuzzing to generate effective jailbreaking templates automatically. TurboFuzzLLM achieves $\geq$ 95\% attack success rates (ASR) on public datasets for leading LLMs (including GPT-4o \& GPT-4 Turbo), shows impressive generalizability to unseen harmful questions, and helps in improving model defenses to prompt attacks. TurboFuzzLLM is available open source at https://github.com/amazon-science/TurboFuzzLLM.
中文摘要:TurboFuzzLLM是一种基于变异的模糊测试技术,能自动生成有效的越狱模板来测试大语言模型的鲁棒性,对GPT-4o等领先模型的攻击成功率超过95%,同时有助于增强模型防御能力。
English Summary: TurboFuzzLLM is a mutation-based fuzzing technique that automatically generates effective jailbreaking templates to test LLM robustness, achieving over 95% attack success rates against models like GPT-4o while helping improve their defenses.

Authors:Xuemeng Song, Haoqiang Lin, Haokun Wen, Bohan Hou, Mingzhu Xu, Liqiang Nie
Title: A Comprehensive Survey on Composed Image Retrieval
Abstract:
Composed Image Retrieval (CIR) is an emerging yet challenging task that allows users to search for target images using a multimodal query, comprising a reference image and a modification text specifying the user's desired changes to the reference image. Given its significant academic and practical value, CIR has become a rapidly growing area of interest in the computer vision and machine learning communities, particularly with the advances in deep learning. To the best of our knowledge, there is currently no comprehensive review of CIR to provide a timely overview of this field. Therefore, we synthesize insights from over 120 publications in top conferences and journals, including ACM TOIS, SIGIR, and CVPR In particular, we systematically categorize existing supervised CIR and zero-shot CIR models using a fine-grained taxonomy. For a comprehensive review, we also briefly discuss approaches for tasks closely related to CIR, such as attribute-based CIR and dialog-based CIR. Additionally, we summarize benchmark datasets for evaluation and analyze existing supervised and zero-shot CIR methods by comparing experimental results across multiple datasets. Furthermore, we present promising future directions in this field, offering practical insights for researchers interested in further exploration. The curated collection of related works is maintained and continuously updated in https://github.com/haokunwen/Awesome-Composed-Image-Retrieval.
中文: 组合图像检索(CIR)是一项结合参考图像和修改文本进行目标图像搜索的多模态任务,本综述综合了120多篇文献,系统分类模型、评估基准并展望未来方向。
English: Composed Image Retrieval (CIR) is a challenging multimodal task that combines a reference image and modification text to search for target images, and this review synthesizes over 120 publications to systematically categorize models, evaluate benchmarks, and outline future directions.

Authors:Mira Adra, Simone Melcarne, Nelida Mirabet-Herranz, Jean-Luc Dugelay
Title: Event-based Solutions for Human-centered Applications: A Comprehensive Review
Abstract:
Event cameras, often referred to as dynamic vision sensors, are groundbreaking sensors capable of capturing changes in light intensity asynchronously, offering exceptional temporal resolution and energy efficiency. These attributes make them particularly suited for human-centered applications, as they capture both the most intricate details of facial expressions and the complex motion dynamics of the human body. Despite growing interest, research in human-centered applications of event cameras remains scattered, with no comprehensive overview encompassing both body and face tasks. This survey bridges that gap by being the first to unify these domains, presenting an extensive review of advancements, challenges, and opportunities. We also examine less-explored areas, including event compression techniques and simulation frameworks, which are essential for the broader adoption of event cameras. This survey is designed to serve as a foundational reference that helps both new and experienced researchers understand the current state of the field and identify promising directions for future work in human-centered event camera applications. A summary of this survey can be found at https://github.com/nmirabeth/event_human
中文摘要:本综述首次统一了事件相机在人体和面部分析中的人本应用研究,系统梳理了该领域的发展现状、挑战及未来方向。
English Summary: This survey provides the first unified overview of event camera applications in human-centered tasks, addressing advancements, challenges, and future directions for both body and face analysis.

Authors:Yafei Ou, Mahdi Tavakoli
Title: CRESSim-MPM: A Material Point Method Library for Surgical Soft Body Simulation with Cutting and Suturing
Abstract:
A number of recent studies have focused on developing surgical simulation platforms to train machine learning (ML) agents or models with synthetic data for surgical assistance. While existing platforms excel at tasks such as rigid body manipulation and soft body deformation, they struggle to simulate more complex soft body behaviors like cutting and suturing. A key challenge lies in modeling soft body fracture and splitting using the finite-element method (FEM), which is the predominant approach in current platforms. Additionally, the two-way suture needle/thread contact inside a soft body is further complicated when using FEM. In this work, we use the material point method (MPM) for such challenging simulations and propose new rigid geometries and soft-rigid contact methods specifically designed for them. We introduce CRESSim-MPM, a GPU-accelerated MPM library that integrates multiple MPM solvers and incorporates surgical geometries for cutting and suturing, serving as a specialized physics engine for surgical applications. It is further integrated into Unity, requiring minimal modifications to existing projects for soft body simulation. We demonstrate the simulator's capabilities in real-time simulation of cutting and suturing on soft tissue and provide an initial performance evaluation of different MPM solvers when simulating varying numbers of particles. The source code is available at https://github.com/yafei-ou/CRESSim-MPM.
中文摘要:本研究开发了CRESSim-MPM物理引擎,采用物质点方法突破现有平台在模拟切割缝合等复杂手术操作时的局限,通过GPU加速实现软组织实时仿真,并集成至Unity平台便于应用。
English Summary: This study introduces CRESSim-MPM, a GPU-accelerated physics library using the material point method to overcome limitations in simulating complex surgical procedures like cutting and suturing, integrated into Unity for real-time soft tissue simulation.

Authors:Yizhe Zhang, Richard Bai, Zijin Gu, Ruixiang Zhang, Jiatao Gu, Emmanuel Abbe, Samy Bengio, Navdeep Jaitly
Title: What Makes the Preferred Thinking Direction for LLMs in Multiple-choice Questions?
Abstract:
Language models usually use left-to-right (L2R) autoregressive factorization. However, L2R factorization may not always be the best inductive bias. Therefore, we investigate whether alternative factorizations of the text distribution could be beneficial in some tasks. We investigate right-to-left (R2L) training as a compelling alternative, focusing on multiple-choice questions (MCQs) as a test bed for knowledge extraction and reasoning. Through extensive experiments across various model sizes (2B-8B parameters) and training datasets, we find that R2L models can significantly outperform L2R models on several MCQ benchmarks, including logical reasoning, commonsense understanding, and truthfulness assessment tasks. Our analysis reveals that this performance difference may be fundamentally linked to multiple factors including calibration, computability, and directional conditional entropy. We analyze the impact of these factors through controlled simulation studies using arithmetic tasks, where the impacting factors can be better disentangled. Our work demonstrates that exploring alternative factorizations of the text distribution can lead to improvements in LLM capabilities and provides theoretical insights into optimal factorization towards approximating human language distribution, and when each reasoning order might be more advantageous. Our code and checkpoints are released at https://github.com/apple/ml-reversal-blessing.
中文摘要:本研究表明,在多项选择题推理任务中,从右向左训练的语言模型优于传统的从左向右模型,揭示了性能与校准度和条件熵的关联,并为文本分布的最佳因子化提供了理论依据。
English Summary: This study demonstrates that right-to-left (R2L) trained language models outperform traditional left-to-right (L2R) models on multiple-choice reasoning tasks, revealing performance links to calibration and conditional entropy while providing theoretical insights into optimal text factorization.

Authors:Alexander Groshev, Anastasiia Iashchenko, Pavel Paramonov, Denis Dimitrov, Andrey Kuznetsov
Title: GHOST 2.0: generative high-fidelity one shot transfer of heads
Abstract:
While the task of face swapping has recently gained attention in the research community, a related problem of head swapping remains largely unexplored. In addition to skin color transfer, head swap poses extra challenges, such as the need to preserve structural information of the whole head during synthesis and inpaint gaps between swapped head and background. In this paper, we address these concerns with GHOST 2.0, which consists of two problem-specific modules. First, we introduce enhanced Aligner model for head reenactment, which preserves identity information at multiple scales and is robust to extreme pose variations. Secondly, we use a Blender module that seamlessly integrates the reenacted head into the target background by transferring skin color and inpainting mismatched regions. Both modules outperform the baselines on the corresponding tasks, allowing to achieve state of the art results in head swapping. We also tackle complex cases, such as large difference in hair styles of source and target. Code is available at https://github.com/ai-forever/ghost-2.0
Chinese: GHOST 2.0通过增强的Aligner模块实现鲁棒头部重演,结合Blender模块无缝融合头部与背景,在头部替换任务中达到领先水平,并能有效处理姿态变化和发型差异等复杂情况。
English: GHOST 2.0 introduces an enhanced Aligner for robust head reenactment and a Blender for seamless integration, achieving state-of-the-art head swapping results while handling challenges like pose variations and hairstyle differences.

Authors:Henry Peng Zou, Siffi Singh, Yi Nian, Jianfeng He, Jason Cai, Saab Mansour, Hang Su
Title: GLEAN: Generalized Category Discovery with Diverse and Quality-Enhanced LLM Feedback
Abstract:
Generalized Category Discovery (GCD) is a practical and challenging open-world task that aims to recognize both known and novel categories in unlabeled data using limited labeled data from known categories. Due to the lack of supervision, previous GCD methods face significant challenges, such as difficulty in rectifying errors for confusing instances, and inability to effectively uncover and leverage the semantic meanings of discovered clusters. Therefore, additional annotations are usually required for real-world applicability. However, human annotation is extremely costly and inefficient. To address these issues, we propose GLEAN, a unified framework for generalized category discovery that actively learns from diverse and quality-enhanced LLM feedback. Our approach leverages three different types of LLM feedback to: (1) improve instance-level contrastive features, (2) generate category descriptions, and (3) align uncertain instances with LLM-selected category descriptions. Extensive experiments demonstrate the superior performance of \MethodName over state-of-the-art models across diverse datasets, metrics, and supervision settings. Our code is available at https://github.com/amazon-science/Glean.
Chinese: GLEAN是一个统一框架,通过主动学习多样化和质量增强的大型语言模型反馈,解决在未标记数据中识别已知和未知类别的挑战。
English: GLEAN is a unified framework for generalized category discovery that actively learns from diverse and quality-enhanced LLM feedback to address challenges in recognizing both known and novel categories in unlabeled data.

Authors:Xiangyu Zhao, Shengyuan Ding, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, Haodong Duan, Hua Yang, Kai Chen
Title: OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Abstract:
Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs' alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs' alignment with human values. Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or enhancing performance on standard VQA benchmarks, preserving their fundamental capabilities. Our datasets, benchmark, code and checkpoints have been released at https://github.com/PhoenixZ810/OmniAlign-V.
中文: 本文提出了包含20万高质量样本的OmniAlign-V数据集和人工标注的MM-AlignBench基准,通过微调方法在保持多模态大语言模型基础能力的同时,显著提升了其与人类偏好的对齐程度。
English: This paper introduces OmniAlign-V, a 200K-sample dataset, and MM-AlignBench, a human-annotated benchmark, to enhance multi-modal large language models' alignment with human preferences while preserving their foundational capabilities through fine-tuning methods.

Authors:Ahmed Elhady, Eneko Agirre, Mikel Artetxe
Title: WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging
Abstract:
We introduce WiCkeD, a simple method to increase the complexity of existing multiple-choice benchmarks by randomly replacing a choice with "None of the above", a method often used in educational tests. We show that WiCkeD can be automatically applied to any existing benchmark, making it more challenging. We apply WiCkeD to 6 popular benchmarks and use it to evaluate 18 open-weight LLMs. The performance of the models drops 12.1 points on average with respect to the original versions of the datasets. When using chain-of-thought on 3 MMLU datasets, the performance drop for the WiCkeD variant is similar to the one observed when using the LLMs directly, showing that WiCkeD is also challenging for models with enhanced reasoning abilities. WiCkeD also uncovers that some models are more sensitive to the extra reasoning required, providing additional information with respect to the original benchmarks. We relase our code and data at https://github.com/ahmedselhady/wicked-benchmarks.
中文: WiCkeD通过引入“以上都不是”选项来增强多项选择题库的复杂性,显著降低了模型表现,并揭示了不同大语言模型在推理能力上的差异敏感性。
English: WiCkeD enhances the complexity of multiple-choice benchmarks by adding "None of the above" options, significantly reducing model performance and revealing varied reasoning sensitivities across LLMs.

Authors:Jianhao Yan, Yun Luo, Yue Zhang
Title: RefuteBench 2.0 -- Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction
Abstract:
In the multi-turn interaction schema, large language models (LLMs) can leverage user feedback to enhance the quality and relevance of their responses. However, evaluating an LLM's ability to incorporate user refutation feedback is crucial yet challenging. In this study, we introduce RefuteBench 2.0, which significantly extends the original RefuteBench by incorporating LLM agents as refuters and evaluators, which allows for flexible and comprehensive assessment. We design both transient and persistent refutation instructions with different validity periods. Meta-evaluation shows that the LLM-based refuter could generate more human-like refutations and the evaluators could assign scores with high correlation with humans. Experimental results of various LLMs show that current models could effectively satisfy the refutation but fail to memorize the refutation information. Interestingly, we also observe that the performance of the initial task decreases as the refutations increase. Analysis of the attention scores further shows a potential weakness of current LLMs: they struggle to retain and correctly use previous information during long context dialogues. https://github.com/ElliottYan/RefuteBench-2.0
中文摘要:RefuteBench 2.0通过引入LLM代理作为反驳者和评估者,系统评估语言模型整合用户反馈的能力,发现现有模型虽能应对反驳,但在长对话中难以有效保持和运用这些信息。
English Summary: RefuteBench 2.0 introduces LLM agents as refuters and evaluators to assess how well language models incorporate user feedback, revealing that while current models address refutations, they struggle with retaining this information during extended dialogues.

Authors:Zenghui Chang, Yiqiao Zhang, Hong Cai Chen
Title: Neural Network Graph Similarity Computation Based on Graph Fusion
Abstract:
Graph similarity learning, crucial for tasks such as graph classification and similarity search, focuses on measuring the similarity between two graph-structured entities. The core challenge in this field is effectively managing the interactions between graphs. Traditional methods often entail separate, redundant computations for each graph pair, leading to unnecessary complexity. This paper revolutionizes the approach by introducing a parallel graph interaction method called graph fusion. By merging the node sequences of graph pairs into a single large graph, our method leverages a global attention mechanism to facilitate interaction computations and to harvest cross-graph insights. We further assess the similarity between graph pairs at two distinct levels-graph-level and node-level-introducing two innovative, yet straightforward, similarity computation algorithms. Extensive testing across five public datasets shows that our model not only outperforms leading baseline models in graph-to-graph classification and regression tasks but also sets a new benchmark for performance and efficiency. The code for this paper is open-source and available at https://github.com/LLiRarry/GFM-code.git
Chinese Summary: 本文提出了一种名为图融合的并行图交互方法,通过将节点序列合并为单一图并利用全局注意力机制进行交互计算,同时在图和节点两个层面评估相似性,在多项任务中实现了卓越的性能和效率突破。
English Summary: This paper introduces a parallel graph interaction method called graph fusion, which merges node sequences into a single graph to enable efficient cross-graph insights through global attention and dual-level similarity computation, achieving superior performance and efficiency in graph classification and regression tasks.

Authors:Jun Zeng, Debesh Jha, Ertugrul Aktas, Elif Keles, Alpay Medetalibeyoglu, Matthew Antalek, Robert Lewandowski, Daniela Ladner, Amir A. Borhani, Gorkem Durak, Ulas Bagci
Title: A Reverse Mamba Attention Network for Pathological Liver Segmentation
Abstract:
We present RMA-Mamba, a novel architecture that advances the capabilities of vision state space models through a specialized reverse mamba attention module (RMA). The key innovation lies in RMA-Mamba's ability to capture long-range dependencies while maintaining precise local feature representation through its hierarchical processing pipeline. By integrating Vision Mamba (VMamba)'s efficient sequence modeling with RMA's targeted feature refinement, our architecture achieves superior feature learning across multiple scales. This dual-mechanism approach enables robust handling of complex morphological patterns while maintaining computational efficiency. We demonstrate RMA-Mamba's effectiveness in the challenging domain of pathological liver segmentation (from both CT and MRI), where traditional segmentation approaches often fail due to tissue variations. When evaluated on a newly introduced cirrhotic liver dataset (CirrMRI600+) of T2-weighted MRI scans, RMA-Mamba achieves the state-of-the-art performance with a Dice coefficient of 92.08%, mean IoU of 87.36%, and recall of 92.96%. The architecture's generalizability is further validated on the cancerous liver segmentation from CT scans (LiTS: Liver Tumor Segmentation dataset), yielding a Dice score of 92.9% and mIoU of 88.99%. Our code is available for public: https://github.com/JunZengz/RMAMamba.
中文: RMA-Mamba通过反向曼巴注意力模块提升视觉状态空间模型,能捕捉长程依赖并优化局部特征,在CT和MRI的病理肝脏分割任务中实现了最先进的性能。
English: RMA-Mamba introduces a reverse mamba attention module to enhance vision state space models by capturing long-range dependencies and refining local features, achieving state-of-the-art performance in pathological liver segmentation from CT and MRI scans.

Authors:Jun Zeng, Debesh Jha, Ertugrul Aktas, Elif Keles, Alpay Medetalibeyoglu, Matthew Antalek, Federica Proietto Salanitri, Amir A. Borhani, Daniela P. Ladner, Gorkem Durak, Ulas Bagci
Title: Liver Cirrhosis Stage Estimation from MRI with Deep Learning
Abstract:
We present an end-to-end deep learning framework for automated liver cirrhosis stage estimation from multi-sequence MRI. Cirrhosis is the severe scarring (fibrosis) of the liver and a common endpoint of various chronic liver diseases. Early diagnosis is vital to prevent complications such as decompensation and cancer, which significantly decreases life expectancy. However, diagnosing cirrhosis in its early stages is challenging, and patients often present with life-threatening complications. Our approach integrates multi-scale feature learning with sequence-specific attention mechanisms to capture subtle tissue variations across cirrhosis progression stages. Using CirrMRI600+, a large-scale publicly available dataset of 628 high-resolution MRI scans from 339 patients, we demonstrate state-of-the-art performance in three-stage cirrhosis classification. Our best model achieves 72.8% accuracy on T1W and 63.8% on T2W sequences, significantly outperforming traditional radiomics-based approaches. Through extensive ablation studies, we show that our architecture effectively learns stage-specific imaging biomarkers. We establish new benchmarks for automated cirrhosis staging and provide insights for developing clinically applicable deep learning systems. The source code will be available at https://github.com/JunZengz/CirrhosisStage.
中文: 我们提出了一种端到端的深度学习框架,通过多尺度特征学习和序列特异性注意力机制,在大规模MRI数据集上实现了肝硬化自动分期的先进性能。
English: We introduce an end-to-end deep learning framework that achieves state-of-the-art performance in automated liver cirrhosis staging from multi-sequence MRI, using multi-scale feature learning and sequence-specific attention mechanisms on a large-scale dataset.

Authors:He Wang, Tianyang Xu, Zhangyong Tang, Xiao-Jun Wu, Josef Kittler
Title: UASTrack: A Unified Adaptive Selection Framework with Modality-Customization in Single Object Tracking
Abstract:
Multi-modal tracking is essential in single-object tracking (SOT), as different sensor types contribute unique capabilities to overcome challenges caused by variations in object appearance. However, existing unified RGB-X trackers (X represents depth, event, or thermal modality) either rely on the task-specific training strategy for individual RGB-X image pairs or fail to address the critical importance of modality-adaptive perception in real-world applications. In this work, we propose UASTrack, a unified adaptive selection framework that facilitates both model and parameter unification, as well as adaptive modality discrimination across various multi-modal tracking tasks. To achieve modality-adaptive perception in joint RGB-X pairs, we design a Discriminative Auto-Selector (DAS) capable of identifying modality labels, thereby distinguishing the data distributions of auxiliary modalities. Furthermore, we propose a Task-Customized Optimization Adapter (TCOA) tailored to various modalities in the latent space. This strategy effectively filters noise redundancy and mitigates background interference based on the specific characteristics of each modality. Extensive comparisons conducted on five benchmarks including LasHeR, GTOT, RGBT234, VisEvent, and DepthTrack, covering RGB-T, RGB-E, and RGB-D tracking scenarios, demonstrate our innovative approach achieves comparative performance by introducing only additional training parameters of 1.87M and flops of 1.95G. The code will be available at https://github.com/wanghe/UASTrack.
中文: UASTrack提出了一种统一的自适应选择框架,通过判别性自动选择器和任务定制优化适配器实现模态自适应感知,在多个RGB-X跟踪基准上以少量额外参数实现了优越性能。
English: UASTrack introduces a unified adaptive selection framework that achieves modality-adaptive perception through a Discriminative Auto-Selector and Task-Customized Optimization Adapter, delivering competitive performance across multiple RGB-X tracking benchmarks with minimal additional parameters.

Authors:Botao Ye, Sifei Liu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang
Title: Synthesizing Consistent Novel Views via 3D Epipolar Attention without Re-Training
Abstract:
Large diffusion models demonstrate remarkable zero-shot capabilities in novel view synthesis from a single image. However, these models often face challenges in maintaining consistency across novel and reference views. A crucial factor leading to this issue is the limited utilization of contextual information from reference views. Specifically, when there is an overlap in the viewing frustum between two views, it is essential to ensure that the corresponding regions maintain consistency in both geometry and appearance. This observation leads to a simple yet effective approach, where we propose to use epipolar geometry to locate and retrieve overlapping information from the input view. This information is then incorporated into the generation of target views, eliminating the need for training or fine-tuning, as the process requires no learnable parameters. Furthermore, to enhance the overall consistency of generated views, we extend the utilization of epipolar attention to a multi-view setting, allowing retrieval of overlapping information from the input view and other target views. Qualitative and quantitative experimental results demonstrate the effectiveness of our method in significantly improving the consistency of synthesized views without the need for any fine-tuning. Moreover, This enhancement also boosts the performance of downstream applications such as 3D reconstruction. The code is available at https://github.com/botaoye/ConsisSyn.
中文摘要:本研究提出一种新方法,利用极线几何从参考视图中检索重叠信息,以增强零样本新视角合成的一致性,无需训练或微调。
English Summary: This study introduces a novel method that leverages epipolar geometry to enhance consistency in zero-shot novel view synthesis by retrieving overlapping information from reference views, eliminating the need for training or fine-tuning.

Authors:Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji
Title: VCT: Training Consistency Models with Variational Noise Coupling
Abstract:
Consistency Training (CT) has recently emerged as a strong alternative to diffusion models for image generation. However, non-distillation CT often suffers from high variance and instability, motivating ongoing research into its training dynamics. We propose Variational Consistency Training (VCT), a flexible and effective framework compatible with various forward kernels, including those in flow matching. Its key innovation is a learned noise-data coupling scheme inspired by Variational Autoencoders, where a data-dependent encoder models noise emission. This enables VCT to adaptively learn noise-todata pairings, reducing training variance relative to the fixed, unsorted pairings in classical CT. Experiments on multiple image datasets demonstrate significant improvements: our method surpasses baselines, achieves state-of-the-art FID among non-distillation CT approaches on CIFAR-10, and matches SoTA performance on ImageNet 64 x 64 with only two sampling steps. Code is available at https://github.com/sony/vct.
中文: 变分一致性训练(VCT)通过学习噪声与数据的自适应配对机制,有效降低了训练方差,在图像生成任务中仅需两步采样即可达到顶尖性能。
English: Variational Consistency Training (VCT) introduces a learned noise-data coupling scheme to reduce training variance and instability, achieving state-of-the-art results on image generation tasks with minimal sampling steps.

Authors:Anh-Khoa Nguyen Vu, Quoc-Truong Truong, Vinh-Tiep Nguyen, Thanh Duc Ngo, Thanh-Toan Do, Tam V. Nguyen
Title: Multi-Perspective Data Augmentation for Few-shot Object Detection
Abstract:
Recent few-shot object detection (FSOD) methods have focused on augmenting synthetic samples for novel classes, show promising results to the rise of diffusion models. However, the diversity of such datasets is often limited in representativeness because they lack awareness of typical and hard samples, especially in the context of foreground and background relationships. To tackle this issue, we propose a Multi-Perspective Data Augmentation (MPAD) framework. In terms of foreground-foreground relationships, we propose in-context learning for object synthesis (ICOS) with bounding box adjustments to enhance the detail and spatial information of synthetic samples. Inspired by the large margin principle, support samples play a vital role in defining class boundaries. Therefore, we design a Harmonic Prompt Aggregation Scheduler (HPAS) to mix prompt embeddings at each time step of the generation process in diffusion models, producing hard novel samples. For foreground-background relationships, we introduce a Background Proposal method (BAP) to sample typical and hard backgrounds. Extensive experiments on multiple FSOD benchmarks demonstrate the effectiveness of our approach. Our framework significantly outperforms traditional methods, achieving an average increase of $17.5\%$ in nAP50 over the baseline on PASCAL VOC. Code is available at https://github.com/nvakhoa/MPAD.
Chinese: 提出的多视角数据增强(MPAD)框架通过前景调整、困难样本合成和背景优化生成多样且具挑战性的合成样本,显著提升了少样本目标检测性能,在PASCAL VOC数据集上相比基线方法nAP50指标平均提升17.5%。
English: The proposed Multi-Perspective Data Augmentation (MPAD) framework enhances few-shot object detection by generating diverse and challenging synthetic samples through foreground adjustments, hard sample synthesis, and background optimization, achieving a 17.5% nAP50 improvement over baselines on PASCAL VOC.

Authors:Adnan Iltaf, Rayan Merghani Ahmed, Zhenxi Zhang, Bin Li, Shoujun Zhou
Title: VesselSAM: Leveraging SAM for Aortic Vessel Segmentation with AtrousLoRA
Abstract:
Medical image segmentation is crucial for clinical diagnosis and treatment planning, especially when dealing with complex anatomical structures such as vessels. However, accurately segmenting vessels remains challenging due to their small size, intricate edge structures, and susceptibility to artifacts and imaging noise. In this work, we propose VesselSAM, an enhanced version of the Segment Anything Model (SAM), specifically tailored for aortic vessel segmentation. VesselSAM incorporates AtrousLoRA, a novel module integrating Atrous Attention and Low-Rank Adaptation (LoRA), to enhance segmentation performance. Atrous Attention enables the model to capture multi-scale contextual information, preserving both fine-grained local details and broader global context. Additionally, LoRA facilitates efficient fine-tuning of the frozen SAM image encoder, reducing the number of trainable parameters and thereby enhancing computational efficiency. We evaluate VesselSAM using two challenging datasets: the Aortic Vessel Tree (AVT) dataset and the Type-B Aortic Dissection (TBAD) dataset. VesselSAM achieves state-of-the-art performance, attaining DSC scores of 93.50\%, 93.25\%, 93.02\%, and 93.26\% across multi-center datasets. Our results demonstrate that VesselSAM delivers high segmentation accuracy while significantly reducing computational overhead compared to existing large-scale models. This development paves the way for enhanced AI-based aortic vessel segmentation in clinical environments. The code and models will be released at https://github.com/Adnan-CAS/AtrousLora.
中文: VesselSAM通过集成AtrousLoRA模块改进了Segment Anything模型,在主动脉血管分割中实现了顶尖的精度和计算效率,适用于多中心临床数据。
English: VesselSAM enhances the Segment Anything Model with AtrousLoRA for superior aortic vessel segmentation, achieving state-of-the-art accuracy and computational efficiency on multi-center datasets.

Authors:Gaye Colakoglu, Gürkan Solmaz, Jonathan Fürst
Title: Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs
Abstract:
This paper defines and explores the design space for information extraction (IE) from layout-rich documents using large language models (LLMs). The three core challenges of layout-aware IE with LLMs are 1) data structuring, 2) model engagement, and 3) output refinement. Our study investigates the sub-problems and methods within these core challenges, such as input representation, chunking, prompting, selection of LLMs, and multimodal models. It examines the effect of different design choices through LayIE-LLM, a new, open-source, layout-aware IE test suite, benchmarking against traditional, fine-tuned IE models. The results on two IE datasets show that LLMs require adjustment of the IE pipeline to achieve competitive performance: the optimized configuration found with LayIE-LLM achieves 13.3--37.5 F1 points more than a general-practice baseline configuration using the same LLM. To find a well-working configuration, we develop a one-factor-at-a-time (OFAT) method that achieves near-optimal results. Our method is only 0.8--1.8 points lower than the best full factorial exploration with a fraction (2.8%) of the required computation. Overall, we demonstrate that, if well-configured, general-purpose LLMs match the performance of specialized models, providing a cost-effective, finetuning-free alternative. Our test-suite is available at https://github.com/gayecolakoglu/LayIE-LLM.
中文: 本研究探讨了利用大型语言模型进行布局感知信息提取的设计空间,通过测试套件证明优化配置无需微调即可媲美专用模型的性能。
English: This study explores the design space for layout-aware information extraction using large language models, introducing a test suite that demonstrates optimized configurations can match specialized model performance without fine-tuning.

Authors:Mingkun Zhang, Keping Bi, Wei Chen, Jiafeng Guo, Xueqi Cheng
Title: CLIPure: Purification in Latent Space via CLIP for Adversarially Robust Zero-Shot Classification
Abstract:
In this paper, we aim to build an adversarially robust zero-shot image classifier. We ground our work on CLIP, a vision-language pre-trained encoder model that can perform zero-shot classification by matching an image with text prompts ``a photo of a .''. Purification is the path we choose since it does not require adversarial training on specific attack types and thus can cope with any foreseen attacks. We then formulate purification risk as the KL divergence between the joint distributions of the purification process of denoising the adversarial samples and the attack process of adding perturbations to benign samples, through bidirectional Stochastic Differential Equations (SDEs). The final derived results inspire us to explore purification in the multi-modal latent space of CLIP. We propose two variants for our CLIPure approach: CLIPure-Diff which models the likelihood of images' latent vectors with the DiffusionPrior module in DaLLE-2 (modeling the generation process of CLIP's latent vectors), and CLIPure-Cos which models the likelihood with the cosine similarity between the embeddings of an image and ``a photo of a.''. As far as we know, CLIPure is the first purification method in multi-modal latent space and CLIPure-Cos is the first purification method that is not based on generative models, which substantially improves defense efficiency. We conducted extensive experiments on CIFAR-10, ImageNet, and 13 datasets that previous CLIP-based defense methods used for evaluating zero-shot classification robustness. Results show that CLIPure boosts the SOTA robustness by a large margin, e.g., from 71.7% to 91.1% on CIFAR10, from 59.6% to 72.6% on ImageNet, and 108% relative improvements of average robustness on the 13 datasets over previous SOTA. The code is available at https://github.com/TMLResearchGroup-CAS/CLIPure.
本研究提出了CLIPure,一种在CLIP多模态潜在空间中运行的零样本图像分类器对抗净化新方法,无需依赖生成模型即可显著提升防御鲁棒性。
This paper introduces CLIPure, a novel adversarial purification method for zero-shot image classifiers that operates in CLIP's multi-modal latent space, significantly boosting robustness without relying on generative models.

Authors:Yunfeng Li, Bo Wang, Ye Li
Title: LightFC-X: Lightweight Convolutional Tracker for RGB-X Tracking
Abstract:
Despite great progress in multimodal tracking, these trackers remain too heavy and expensive for resource-constrained devices. To alleviate this problem, we propose LightFC-X, a family of lightweight convolutional RGB-X trackers that explores a unified convolutional architecture for lightweight multimodal tracking. Our core idea is to achieve lightweight cross-modal modeling and joint refinement of the multimodal features and the spatiotemporal appearance features of the target. Specifically, we propose a novel efficient cross-attention module (ECAM) and a novel spatiotemporal template aggregation module (STAM). The ECAM achieves lightweight cross-modal interaction of template-search area integrated feature with only 0.08M parameters. The STAM enhances the model's utilization of temporal information through module fine-tuning paradigm. Comprehensive experiments show that our LightFC-X achieves state-of-the-art performance and the optimal balance between parameters, performance, and speed. For example, LightFC-T-ST outperforms CMD by 4.3% and 5.7% in SR and PR on the LasHeR benchmark, which it achieves 2.6x reduction in parameters and 2.7x speedup. It runs in real-time on the CPU at a speed of 22 fps. The code is available at https://github.com/LiYunfengLYF/LightFC-X.
中文:LightFC-X提出了一种轻量级多模态跟踪框架,通过高效的跨模态注意力和时空模块,在减少参数的同时实现了顶尖性能,并能在CPU上实时运行。
English: LightFC-X introduces a lightweight multimodal tracking framework with efficient cross-attention and spatiotemporal modules, achieving state-of-the-art performance with reduced parameters and real-time CPU speed.

Authors:Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen
Title: SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference
Abstract:
An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The codes are available at https://github.com/thu-ml/SpargeAttn.
中文: SpargeAttn提出了一种通用的稀疏量化注意力机制,通过两阶段在线过滤跳过冗余计算,在加速语言、图像和视频生成等多种模型的同时保持端到端性能不损失。
English: SpargeAttn introduces a universal sparse and quantized attention mechanism that accelerates diverse models across language, image, and video generation without compromising performance by employing a two-stage online filter to skip unnecessary computations.

Authors:Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen
Title: SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference
Abstract:
An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The codes are available at https://github.com/thu-ml/SpargeAttn.
中文: SpargeAttn提出了一种通用的稀疏量化注意力机制,通过两阶段在线过滤跳过冗余计算,在加速语言、图像和视频生成等多种模型的同时保持端到端性能不损失。
English: SpargeAttn introduces a universal sparse and quantized attention mechanism that accelerates diverse models across language, image, and video generation without compromising performance by employing a two-stage online filter to skip unnecessary computations.

Authors:Laura Perez-Beltrachini, Mirella Lapata
Title: Uncertainty Quantification in Retrieval Augmented Question Answering
Abstract:
Retrieval augmented Question Answering (QA) helps QA models overcome knowledge gaps by incorporating retrieved evidence, typically a set of passages, alongside the question at test time. Previous studies show that this approach improves QA performance and reduces hallucinations, without, however, assessing whether the retrieved passages are indeed useful at answering correctly. In this work, we propose to quantify the uncertainty of a QA model via estimating the utility of the passages it is provided with. We train a lightweight neural model to predict passage utility for a target QA model and show that while simple information theoretic metrics can predict answer correctness up to a certain extent, our approach efficiently approximates or outperforms more expensive sampling-based methods. Code and data are available at https://github.com/lauhaide/ragu.
中文摘要:本研究提出了一种通过预测检索段落效用来量化检索增强问答中不确定性的方法,证明轻量级神经网络模型能有效评估答案正确性,并达到或超越昂贵采样方法的性能。
English Summary: This research introduces a method to quantify uncertainty in retrieval-augmented question answering by predicting the utility of retrieved passages, demonstrating that a lightweight neural model effectively estimates answer correctness and matches or surpasses costly sampling-based approaches.

Authors:Han Nie, Bin Luo, Jun Liu, Zhitao Fu, Huan Zhou, Shuo Zhang, Weixing Liu
Title: PromptMID: Modal Invariant Descriptors Based on Diffusion and Vision Foundation Models for Optical-SAR Image Matching
Abstract:
The ideal goal of image matching is to achieve stable and efficient performance in unseen domains. However, many existing learning-based optical-SAR image matching methods, despite their effectiveness in specific scenarios, exhibit limited generalization and struggle to adapt to practical applications. Repeatedly training or fine-tuning matching models to address domain differences is not only not elegant enough but also introduces additional computational overhead and data production costs. In recent years, general foundation models have shown great potential for enhancing generalization. However, the disparity in visual domains between natural and remote sensing images poses challenges for their direct application. Therefore, effectively leveraging foundation models to improve the generalization of optical-SAR image matching remains challenge. To address the above challenges, we propose PromptMID, a novel approach that constructs modality-invariant descriptors using text prompts based on land use classification as priors information for optical and SAR image matching. PromptMID extracts multi-scale modality-invariant features by leveraging pre-trained diffusion models and visual foundation models (VFMs), while specially designed feature aggregation modules effectively fuse features across different granularities. Extensive experiments on optical-SAR image datasets from four diverse regions demonstrate that PromptMID outperforms state-of-the-art matching methods, achieving superior results in both seen and unseen domains and exhibiting strong cross-domain generalization capabilities. The source code will be made publicly available https://github.com/HanNieWHU/PromptMID.
Chinese: PromptMID是一种创新方法,通过利用文本提示和预训练模型构建模态不变描述符来提升光学-SAR图像匹配效果,在多个区域实验中展现出卓越的跨域泛化能力,显著优于现有先进技术。
English: PromptMID is a novel method that enhances optical-SAR image matching by using text prompts and pre-trained models to create modality-invariant descriptors, achieving superior cross-domain generalization and outperforming existing techniques in diverse regions.

Authors:Cao Yuxuan, Wu Jiayang, Alistair Cheong Liang Chuen, Bryan Shan Guanrong, Theodore Lee Chong Jen, Sherman Chann Zhi Shen
Title: Detecting Offensive Memes with Social Biases in Singapore Context Using Multimodal Large Language Models
Abstract:
Traditional online content moderation systems struggle to classify modern multimodal means of communication, such as memes, a highly nuanced and information-dense medium. This task is especially hard in a culturally diverse society like Singapore, where low-resource languages are used and extensive knowledge on local context is needed to interpret online content. We curate a large collection of 112K memes labeled by GPT-4V for fine-tuning a VLM to classify offensive memes in Singapore context. We show the effectiveness of fine-tuned VLMs on our dataset, and propose a pipeline containing OCR, translation and a 7-billion parameter-class VLM. Our solutions reach 80.62% accuracy and 0.8192 AUROC on a held-out test set, and can greatly aid human in moderating online contents. The dataset, code, and model weights have been open-sourced at https://github.com/aliencaocao/vlm-for-memes-aisg.
中文摘要:传统内容审核系统难以处理如表情包这类多模态内容,尤其是在文化多元的新加坡,但通过在大规模数据集上微调视觉语言模型,对冒犯性表情包的分类准确率达到了80.62%。
English Summary: Traditional content moderation systems are ineffective for nuanced multimodal content like memes, especially in culturally diverse Singapore, but fine-tuning a vision-language model on a large dataset achieves 80.62% accuracy in classifying offensive memes.

Authors:Shengtian Mian, Ya Wang, Nannan Gu, Yuping Wang, Xiaoqing Li
Title: FwNet-ECA: A Classification Model Enhancing Window Attention with Global Receptive Fields via Fourier Filtering Operations
Abstract:
Windowed attention mechanisms were introduced to mitigate the issue of excessive computation inherent in global attention mechanisms. In this paper, we present FwNet-ECA, a novel method that utilizes Fourier transforms paired with learnable weight matrices to enhance the spectral features of images. This method establishes a global receptive field through Filter Enhancement and avoids the use of moving window attention. Additionally, we incorporate the Efficient Channel Attention (ECA) module to improve communication between different channels. Instead of relying on physically shifted windows, our approach leverages frequency domain enhancement to implicitly bridge information across spatial regions. We validate our model on the iCartoonFace dataset and conduct downstream tasks on ImageNet, demonstrating that our model achieves lower parameter counts and computational overheads compared to shifted window approaches, while maintaining competitive accuracy. Furthermore, our visualization operations clearly demonstrated that the Filter Enhancement technique achieves greater effectiveness in the model's shallow layers, where feature maps are relatively larger. This work offers a more efficient and effective alternative for leveraging attention mechanisms in visual processing tasks, alleviating the challenges associated with windowed attention models. Code is available at https://github.com/qingxiaoli/FwNet-ECA
中文: FwNet-ECA提出了一种结合傅里叶变换与可学习权重的方法来增强图像频谱特征,通过滤波器增强建立全局感受野,无需移动窗口注意力即可在降低参数量的同时保持竞争力准确率。
English: FwNet-ECA introduces a method using Fourier transforms and learnable weights to enhance image spectral features, establishing a global receptive field without windowed attention while reducing parameters and maintaining competitive accuracy on benchmark datasets.

Authors:Ankita Raj, Deepankar Varma, Chetan Arora
Title: Examining the Threat Landscape: Foundation Models and Model Stealing
Abstract:
Foundation models (FMs) for computer vision learn rich and robust representations, enabling their adaptation to task/domain-specific deployments with little to no fine-tuning. However, we posit that the very same strength can make applications based on FMs vulnerable to model stealing attacks. Through empirical analysis, we reveal that models fine-tuned from FMs harbor heightened susceptibility to model stealing, compared to conventional vision architectures like ResNets. We hypothesize that this behavior is due to the comprehensive encoding of visual patterns and features learned by FMs during pre-training, which are accessible to both the attacker and the victim. We report that an attacker is able to obtain 94.28% agreement (matched predictions with victim) for a Vision Transformer based victim model (ViT-L/16) trained on CIFAR-10 dataset, compared to only 73.20% agreement for a ResNet-18 victim, when using ViT-L/16 as the thief model. We arguably show, for the first time, that utilizing FMs for downstream tasks may not be the best choice for deployment in commercial APIs due to their susceptibility to model theft. We thereby alert model owners towards the associated security risks, and highlight the need for robust security measures to safeguard such models against theft. Code is available at https://github.com/rajankita/foundation_model_stealing.
中文: 计算机视觉基础模型虽然能轻松适应特定任务,但极易遭受模型窃取攻击,攻击者在微调模型上可获得高达94.28%的预测一致性,这对商业部署构成严重安全威胁。
English: Foundation models in computer vision, while enabling easy adaptation to specific tasks, are highly vulnerable to model stealing attacks, with attackers achieving up to 94.28% prediction agreement on fine-tuned models, highlighting significant security risks for commercial deployments.

Authors:Carlos Vélez García, Miguel Cazorla, Jorge Pomares
Title: Escaping The Big Data Paradigm in Self-Supervised Representation Learning
Abstract:
The reliance on large-scale datasets and extensive computational resources has become a major barrier to advancing representation learning in vision, especially in data-scarce domains. In this paper, we address the critical question: Can we escape the big data paradigm in self-supervised representation learning from images? We introduce SCOTT (Sparse Convolutional Tokenizer for Transformers), a shallow tokenization architecture that is compatible with Masked Image Modeling (MIM) tasks. SCOTT injects convolutional inductive biases into Vision Transformers (ViTs), enhancing their efficacy in small-scale data regimes. Alongside, we propose to train on a Joint-Embedding Predictive Architecture within a MIM framework (MIM-JEPA), operating in latent representation space to capture more semantic features. Our approach enables ViTs to be trained from scratch on datasets orders of magnitude smaller than traditionally required --without relying on massive external datasets for pretraining. We validate our method on three small-size, standard-resoultion, fine-grained datasets: Oxford Flowers-102, Oxford IIIT Pets-37, and ImageNet-100. Despite the challenges of limited data and high intra-class similarity, frozen SCOTT models pretrained with MIM-JEPA significantly outperform fully supervised methods and achieve competitive results with SOTA approaches that rely on large-scale pretraining, complex image augmentations and bigger model sizes. By demonstrating that robust off-the-shelf representations can be learned with limited data, compute, and model sizes, our work paves the way for computer applications in resource constrained environments such as medical imaging or robotics. Our findings challenge the prevailing notion that vast amounts of data are indispensable for effective representation learning in vision, offering a new pathway toward more accessible and inclusive advancements in the field.
中文: 本文提出的SCOTT架构与MIM-JEPA框架相结合,使视觉Transformer能在小规模数据集上无需大规模预训练即可学习到强大表征,突破了计算机视觉领域必须依赖大数据的传统范式。
English: This paper introduces SCOTT, a shallow tokenization architecture integrated with MIM-JEPA, enabling Vision Transformers to learn robust representations from small datasets without large-scale pretraining, challenging the necessity of big data in computer vision.

Authors:Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Ming Li, Likang Xiao, Dingqi Yang, Yikun Ban, Hailong Sun, Philip S. Yu
Title: Harnessing Multiple Large Language Models: A Survey on LLM Ensemble
Abstract:
LLM Ensemble -- which involves the comprehensive use of multiple large language models (LLMs), each aimed at handling user queries during downstream inference, to benefit from their individual strengths -- has gained substantial attention recently. The widespread availability of LLMs, coupled with their varying strengths and out-of-the-box usability, has profoundly advanced the field of LLM Ensemble. This paper presents the first systematic review of recent developments in LLM Ensemble. First, we introduce our taxonomy of LLM Ensemble and discuss several related research problems. Then, we provide a more in-depth classification of the methods under the broad categories of "ensemble-before-inference, ensemble-during-inference, ensemble-after-inference'', and review all relevant methods. Finally, we introduce related benchmarks and applications, summarize existing studies, and suggest several future research directions. A curated list of papers on LLM Ensemble is available at https://github.com/junchenzhi/Awesome-LLM-Ensemble.
中文: 本文首次系统综述了大语言模型集成方法,将其分类为推理前、推理中和推理后集成,并探讨了相关基准、应用及未来研究方向。
English: This paper provides the first systematic review of LLM Ensemble, categorizing methods into ensemble-before, during, and after-inference, and discusses benchmarks, applications, and future research directions.

Authors:Zhuo Chen, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinyu Geng, Pengjun Xie, Fei Huang, Kewei Tu
Title: Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference
Abstract:
Despite the advancements made in Vision Large Language Models (VLLMs), like text Large Language Models (LLMs), they have limitations in addressing questions that require real-time information or are knowledge-intensive. Indiscriminately adopting Retrieval Augmented Generation (RAG) techniques is an effective yet expensive way to enable models to answer queries beyond their knowledge scopes. To mitigate the dependence on retrieval and simultaneously maintain, or even improve, the performance benefits provided by retrieval, we propose a method to detect the knowledge boundary of VLLMs, allowing for more efficient use of techniques like RAG. Specifically, we propose a method with two variants that fine-tune a VLLM on an automatically constructed dataset for boundary identification. Experimental results on various types of Visual Question Answering datasets show that our method successfully depicts a VLLM's knowledge boundary, based on which we are able to reduce indiscriminate retrieval while maintaining or improving the performance. In addition, we show that the knowledge boundary identified by our method for one VLLM can be used as a surrogate boundary for other VLLMs. Code will be released at https://github.com/Chord-Chen-30/VLLM-KnowledgeBoundary
Chinese Summary: 本研究提出了一种识别视觉大语言模型知识边界的方法,通过选择性使用检索技术减少不必要的检索,同时在多种视觉问答任务中保持或提升性能。
English Summary: This study introduces a method to identify the knowledge boundaries of Vision Large Language Models (VLLMs), enabling selective use of retrieval techniques to reduce unnecessary retrievals while maintaining or enhancing performance across various Visual Question Answering tasks.

Authors:Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, Feng Zhao
Title: ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents
Abstract:
Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark. The code is available at https://github.com/Alibaba-NLP/ViDoRAG.
中文摘要:本文提出ViDoRAG多智能体框架,通过混合检索策略和迭代推理工作流解决现有RAG方法在处理视觉文档时的不足,在ViDoSeek基准测试中性能提升超过10%。
English Summary: The abstract introduces ViDoRAG, a multi-agent framework that addresses limitations in current RAG methods for visually rich documents by employing hybrid retrieval and iterative reasoning workflows, achieving over 10% improvement on the new ViDoSeek benchmark.

Authors:Xinghao Chen, Zhijing Sun, Wenjin Guo, Miaoran Zhang, Yanjun Chen, Yirong Sun, Hui Su, Yijie Pan, Dietrich Klakow, Wenjie Li, Xiaoyu Shen
Title: Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning
Abstract:
Large Language Models (LLMs) excel in reasoning tasks through Chain-of-Thought (CoT) prompting. However, CoT prompting greatly increases computational demands, which has prompted growing interest in distilling CoT capabilities into Small Language Models (SLMs). This study systematically examines the factors influencing CoT distillation, including the choice of granularity, format and teacher model. Through experiments involving four teacher models and seven student models across seven mathematical and commonsense reasoning datasets, we uncover three key findings: (1) Unlike LLMs, SLMs exhibit a non-monotonic relationship with granularity, with stronger models benefiting from finer-grained reasoning and weaker models performing better with simpler CoT supervision; (2) CoT format significantly impacts LLMs but has minimal effect on SLMs, likely due to their reliance on supervised fine-tuning rather than pretraining preferences; (3) Stronger teacher models do NOT always produce better student models, as diversity and complexity in CoT supervision can outweigh accuracy alone. These findings emphasize the need to tailor CoT strategies to specific student model, offering actionable insights for optimizing CoT distillation in SLMs. The code and datasets are available at https://github.com/EIT-NLP/Distilling-CoT-Reasoning.
中文摘要:本研究表明,针对小语言模型的思维链能力蒸馏需要定制化策略,因为小模型对推理粒度、格式和教师模型的选择响应方式与大模型不同,且更强的教师模型未必产生更好的学生模型。
English Summary: This study reveals that effective Chain-of-Thought distillation for Small Language Models requires tailored strategies, as SLMs respond differently than LLMs to granularity, format, and teacher model selection, with stronger teachers not always yielding better results.

Authors:Tianmi Ma, Jiawei Du, Wenxin Huang, Wenjie Wang, Liang Xie, Xian Zhong, Joey Tianyi Zhou
Title: Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in natural language tasks, yet their performance in dynamic, real-world financial environments remains underexplored. Existing approaches are limited to historical backtesting, where trading actions cannot influence market prices and agents train only on static data. To address this limitation, we present the Agent Trading Arena, a virtual zero-sum stock market in which LLM-based agents engage in competitive multi-agent trading and directly impact price dynamics. By simulating realistic bid-ask interactions, our platform enables training in scenarios that closely mirror live markets, thereby narrowing the gap between training and evaluation. Experiments reveal that LLMs struggle with numerical reasoning when given plain-text data, often overfitting to local patterns and recent values. In contrast, chart-based visualizations significantly enhance both numerical reasoning and trading performance. Furthermore, incorporating a reflection module yields additional improvements, especially with visual inputs. Evaluations on NASDAQ and CSI datasets demonstrate the superiority of our method, particularly under high volatility. All code and data are available at https://github.com/wekjsdvnm/Agent-Trading-Arena.
大语言模型在实时金融交易中因数值推理能力不足和过拟合问题表现欠佳,而代理交易竞技场通过引入可视化输入和反思模块的竞争性模拟,显著提升了交易性能,尤其在市场波动剧烈时效果更为突出。
Large language models struggle with real-time financial trading due to limitations in numerical reasoning and overfitting, but the Agent Trading Arena introduces a competitive simulation with visual inputs and reflection modules that significantly enhance performance, especially in volatile markets.

Authors:Qianying Liu, Katrina Qiyao Wang, Fei Cheng, Sadao Kurohashi
Title: Assessing Agentic Large Language Models in Multilingual National Bias
Abstract:
Large Language Models have garnered significant attention for their capabilities in multilingual natural language processing, while studies on risks associated with cross biases are limited to immediate context preferences. Cross-language disparities in reasoning-based recommendations remain largely unexplored, with a lack of even descriptive analysis. This study is the first to address this gap. We test LLM's applicability and capability in providing personalized advice across three key scenarios: university applications, travel, and relocation. We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages. We quantify bias in model-generated scores and assess the impact of demographic factors and reasoning strategies (e.g., Chain-of-Thought prompting) on bias patterns. Our findings reveal that local language bias is prevalent across different tasks, with GPT-4 and Sonnet reducing bias for English-speaking countries compared to GPT-3.5 but failing to achieve robust multilingual alignment, highlighting broader implications for multilingual AI agents and applications such as education. \footnote{Code available at: https://github.com/yiyunya/assess_agentic_national_bias
中文摘要:本研究揭示大型语言模型在跨语言决策任务中存在普遍的本土语言偏好,尽管新版模型有所改进,但仍无法实现稳健的多语言对齐,这对多语言AI应用具有重要影响。
English Summary: This study investigates multilingual bias in large language models, revealing persistent local language preferences across decision-making tasks despite some improvements in newer models, which fail to achieve robust cross-language alignment.

Authors:Haitao Li, Jiaying Ye, Yiran Hu, Jia Chen, Qingyao Ai, Yueyue Wu, Junjie Chen, Yifan Chen, Cheng Luo, Quan Zhou, Yiqun Liu
Title: CaseGen: A Benchmark for Multi-Stage Legal Case Documents Generation
Abstract:
Legal case documents play a critical role in judicial proceedings. As the number of cases continues to rise, the reliance on manual drafting of legal case documents is facing increasing pressure and challenges. The development of large language models (LLMs) offers a promising solution for automating document generation. However, existing benchmarks fail to fully capture the complexities involved in drafting legal case documents in real-world scenarios. To address this gap, we introduce CaseGen, the benchmark for multi-stage legal case documents generation in the Chinese legal domain. CaseGen is based on 500 real case samples annotated by legal experts and covers seven essential case sections. It supports four key tasks: drafting defense statements, writing trial facts, composing legal reasoning, and generating judgment results. To the best of our knowledge, CaseGen is the first benchmark designed to evaluate LLMs in the context of legal case document generation. To ensure an accurate and comprehensive evaluation, we design the LLM-as-a-judge evaluation framework and validate its effectiveness through human annotations. We evaluate several widely used general-domain LLMs and legal-specific LLMs, highlighting their limitations in case document generation and pinpointing areas for potential improvement. This work marks a step toward a more effective framework for automating legal case documents drafting, paving the way for the reliable application of AI in the legal field. The dataset and code are publicly available at https://github.com/CSHaitao/CaseGen.
中文摘要:本文提出了CaseGen,这是首个针对中文法律领域多阶段案件文书生成的基准测试,通过专家标注的真实案例和新颖的评估框架填补现有基准的不足,为AI在法律文书自动生成的可靠应用铺平道路。
English Summary: This paper introduces CaseGen, the first benchmark for evaluating large language models in multi-stage legal case document generation within the Chinese legal system, addressing gaps in existing benchmarks through expert-annotated real cases and a novel evaluation framework.

Authors:Mingyuan Sun, Zheng Fang, Jiaxu Wang, Junjie Jiang, Delei Kong, Chenming Hu, Yuetong Fang, Renjing Xu
Title: Optimal Brain Apoptosis
Abstract:
The increasing complexity and parameter count of Convolutional Neural Networks (CNNs) and Transformers pose challenges in terms of computational efficiency and resource demands. Pruning has been identified as an effective strategy to address these challenges by removing redundant elements such as neurons, channels, or connections, thereby enhancing computational efficiency without heavily compromising performance. This paper builds on the foundational work of Optimal Brain Damage (OBD) by advancing the methodology of parameter importance estimation using the Hessian matrix. Unlike previous approaches that rely on approximations, we introduce Optimal Brain Apoptosis (OBA), a novel pruning method that calculates the Hessian-vector product value directly for each parameter. By decomposing the Hessian matrix across network layers and identifying conditions under which inter-layer Hessian submatrices are non-zero, we propose a highly efficient technique for computing the second-order Taylor expansion of parameters. This approach allows for a more precise pruning process, particularly in the context of CNNs and Transformers, as validated in our experiments including VGG19, ResNet32, ResNet50, and ViT-B/16 on CIFAR10, CIFAR100 and Imagenet datasets. Our code is available at https://github.com/NEU-REAL/OBA.
中文: 本文提出了一种名为最优脑凋亡(OBA)的新剪枝方法,通过直接计算Hessian向量积来精确评估参数重要性,从而在多个数据集上验证了其在提升卷积神经网络和Transformer计算效率方面的有效性。
English: This paper introduces Optimal Brain Apoptosis (OBA), a novel pruning method that enhances computational efficiency by directly calculating Hessian-vector products for precise parameter importance estimation in CNNs and Transformers, validated on multiple datasets.

Authors:Hongyi Chen, Jingtao Ding, Xiaojun Liang, Yong Li, Xiao-Ping Zhang
Title: Structure-prior Informed Diffusion Model for Graph Source Localization with Limited Data
Abstract:
Source localization in graph information propagation is essential for mitigating network disruptions, including misinformation spread, cyber threats, and infrastructure failures. Existing deep generative approaches face significant challenges in real-world applications due to limited propagation data availability. We present SIDSL (\textbf{S}tructure-prior \textbf{I}nformed \textbf{D}iffusion model for \textbf{S}ource \textbf{L}ocalization), a generative diffusion framework that leverages topology-aware priors to enable robust source localization with limited data. SIDSL addresses three key challenges: unknown propagation patterns through structure-based source estimations via graph label propagation, complex topology-propagation relationships via a propagation-enhanced conditional denoiser with GNN-parameterized label propagation module, and class imbalance through structure-prior biased diffusion initialization. By learning pattern-invariant features from synthetic data generated by established propagation models, SIDSL enables effective knowledge transfer to real-world scenarios. Experimental evaluation on four real-world datasets demonstrates superior performance with 7.5-13.3\% F1 score improvements over baselines, including over 19\% improvement in few-shot and 40\% in zero-shot settings, validating the framework's effectiveness for practical source localization. Our code can be found \href{https://github.com/tsinghua-fib-lab/SIDSL}{here}.
中文: SIDSL是一种利用拓扑感知先验的生成扩散框架,可在有限数据下实现鲁棒的源定位,在真实场景中相比基线方法F1分数提升7.5-13.3%,验证了其实际有效性。
English: SIDSL is a generative diffusion framework that leverages topology-aware priors to enable robust source localization with limited data, demonstrating superior performance with 7.5-13.3% F1 score improvements over baselines in real-world applications.

Authors:Shiping Gao, Fanqi Wan, Jiajian Guo, Xiaojun Quan, Qifan Wang
Title: Advantage-Guided Distillation for Preference Alignment in Small Language Models
Abstract:
Alignment techniques enable Large Language Models (LLMs) to generate outputs that align with human preferences and play a crucial role in their effectiveness. However, their impact often diminishes when applied to Small Language Models (SLMs), likely due to the limited capacity of these models. Instead of directly applying existing alignment techniques to SLMs, we propose to utilize a well-aligned teacher LLM to guide the alignment process for these models, thereby facilitating the transfer of the teacher's knowledge of human preferences to the student model. To achieve this, we first explore a straightforward approach, Dual-Constrained Knowledge Distillation (DCKD), that employs knowledge distillation with two KL-divergence constraints from the aligned teacher to the unaligned student. To further enhance the student's ability to distinguish between preferred and dispreferred responses, we then propose Advantage-Guided Distillation for Preference Alignment (ADPA), which leverages an advantage function from the aligned teacher to deliver more nuanced, distribution-level reward signals for the student's alignment. Our experimental results show that these two approaches appreciably improve the alignment of SLMs and narrow the performance gap with larger counterparts. Among them, ADPA demonstrates superior performance and achieves even greater effectiveness when integrated with DCKD. Our code is available at https://github.com/SLIT-AI/ADPA.
中文: 由于小型语言模型能力有限,大语言模型的对齐技术对其效果不佳,因此我们提出DCKD和ADPA两种知识蒸馏方法,利用对齐良好的教师大模型向学生模型传递人类偏好知识,显著提升了小型模型的对齐效果并缩小了与大型模型的性能差距。
English: Alignment techniques for Large Language Models often fail with Small Language Models due to their limited capacity, so we propose two knowledge distillation methods—DCKD and ADPA—that use a well-aligned teacher LLM to transfer human preference knowledge to SLMs, significantly improving their alignment and narrowing the performance gap with larger models.

Authors:Vishal Nedungadi, Muhammad Akhtar Munir, Marc Rußwurm, Ron Sarafian, Ioannis N. Athanasiadis, Yinon Rudich, Fahad Shahbaz Khan, Salman Khan
Title: AirCast: Improving Air Pollution Forecasting Through Multi-Variable Data Alignment
Abstract:
Air pollution remains a leading global health risk, exacerbated by rapid industrialization and urbanization, contributing significantly to morbidity and mortality rates. In this paper, we introduce AirCast, a novel multi-variable air pollution forecasting model, by combining weather and air quality variables. AirCast employs a multi-task head architecture that simultaneously forecasts atmospheric conditions and pollutant concentrations, improving its understanding of how weather patterns affect air quality. Predicting extreme pollution events is challenging due to their rare occurrence in historic data, resulting in a heavy-tailed distribution of pollution levels. To address this, we propose a novel Frequency-weighted Mean Absolute Error (fMAE) loss, adapted from the class-balanced loss for regression tasks. Informed from domain knowledge, we investigate the selection of key variables known to influence pollution levels. Additionally, we align existing weather and chemical datasets across spatial and temporal dimensions. AirCast's integrated approach, combining multi-task learning, frequency weighted loss and domain informed variable selection, enables more accurate pollution forecasts. Our source code and models are made public here (https://github.com/vishalned/AirCast.git)
中文摘要:AirCast是一种创新的多变量空气污染预测模型,通过整合天气和空气质量数据,采用多任务学习架构、频率加权损失函数及领域知识驱动的变量选择方法,显著提升了污染预测的精确度。
English Summary: AirCast is a novel multi-variable air pollution forecasting model that integrates weather and air quality data through multi-task learning, frequency-weighted loss, and domain-informed variable selection to improve prediction accuracy.

Authors:Yuhan Chen, Yihong Luo, Yifan Song, Pengwen Dai, Jing Tang, Xiaochun Cao
Title: Decoupled Graph Energy-based Model for Node Out-of-Distribution Detection on Heterophilic Graphs
Abstract:
Despite extensive research efforts focused on OOD detection on images, OOD detection on nodes in graph learning remains underexplored. The dependence among graph nodes hinders the trivial adaptation of existing approaches on images that assume inputs to be i.i.d. sampled, since many unique features and challenges specific to graphs are not considered, such as the heterophily issue. Recently, GNNSafe, which considers node dependence, adapted energy-based detection to the graph domain with state-of-the-art performance, however, it has two serious issues: 1) it derives node energy from classification logits without specifically tailored training for modeling data distribution, making it less effective at recognizing OOD data; 2) it highly relies on energy propagation, which is based on homophily assumption and will cause significant performance degradation on heterophilic graphs, where the node tends to have dissimilar distribution with its neighbors. To address the above issues, we suggest training EBMs by MLE to enhance data distribution modeling and remove energy propagation to overcome the heterophily issues. However, training EBMs via MLE requires performing MCMC sampling on both node feature and node neighbors, which is challenging due to the node interdependence and discrete graph topology. To tackle the sampling challenge, we introduce DeGEM, which decomposes the learning process into two parts: a graph encoder that leverages topology information for node representations and an energy head that operates in latent space. Extensive experiments validate that DeGEM, without OOD exposure during training, surpasses previous state-of-the-art methods, achieving an average AUROC improvement of 6.71% on homophilic graphs and 20.29% on heterophilic graphs, and even outperform methods trained with OOD exposure. Our code is available at: https://github.com/draym28/DeGEM.
中文: 图学习中的节点分布外检测因节点间依赖性而研究不足,提出的DeGEM方法通过将学习过程分解为图编码器和能量头,克服了现有方法的局限,在无需分布外数据训练的情况下实现了最优性能。
English: Graph out-of-distribution (OOD) detection remains underexplored due to node interdependence, and the proposed DeGEM method overcomes limitations of prior approaches by decomposing learning into a graph encoder and energy head, achieving state-of-the-art performance without OOD exposure during training.

Authors:Mingyan Wu, Zhenghao Liu, Yukun Yan, Xinze Li, Shi Yu, Zheni Zeng, Yu Gu, Ge Yu
Title: RankCoT: Refining Knowledge for Retrieval-Augmented Generation through Ranking Chain-of-Thoughts
Abstract:
Retrieval-Augmented Generation (RAG) enhances the performance of Large Language Models (LLMs) by incorporating external knowledge. However, LLMs still encounter challenges in effectively utilizing the knowledge from retrieved documents, often being misled by irrelevant or noisy information. To address this issue, we introduce RankCoT, a knowledge refinement method that incorporates reranking signals in generating CoT-based summarization for knowledge refinement based on given query and all retrieval documents. During training, RankCoT prompts the LLM to generate Chain-of-Thought (CoT) candidates based on the query and individual documents. It then fine-tunes the LLM to directly reproduce the best CoT from these candidate outputs based on all retrieved documents, which requires LLM to filter out irrelevant documents during generating CoT-style summarization. Additionally, RankCoT incorporates a self-reflection mechanism that further refines the CoT outputs, resulting in higher-quality training data. Our experiments demonstrate the effectiveness of RankCoT, showing its superior performance over other knowledge refinement models. Further analysis reveals that RankCoT can provide shorter but effective refinement results, enabling the generator to produce more accurate answers. All code and data are available at https://github.com/NEUIR/RankCoT.
中文: RankCoT是一种知识精炼方法,通过重排序和自反思机制生成思维链摘要,有效过滤无关文档以提升大语言模型答案的准确性。
English: RankCoT is a knowledge refinement method that enhances Large Language Models by generating Chain-of-Thought summaries through reranking and self-reflection, effectively filtering irrelevant documents to produce more accurate answers.

Authors:Rong Liu, Junye Liang, Jiaqi Yang, Jiang He, Peng Zhu
Title: Dual Classification Head Self-training Network for Cross-scene Hyperspectral Image Classification
Abstract:
Due to the difficulty of obtaining labeled data for hyperspectral images (HSIs), cross-scene classification has emerged as a widely adopted approach in the remote sensing community. It involves training a model using labeled data from a source domain (SD) and unlabeled data from a target domain (TD), followed by inferencing on the TD. However, variations in the reflectance spectrum of the same object between the SD and the TD, as well as differences in the feature distribution of the same land cover class, pose significant challenges to the performance of cross-scene classification. To address this issue, we propose a dual classification head self-training network (DHSNet). This method aligns class-wise features across domains, ensuring that the trained classifier can accurately classify TD data of different classes. We introduce a dual classification head self-training strategy for the first time in the cross-scene HSI classification field. The proposed approach mitigates domain gap while preventing the accumulation of incorrect pseudo-labels in the model. Additionally, we incorporate a novel central feature attention mechanism to enhance the model's capacity to learn scene-invariant features across domains. Experimental results on three cross-scene HSI datasets demonstrate that the proposed DHSNET significantly outperforms other state-of-the-art approaches. The code for DHSNet will be available at https://github.com/liurongwhm.
Chinese: 针对跨场景高光谱图像分类中的域差异问题,DHSNet首次提出双分类头自训练网络并引入中心特征注意力机制,在三个数据集上显著优于现有先进方法。
English: To overcome domain discrepancies in cross-scene hyperspectral image classification, DHSNet introduces a dual classification head self-training strategy with a central feature attention mechanism, achieving superior performance on benchmark datasets.

Authors:Runzhong Wang, Rui-Xi Wang, Mrunali Manjrekar, Connor W. Coley
Title: Neural Graph Matching Improves Retrieval Augmented Generation in Molecular Machine Learning
Abstract:
Molecular machine learning has gained popularity with the advancements of geometric deep learning. In parallel, retrieval-augmented generation has become a principled approach commonly used with language models. However, the optimal integration of retrieval augmentation into molecular machine learning remains unclear. Graph neural networks stand to benefit from clever matching to understand the structural alignment of retrieved molecules to a query molecule. Neural graph matching offers a compelling solution by explicitly modeling node and edge affinities between two structural graphs while employing a noise-robust, end-to-end neural network to learn affinity metrics. We apply this approach to mass spectrum simulation and introduce MARASON, a novel model that incorporates neural graph matching to enhance a fragmentation-based neural network. Experimental results highlight the effectiveness of our design, with MARASON achieving 28% top-1 accuracy, a substantial improvement over the non-retrieval state-of-the-art accuracy of 19%. Moreover, MARASON outperforms both naive retrieval-augmented generation methods and traditional graph matching approaches. Code is publicly available at https://github.com/coleygroup/ms-pred
中文摘要:MARASON模型通过神经图匹配技术将检索增强生成应用于分子机器学习,在质谱模拟任务中实现了28%的准确率,较无检索增强的现有最优方法19%有显著提升。
English Summary: Neural graph matching is applied to molecular machine learning through the MARASON model, significantly improving mass spectrum simulation accuracy to 28% compared to 19% without retrieval augmentation.

Authors:Jianghao Chen, Zhenlin Wei, Zhenjiang Ren, Ziyong Li, Jiajun Zhang
Title: LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
Abstract:
Recent progress in Large Reasoning Models (LRMs) has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making assumptions, backtracking, and self-refinement. However, effectively evaluating such reflection capabilities remains challenging due to the lack of appropriate benchmarks. To bridge this gap, we introduce LR$^2$Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR$^2$Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios. Our extensive evaluation on both conventional LLMs and LRMs reveals that even the most advanced LRMs, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in LR$^2$Bench, achieving an average Exact Match score of only 20.0% and 23.6%, respectively. These findings underscore the significant room for improvement in the reflective reasoning capabilities of current LLMs.
Chinese: 大型推理模型(LRMs)虽提升了LLMs的推理能力,但评估其反思能力仍具挑战,为此推出的LR²Bench基准测试显示,即使顶尖LRMs如DeepSeek-R1和OpenAI o1-preview也表现不佳,平均准确率仅20.0%和23.6%,表明当前模型反思推理能力亟待提升。
English: Large Reasoning Models (LRMs) have advanced reasoning in LLMs, but evaluating their reflection capabilities remains challenging, prompting the introduction of LR²Bench, a benchmark that reveals even top LRMs like DeepSeek-R1 and OpenAI o1-preview struggle with only 20.0% and 23.6% average scores, highlighting significant room for improvement.

Authors:Hyeonjeong Ha, Qiusi Zhan, Jeonghwan Kim, Dimitrios Bralios, Saikrishna Sanniboina, Nanyun Peng, Kai-Wei Chang, Daniel Kang, Heng Ji
Title: MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks
Abstract:
Multimodal large language models (MLLMs) equipped with Retrieval Augmented Generation (RAG) leverage both their rich parametric knowledge and the dynamic, external knowledge to excel in tasks such as Question Answering. While RAG enhances MLLMs by grounding responses in query-relevant external knowledge, this reliance poses a critical yet underexplored safety risk: knowledge poisoning attacks, where misinformation or irrelevant knowledge is intentionally injected into external knowledge bases to manipulate model outputs to be incorrect and even harmful. To expose such vulnerabilities in multimodal RAG, we propose MM-PoisonRAG, a novel knowledge poisoning attack framework with two attack strategies: Localized Poisoning Attack (LPA), which injects query-specific misinformation in both text and images for targeted manipulation, and Globalized Poisoning Attack (GPA) to provide false guidance during MLLM generation to elicit nonsensical responses across all queries. We evaluate our attacks across multiple tasks, models, and access settings, demonstrating that LPA successfully manipulates the MLLM to generate attacker-controlled answers, with a success rate of up to 56% on MultiModalQA. Moreover, GPA completely disrupts model generation to 0% accuracy with just a single irrelevant knowledge injection. Our results highlight the urgent need for robust defenses against knowledge poisoning to safeguard multimodal RAG frameworks.
中文摘要:多模态检索增强生成系统面临知识投毒攻击的安全隐患,攻击者通过注入误导性内容操控模型输出,MM-PoisonRAG框架实验显示其定向攻击成功率最高可达56%。
English Summary: Multimodal RAG systems face critical safety risks from knowledge poisoning attacks, where adversaries inject misleading content to manipulate model outputs, as demonstrated by the MM-PoisonRAG framework achieving up to 56% targeted attack success.

Authors:Hyeonjeong Ha, Qiusi Zhan, Jeonghwan Kim, Dimitrios Bralios, Saikrishna Sanniboina, Nanyun Peng, Kai-Wei Chang, Daniel Kang, Heng Ji
Title: MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks
Abstract:
Multimodal large language models with Retrieval Augmented Generation (RAG) have significantly advanced tasks such as multimodal question answering by grounding responses in external text and images. This grounding improves factuality, reduces hallucination, and extends reasoning beyond parametric knowledge. However, this reliance on external knowledge poses a critical yet underexplored safety risk: knowledge poisoning attacks, where adversaries deliberately inject adversarial multimodal content into external knowledge bases to steer model toward generating incorrect or even harmful responses. To expose such vulnerabilities, we propose MM-PoisonRAG, the first framework to systematically design knowledge poisoning in multimodal RAG. We introduce two complementary attack strategies: Localized Poisoning Attack (LPA), which implants targeted multimodal misinformation to manipulate specific queries, and Globalized Poisoning Attack (GPA), which inserts a single adversarial knowledge to broadly disrupt reasoning and induce nonsensical responses across all queries. Comprehensive experiments across tasks, models, and access settings show that LPA achieves targeted manipulation with attack success rates of up to 56%, while GPA completely disrupts model generation to 0% accuracy with just a single adversarial knowledge injection. Our results reveal the fragility of multimodal RAG and highlight the urgent need for defenses against knowledge poisoning.
中文摘要:多模态检索增强生成系统面临知识投毒攻击的安全隐患,攻击者通过注入误导性内容操控模型输出,MM-PoisonRAG框架实验显示其定向攻击成功率最高可达56%。
English Summary: Multimodal RAG systems face critical safety risks from knowledge poisoning attacks, where adversaries inject misleading content to manipulate model outputs, as demonstrated by the MM-PoisonRAG framework achieving up to 56% targeted attack success.

Authors:Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao, Kai Shu
Title: Can Multimodal LLMs Perform Time Series Anomaly Detection?
Abstract:
Large language models (LLMs) have been increasingly used in time series analysis. However, the potential of multimodal LLMs (MLLMs), particularly vision-language models, for time series remains largely under-explored. One natural way for humans to detect time series anomalies is through visualization and textual description. Motivated by this, we raise a critical and practical research question: Can multimodal LLMs perform time series anomaly detection? To answer this, we propose VisualTimeAnomaly benchmark to evaluate MLLMs in time series anomaly detection (TSAD). Our approach transforms time series numerical data into the image format and feed these images into various MLLMs, including proprietary models (GPT-4o and Gemini-1.5) and open-source models (LLaVA-NeXT and Qwen2-VL), each with one larger and one smaller variant. In total, VisualTimeAnomaly contains 12.4k time series images spanning 3 scenarios and 3 anomaly granularities with 9 anomaly types across 8 MLLMs. Starting with the univariate case (point- and range-wise anomalies), we extend our evaluation to more practical scenarios, including multivariate and irregular time series scenarios, and variate-wise anomalies. Our study reveals several key insights: 1) MLLMs detect range- and variate-wise anomalies more effectively than point-wise anomalies. 2) MLLMs are highly robust to irregular time series, even with 25% of the data missing. 3) Open-source MLLMs perform comparably to proprietary models in TSAD. While open-source MLLMs excel on univariate time series, proprietary MLLMs demonstrate superior effectiveness on multivariate time series. To the best of our knowledge, this is the first work to comprehensively investigate MLLMs for TSAD, particularly for multivariate and irregular time series scenarios. We release our dataset and code at https://github.com/mllm-ts/VisualTimeAnomaly to support future research.
Chinese: 本研究提出VisualTimeAnomaly基准,通过将时间序列数值数据转换为图像来评估多模态大语言模型在异常检测中的表现,发现模型能有效检测范围和变量级异常、对不规则数据具有强鲁棒性,且开源模型在单变量场景下与商业模型性能相当。
English: This study introduces the VisualTimeAnomaly benchmark to evaluate multimodal large language models (MLLMs) on time series anomaly detection by converting numerical data into images, revealing that MLLMs effectively detect range- and variate-wise anomalies, show robustness to irregular data, and that open-source models perform comparably to proprietary ones in univariate cases.

Authors:Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, Liefeng Bo
Title: LAM: Large Avatar Model for One-shot Animatable Gaussian Head
Abstract:
We present LAM, an innovative Large Avatar Model for animatable Gaussian head reconstruction from a single image. Unlike previous methods that require extensive training on captured video sequences or rely on auxiliary neural networks for animation and rendering during inference, our approach generates Gaussian heads that are immediately animatable and renderable. Specifically, LAM creates an animatable Gaussian head in a single forward pass, enabling reenactment and rendering without additional networks or post-processing steps. This capability allows for seamless integration into existing rendering pipelines, ensuring real-time animation and rendering across a wide range of platforms, including mobile phones. The centerpiece of our framework is the canonical Gaussian attributes generator, which utilizes FLAME canonical points as queries. These points interact with multi-scale image features through a Transformer to accurately predict Gaussian attributes in the canonical space. The reconstructed canonical Gaussian avatar can then be animated utilizing standard linear blend skinning (LBS) with corrective blendshapes as the FLAME model did and rendered in real-time on various platforms. Our experimental results demonstrate that LAM outperforms state-of-the-art methods on existing benchmarks. Our code and video are available at https://aigc3d.github.io/projects/LAM/
中文: LAM是一种创新的大型虚拟化身模型,通过单张图像一次性生成可动画化的高斯头部,无需额外网络即可实现跨平台实时动画与渲染。
English: LAM is a novel Large Avatar Model that reconstructs animatable Gaussian heads from a single image in one forward pass, enabling real-time animation and rendering across multiple platforms without additional networks.

Authors:Hyeonjeong Ha, Xiaomeng Jin, Jeonghwan Kim, Jiateng Liu, Zhenhailong Wang, Khanh Duy Nguyen, Ansel Blume, Nanyun Peng, Kai-Wei Chang, Heng Ji
Title: SYNTHIA: Novel Concept Design with Affordance Composition
Abstract:
Text-to-image (T2I) models enable rapid concept design, making them widely used in AI-driven design. While recent studies focus on generating semantic and stylistic variations of given design concepts, functional coherence--the integration of multiple affordances into a single coherent concept--remains largely overlooked. In this paper, we introduce SYNTHIA, a framework for generating novel, functionally coherent designs based on desired affordances. Our approach leverages a hierarchical concept ontology that decomposes concepts into parts and affordances, serving as a crucial building block for functionally coherent design. We also develop a curriculum learning scheme based on our ontology that contrastively fine-tunes T2I models to progressively learn affordance composition while maintaining visual novelty. To elaborate, we (i) gradually increase affordance distance, guiding models from basic concept-affordance association to complex affordance compositions that integrate parts of distinct affordances into a single, coherent form, and (ii) enforce visual novelty by employing contrastive objectives to push learned representations away from existing concepts. Experimental results show that SYNTHIA outperforms state-of-the-art T2I models, demonstrating absolute gains of 25.1% and 14.7% for novelty and functional coherence in human evaluation, respectively.
Chinese: 本文提出SYNTHIA框架,通过分层概念本体和课程学习,将多种功能整合到单一概念中生成新颖且功能一致的设计,在人类评估中显著超越了现有T2I模型的新颖性和功能一致性表现。
English: This paper introduces SYNTHIA, a framework that uses a hierarchical concept ontology and curriculum learning to generate novel, functionally coherent designs by integrating multiple affordances into a single concept, significantly outperforming existing T2I models in both novelty and functional coherence.

Authors:Luigi Seminara, Giovanni Maria Farinella, Antonino Furnari
Title: Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos
Abstract:
We introduce a gradient-based approach for learning task graphs from procedural activities, improving over hand-crafted methods. Our method directly optimizes edge weights via maximum likelihood, enabling integration into neural architectures. We validate our approach on CaptainCook4D, EgoPER, and EgoProceL, achieving +14.5%, +10.2%, and +13.6% F1-score improvements. Our feature-based approach for predicting task graphs from textual/video embeddings demonstrates emerging video understanding abilities. We also achieved top performance on the procedure understanding benchmark on Ego-Exo4D and significantly improved online mistake detection (+19.8% on Assembly101-O, +6.4% on EPIC-Tent-O). Code: https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.
中文: 我们提出的基于梯度的学习方法通过最大似然优化边权重来从程序性活动中学习任务图,在多个数据集上实现了显著的F1分数提升,并在程序理解基准测试中表现出色。
English: Our gradient-based method learns task graphs from procedural activities by optimizing edge weights through maximum likelihood, achieving significant F1-score improvements on multiple datasets and excelling in procedure understanding benchmarks.

Authors:Shinwoo Park, Hyundong Jin, Jeong-won Cha, Yo-Sub Han
Title: Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features
Abstract:
Recent progress in large language models (LLMs) for code generation has raised serious concerns about intellectual property protection. Malicious users can exploit LLMs to produce paraphrased versions of proprietary code that closely resemble the original. While the potential for LLM-assisted code paraphrasing continues to grow, research on detecting it remains limited, underscoring an urgent need for detection system. We respond to this need by proposing two tasks. The first task is to detect whether code generated by an LLM is a paraphrased version of original human-written code. The second task is to identify which LLM is used to paraphrase the original code. For these tasks, we construct a dataset LPcode consisting of pairs of human-written code and LLM-paraphrased code using various LLMs. We statistically confirm significant differences in the coding styles of human-written and LLM-paraphrased code, particularly in terms of naming consistency, code structure, and readability. Based on these findings, we develop LPcodedec, a detection method that identifies paraphrase relationships between human-written and LLM-generated code, and discover which LLM is used for the paraphrasing. LPcodedec outperforms the best baselines in two tasks, improving F1 scores by 2.64% and 15.17% while achieving speedups of 1,343x and 213x, respectively. Our code and data are available at https://github.com/Shinwoo-Park/detecting_llm_paraphrased_code_via_coding_style_features.
中文: 本研究针对大型语言模型辅助代码改写的威胁,提出了检测系统LPcodedec,通过分析代码风格特征来识别改写代码及其来源模型,在检测性能和速度上均显著优于现有基准方法。
English: This study addresses the growing threat of LLM-assisted code paraphrasing by introducing a detection system, LPcodedec, which identifies paraphrased code and its source model through coding style analysis, achieving significant performance and speed improvements over existing methods.

Authors:Sushmita Sarker, Prithul Sarker, George Bebis, Alireza Tavakkoli
Title: Can Score-Based Generative Modeling Effectively Handle Medical Image Classification?
Abstract:
The remarkable success of deep learning in recent years has prompted applications in medical image classification and diagnosis tasks. While classification models have demonstrated robustness in classifying simpler datasets like MNIST or natural images such as ImageNet, this resilience is not consistently observed in complex medical image datasets where data is more scarce and lacks diversity. Moreover, previous findings on natural image datasets have indicated a potential trade-off between data likelihood and classification accuracy. In this study, we explore the use of score-based generative models as classifiers for medical images, specifically mammographic images. Our findings suggest that our proposed generative classifier model not only achieves superior classification results on CBIS-DDSM, INbreast and Vin-Dr Mammo datasets, but also introduces a novel approach to image classification in a broader context. Our code is publicly available at https://github.com/sushmitasarker/sgc_for_medical_image_classification
中文摘要: 本研究提出了一种基于分数的生成式分类器,在医学乳腺摄影数据集上实现了优越的分类性能,并为图像分类领域提供了超越传统方法的新思路。
English Summary: This study proposes a score-based generative classifier that achieves superior performance on medical mammography datasets while offering a novel approach to image classification beyond traditional methods.

Authors:Ruxiao Chen, Chenguang Wang, Yuran Sun, Xilei Zhao, Susu Xu
Title: From Perceptions to Decisions: Wildfire Evacuation Decision Prediction with Behavioral Theory-informed LLMs
Abstract:
Evacuation decision prediction is critical for efficient and effective wildfire response by helping emergency management anticipate traffic congestion and bottlenecks, allocate resources, and minimize negative impacts. Traditional statistical methods for evacuation decision prediction fail to capture the complex and diverse behavioral logic of different individuals. In this work, for the first time, we introduce FLARE, short for facilitating LLM for advanced reasoning on wildfire evacuation decision prediction, a Large Language Model (LLM)-based framework that integrates behavioral theories and models to streamline the Chain-of-Thought (CoT) reasoning and subsequently integrate with memory-based Reinforcement Learning (RL) module to provide accurate evacuation decision prediction and understanding. Our proposed method addresses the limitations of using existing LLMs for evacuation behavioral predictions, such as limited survey data, mismatching with behavioral theory, conflicting individual preferences, implicit and complex mental states, and intractable mental state-behavior mapping. Experiments on three post-wildfire survey datasets show an average of 20.47% performance improvement over traditional theory-informed behavioral models, with strong cross-event generalizability. Our complete code is publicly available at https://github.com/SusuXu-s-Lab/FLARE
Chinese: FLARE框架通过结合行为理论与大语言模型推理及强化学习,改进了野火疏散决策预测,相比传统模型性能提升20.47%,有效解决了数据不足和理论不匹配等局限性。
English: The FLARE framework enhances wildfire evacuation decision prediction by integrating behavioral theories with LLM-based reasoning and reinforcement learning, achieving a 20.47% performance improvement over traditional models while addressing data and theory limitations.

Authors:Lei Cheng, Lihao Guo, Tianya Zhang, Tam Bang, Austin Harris, Mustafa Hajij, Mina Sartipi, Siyang Cao
Title: CalibRefine: Deep Learning-Based Online Automatic Targetless LiDAR-Camera Calibration with Iterative and Attention-Driven Post-Refinement
Abstract:
Accurate multi-sensor calibration is essential for deploying robust perception systems in applications such as autonomous driving and intelligent transportation. Existing LiDAR-camera calibration methods often rely on manually placed targets, preliminary parameter estimates, or intensive data preprocessing, limiting their scalability and adaptability in real-world settings. In this work, we propose a fully automatic, targetless, and online calibration framework, CalibRefine, which directly processes raw LiDAR point clouds and camera images. Our approach is divided into four stages: (1) a Common Feature Discriminator that leverages relative spatial positions, visual appearance embeddings, and semantic class cues to identify and generate reliable LiDAR-camera correspondences, (2) a coarse homography-based calibration that uses the matched feature correspondences to estimate an initial transformation between the LiDAR and camera frames, serving as the foundation for further refinement, (3) an iterative refinement to incrementally improve alignment as additional data frames become available, and (4) an attention-based refinement that addresses non-planar distortions by leveraging a Vision Transformer and cross-attention mechanisms. Extensive experiments on two urban traffic datasets demonstrate that CalibRefine achieves high-precision calibration with minimal human input, outperforming state-of-the-art targetless methods and matching or surpassing manually tuned baselines. Our results show that robust object-level feature matching, combined with iterative refinement and self-supervised attention-based refinement, enables reliable sensor alignment in complex real-world conditions without ground-truth matrices or elaborate preprocessing. Code is available at https://github.com/radar-lab/Lidar_Camera_Automatic_Calibration
中文:CalibRefine是一种全自动、无目标且在线的标定框架,通过鲁棒的特征匹配和迭代优化实现高精度LiDAR-相机校准,在实际场景中超越了现有方法。
English: CalibRefine is a fully automatic, targetless, and online framework that achieves high-precision LiDAR-camera calibration through robust feature matching and iterative refinement, outperforming existing methods in real-world conditions.

Authors:Taos Transue, Bao Wang
Title: Learning Decentralized Swarms Using Rotation Equivariant Graph Neural Networks
Abstract:
The orchestration of agents to optimize a collective objective without centralized control is challenging yet crucial for applications such as controlling autonomous fleets, and surveillance and reconnaissance using sensor networks. Decentralized controller design has been inspired by self-organization found in nature, with a prominent source of inspiration being flocking; however, decentralized controllers struggle to maintain flock cohesion. The graph neural network (GNN) architecture has emerged as an indispensable machine learning tool for developing decentralized controllers capable of maintaining flock cohesion, but they fail to exploit the symmetries present in flocking dynamics, hindering their generalizability. We enforce rotation equivariance and translation invariance symmetries in decentralized flocking GNN controllers and achieve comparable flocking control with 70% less training data and 75% fewer trainable weights than existing GNN controllers without these symmetries enforced. We also show that our symmetry-aware controller generalizes better than existing GNN controllers. Code and animations are available at http://github.com/Utah-Math-Data-Science/Equivariant-Decentralized-Controllers.
中文摘要:研究人员开发了一种基于图神经网络的分散式集群控制器,通过强制旋转等变和平移不变对称性,在减少70%训练数据和75%参数的情况下实现相当性能,并展现出更优的泛化能力。
English Summary: Researchers developed a decentralized flocking controller using graph neural networks that enforces rotation equivariance and translation invariance, achieving comparable performance with 70% less training data and 75% fewer parameters while demonstrating superior generalization capabilities.

Authors:Dang Nguyen, Zeman Li, Mohammadhossein Bateni, Vahab Mirrokni, Meisam Razaviyayn, Baharan Mirzasoleiman
Title: Synthetic Text Generation for Training Large Language Models via Gradient Matching
Abstract:
Synthetic data has the potential to improve the performance, training efficiency, and privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristics and cannot generate human-readable text without compromising the privacy of real data, or provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that provides convergence, performance, and privacy guarantees for fine-tuning LLMs on a target task. To do so, we leverage Alternating Direction Method of Multipliers (ADMM) that iteratively optimizes the embeddings of synthetic examples to match the noisy gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text guarantees convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data and preserves their privacy. Experiments on various classification tasks confirm the effectiveness of our proposed approach. Our code is available at https://github.com/BigML-CS-UCLA/GRADMM.
中文: 本研究提出了一种基于ADMM的理论严谨方法,可生成人类可读的合成文本,在保证收敛性、性能和隐私的前提下用于微调大语言模型,并在多项分类任务中得到验证。
English: This study introduces a theoretically rigorous method using ADMM to generate human-readable synthetic text that ensures convergence, performance, and privacy for fine-tuning LLMs, validated across multiple classification tasks.

Authors:Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang
Title: MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Abstract:
Long-context Multimodal Large Language Models (MLLMs) that incorporate long text-image and text-video modalities, demand substantial resources as their multimodal Key-Value (KV) caches grow with increasing input lengths, challenging inference efficiency. Existing methods for KV cache compression, in both text-only and multimodal LLMs, have neglected attention density variations across layers, thus often adopting uniform or progressive reduction strategies for layer-wise cache allocation. In this work, we propose MEDA, a dynamic layer-wise KV cache allocation method for efficient multimodal long-context inference. As its core, MEDA utilizes cross-modal attention entropy to determine the KV cache size at each MLLMs layer. Given the dynamically allocated KV cache size at each layer, MEDA also employs a KV pair selection scheme to identify which KV pairs to select and a KV pair merging strategy that merges the selected and non-selected ones to preserve information from the entire context. MEDA achieves up to 72% KV cache memory reduction and 2.82 times faster decoding speed, while maintaining or enhancing performance on various multimodal tasks in long-context settings, including multi-images and long-video scenarios. Our code is released at https://github.com/AIoT-MLSys-Lab/MEDA.
中文: MEDA提出了一种基于跨模态注意力熵的动态分层KV缓存分配方法,在保持多模态长上下文模型性能的同时,显著降低了内存使用并提升了解码速度。
English: MEDA introduces a dynamic layer-wise KV cache allocation method using cross-modal attention entropy to significantly reduce memory usage and accelerate decoding in multimodal long-context models while maintaining performance.

Authors:Peijie Zhao, Zunayed Arefin, Felipe Meneguzzi, Ramon Fraga Pereira
Title: Intention Recognition in Real-Time Interactive Navigation Maps
Abstract:
In this demonstration, we develop IntentRec4Maps, a system to recognise users' intentions in interactive maps for real-world navigation. IntentRec4Maps uses the Google Maps Platform as the real-world interactive map, and a very effective approach for recognising users' intentions in real-time. We showcase the recognition process of IntentRec4Maps using two different Path-Planners and a Large Language Model (LLM). GitHub: https://github.com/PeijieZ/IntentRec4Maps
Chinese: 我们推出了IntentRec4Maps系统,该系统基于谷歌地图平台和大语言模型实时识别用户在交互式地图中的意图,并通过两种路径规划器进行了演示验证。
English: We introduce IntentRec4Maps, a real-time system that recognizes user intentions on interactive maps using the Google Maps Platform and a Large Language Model, demonstrated with two Path-Planners.

Authors:Liping Lu, Zhican He, Duanfeng Chu, Rukang Wang, Saiqian Peng, Pan Zhou
Title: ConvoyLLM: Dynamic Multi-Lane Convoy Control Using LLMs
Abstract:
This paper proposes a novel method for multi-lane convoy formation control that uses large language models (LLMs) to tackle coordination challenges in dynamic highway environments. Each connected and autonomous vehicle in the convoy uses a knowledge-driven approach to make real-time adaptive decisions based on various scenarios. Our method enables vehicles to dynamically perform tasks, including obstacle avoidance, convoy joining/leaving, and escort formation switching, all while maintaining the overall convoy structure. We design a Interlaced formation control strategy based on locally dynamic distributed graphs, ensuring the convoy remains stable and flexible. We conduct extensive experiments in the SUMO simulation platform across multiple traffic scenarios, and the results demonstrate that the proposed method is effective, robust, and adaptable to dynamic environments. The code is available at: https://github.com/chuduanfeng/ConvoyLLM.
中文: 本文提出了一种基于大语言模型的多车道车队编队控制新方法,通过实时自适应决策在动态高速环境中实现稳定灵活的车队管理,多种交通场景下的仿真实验验证了其有效性。
English: This paper introduces a novel multi-lane convoy formation control method using large language models for real-time adaptive decision-making in dynamic highway environments, demonstrating effectiveness through extensive simulations in various traffic scenarios.

Authors:Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, Baishakhi Ray
Title: Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
Abstract:
Data contamination has received increasing attention in the era of large language models (LLMs) due to their reliance on vast Internet-derived training corpora. To mitigate the risk of potential data contamination, LLM benchmarking has undergone a transformation from static to dynamic benchmarking. In this work, we conduct an in-depth analysis of existing static to dynamic benchmarking methods aimed at reducing data contamination risks. We first examine methods that enhance static benchmarks and identify their inherent limitations. We then highlight a critical gap-the lack of standardized criteria for evaluating dynamic benchmarks. Based on this observation, we propose a series of optimal design principles for dynamic benchmarking and analyze the limitations of existing dynamic benchmarks. This survey provides a concise yet comprehensive overview of recent advancements in data contamination research, offering valuable insights and a clear guide for future research efforts. We maintain a GitHub repository to continuously collect both static and dynamic benchmarking methods for LLMs. The repository can be found at this link.
中文: 本综述分析了为应对数据污染风险从静态基准测试向动态基准测试的转变,指出了现有评估标准的不足,并为动态基准测试提出了优化设计原则。
English: This survey analyzes the shift from static to dynamic benchmarking in large language models to address data contamination risks, identifies gaps in current evaluation standards, and proposes optimal design principles for dynamic benchmarks.

Authors:Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, Baishakhi Ray
Title: Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
Abstract:
Data contamination has received increasing attention in the era of large language models (LLMs) due to their reliance on vast Internet-derived training corpora. To mitigate the risk of potential data contamination, LLM benchmarking has undergone a transformation from static to dynamic benchmarking. In this work, we conduct an in-depth analysis of existing static to dynamic benchmarking methods aimed at reducing data contamination risks. We first examine methods that enhance static benchmarks and identify their inherent limitations. We then highlight a critical gap-the lack of standardized criteria for evaluating dynamic benchmarks. Based on this observation, we propose a series of optimal design principles for dynamic benchmarking and analyze the limitations of existing dynamic benchmarks. This survey provides a concise yet comprehensive overview of recent advancements in data contamination research, offering valuable insights and a clear guide for future research efforts. We maintain a GitHub repository to continuously collect both static and dynamic benchmarking methods for LLMs. The repository can be found at this link.
中文: 本综述分析了为应对数据污染风险从静态基准测试向动态基准测试的转变,指出了现有评估标准的不足,并为动态基准测试提出了优化设计原则。
English: This survey analyzes the shift from static to dynamic benchmarking in large language models to address data contamination risks, identifies gaps in current evaluation standards, and proposes optimal design principles for dynamic benchmarks.

Authors:François Charton
Title: Int2Int: a framework for mathematics with transformers
Abstract:
This paper documents Int2Int, an open source code base for using transformers on problems of mathematical research, with a focus on number theory and other problems involving integers. Int2Int is a complete PyTorch implementation of a transformer architecture, together with training and evaluation loops, and classes and functions to represent, generate and decode common mathematical objects. Ancillary code for data preparation, and Jupyter Notebooks for visualizing experimental results are also provided. This document presents the main features of Int2Int, serves as its user manual, and provides guidelines on how to extend it. Int2Int is released under the MIT licence, at https://github.com/f-charton/Int2Int.
中文: Int2Int是一个基于PyTorch的开源Transformer框架,专注于数论等整数相关数学研究,提供完整实现、训练工具及可视化支持。
English: Int2Int is an open-source PyTorch-based transformer framework designed for mathematical research, particularly in number theory, providing complete implementation, training tools, and visualization support for integer-related problems.

Authors:Yijia Xiao, Wanjia Zhao, Junkai Zhang, Yiqiao Jin, Han Zhang, Zhicheng Ren, Renliang Sun, Haixin Wang, Guancheng Wan, Pan Lu, Xiao Luo, Yu Zhang, James Zou, Yizhou Sun, Wei Wang
Title: Protein Large Language Models: A Comprehensive Survey
Abstract:
Protein-specific large language models (Protein LLMs) are revolutionizing protein science by enabling more efficient protein structure prediction, function annotation, and design. While existing surveys focus on specific aspects or applications, this work provides the first comprehensive overview of Protein LLMs, covering their architectures, training datasets, evaluation metrics, and diverse applications. Through a systematic analysis of over 100 articles, we propose a structured taxonomy of state-of-the-art Protein LLMs, analyze how they leverage large-scale protein sequence data for improved accuracy, and explore their potential in advancing protein engineering and biomedical research. Additionally, we discuss key challenges and future directions, positioning Protein LLMs as essential tools for scientific discovery in protein science. Resources are maintained at https://github.com/Yijia-Xiao/Protein-LLM-Survey.
中文摘要:本研究首次系统综述了蛋白质大语言模型,涵盖其架构、应用与挑战,确立了其在推动蛋白质科学发展的关键工具地位。
English Summary: This work presents the first comprehensive survey of Protein LLMs, detailing their architectures, applications, and challenges while positioning them as essential tools for advancing protein science.

Authors:Yushi Zhang, Shuai Su, Yong Wang, Yanzhong Yao
Title: Hard constraint learning approaches with trainable influence functions for evolutionary equations
Abstract:
This paper develops a novel deep learning approach for solving evolutionary equations, which integrates sequential learning strategies with an enhanced hard constraint strategy featuring trainable parameters, addressing the low computational accuracy of standard Physics-Informed Neural Networks (PINNs) in large temporal domains.Sequential learning strategies divide a large temporal domain into multiple subintervals and solve them one by one in a chronological order, which naturally respects the principle of causality and improves the stability of the PINN solution. The improved hard constraint strategy strictly ensures the continuity and smoothness of the PINN solution at time interval nodes, and at the same time passes the information from the previous interval to the next interval, which avoids the incorrect/trivial solution at the position far from the initial time. Furthermore, by investigating the requirements of different types of equations on hard constraints, we design a novel influence function with trainable parameters for hard constraints, which provides theoretical and technical support for the effective implementations of hard constraint strategies, and significantly improves the universality and computational accuracy of our method. In addition, an adaptive time-domain partitioning algorithm is proposed, which plays an important role in the application of the proposed method as well as in the improvement of computational efficiency and accuracy. Numerical experiments verify the performance of the method. The data and code accompanying this paper are available at https://github.com/zhizhi4452/HCS.
Chinese: 本文提出了一种新颖的深度学习算法,通过结合序列学习与具有可训练参数的改进硬约束策略,有效解决了物理信息神经网络在大时间域中计算精度不足的问题。
English: This paper introduces a novel deep learning method that combines sequential learning with an enhanced hard constraint strategy featuring trainable parameters to significantly improve the computational accuracy of Physics-Informed Neural Networks for evolutionary equations over large temporal domains.

Authors:Ruoyu Guo, Haochen Qiu
Title: Pursuing Top Growth with Novel Loss Function
Abstract:
Making consistently profitable financial decisions in a continuously evolving and volatile stock market has always been a difficult task. Professionals from different disciplines have developed foundational theories to anticipate price movement and evaluate securities such as the famed Capital Asset Pricing Model (CAPM). In recent years, the role of artificial intelligence (AI) in asset pricing has been growing. Although the black-box nature of deep learning models lacks interpretability, they have continued to solidify their position in the financial industry. We aim to further enhance AI's potential and utility by introducing a return-weighted loss function that will drive top growth while providing the ML models a limited amount of information. Using only publicly accessible stock data (open/close/high/low, trading volume, sector information) and several technical indicators constructed from them, we propose an efficient daily trading system that detects top growth opportunities. Our best models achieve 61.73% annual return on daily rebalancing with an annualized Sharpe Ratio of 1.18 over 1340 testing days from 2019 to 2024, and 37.61% annual return with an annualized Sharpe Ratio of 0.97 over 1360 testing days from 2005 to 2010. The main drivers for success, especially independent of any domain knowledge, are the novel return-weighted loss function, the integration of categorical and continuous data, and the ML model architecture. We also demonstrate the superiority of our novel loss function over traditional loss functions via several performance metrics and statistical evidence.
中文: 本研究通过引入收益加权损失函数,仅利用公开数据和衍生技术指标,在无需领域知识的情况下构建了高效AI交易系统,实现了高年化收益率与夏普比率。
English: This study introduces a return-weighted loss function to enhance AI-driven stock trading, achieving high returns and Sharpe ratios by using only public data and technical indicators without domain knowledge.

Authors:Xu Wang, Jiaju Kang, Puyu Han, Yubao Zhao, Qian Liu, Liwenfei He, Lingqiong Zhang, Lingyun Dai, Yongcheng Wang, Jie Tao
Title: ECG-Expert-QA: A Benchmark for Evaluating Medical Large Language Models in Heart Disease Diagnosis
Abstract:
We present ECG-Expert-QA, a comprehensive multimodal dataset for evaluating diagnostic capabilities in electrocardiogram (ECG) interpretation. It combines real-world clinical ECG data with systematically generated synthetic cases, covering 12 essential diagnostic tasks and totaling 47,211 expert-validated QA pairs. These encompass diverse clinical scenarios, from basic rhythm recognition to complex diagnoses involving rare conditions and temporal changes. A key innovation is the support for multi-turn dialogues, enabling the development of conversational medical AI systems that emulate clinician-patient or interprofessional interactions. This allows for more realistic assessment of AI models' clinical reasoning, diagnostic accuracy, and knowledge integration. Constructed through a knowledge-guided framework with strict quality control, ECG-Expert-QA ensures linguistic and clinical consistency, making it a high-quality resource for advancing AI-assisted ECG interpretation. It challenges models with tasks like identifying subtle ischemic changes and interpreting complex arrhythmias in context-rich scenarios. To promote research transparency and collaboration, the dataset, accompanying code, and prompts are publicly released at https://github.com/Zaozzz/ECG-Expert-QA
中文: ECG-Expert-QA是一个结合真实与合成心电图案例的多模态数据集,包含47,211个专家验证的问答对,支持多轮对话功能,旨在推进临床推理和诊断准确性的会话式医疗AI系统发展。
English: ECG-Expert-QA is a multimodal dataset combining real and synthetic ECG cases with 47,211 expert-validated QA pairs, featuring multi-turn dialogues to advance conversational medical AI systems for clinical reasoning and diagnostic accuracy.

Authors:Younghoon Na, Seunghun Oh, Seongji Ko, Hyunkyung Lee
Title: PixleepFlow: A Pixel-Based Lifelog Framework for Predicting Sleep Quality and Stress Level
Abstract:
The analysis of lifelogs can yield valuable insights into an individual's daily life, particularly with regard to their health and well-being. The accurate assessment of quality of life is necessitated by the use of diverse sensors and precise synchronization. To rectify this issue, this study proposes the image-based sleep quality and stress level estimation flow (PixleepFlow). PixleepFlow employs a conversion methodology into composite image data to examine sleep patterns and their impact on overall health. Experiments were conducted using lifelog datasets to ascertain the optimal combination of data formats. In addition, we identified which sensor information has the greatest influence on the quality of life through Explainable Artificial Intelligence(XAI). As a result, PixleepFlow produced more significant results than various data formats. This study was part of a written-based competition, and the additional findings from the lifelog dataset are detailed in Section Section IV. More information about PixleepFlow can be found at https://github.com/seongjiko/Pixleep.
中文: 本研究提出基于图像的PixleepFlow方法,通过将生命日志数据转换为复合图像来评估睡眠质量和压力水平,并利用可解释人工智能技术识别关键影响因素,取得了优于传统数据格式的分析效果。
English: This study introduces PixleepFlow, an image-based method that analyzes lifelog data to estimate sleep quality and stress levels, achieving superior results through composite image conversion and explainable AI techniques.

Authors:Ziyue Yang, Chengrui Chen, Yong Peng, Qiong Chen, Wanzeng Kong
Title: CSSSTN: A Class-sensitive Subject-to-subject Semantic Style Transfer Network for EEG Classification in RSVP Tasks
Abstract:
The Rapid Serial Visual Presentation (RSVP) paradigm represents a promising application of electroencephalography (EEG) in Brain-Computer Interface (BCI) systems. However, cross-subject variability remains a critical challenge, particularly for BCI-illiterate users who struggle to effectively interact with these systems. To address this issue, we propose the Class-Sensitive Subject-to-Subject Semantic Style Transfer Network (CSSSTN), which incorporates a class-sensitive approach to align feature distributions between golden subjects (BCI experts) and target (BCI-illiterate) users on a class-by-class basis. Building on the SSSTN framework, CSSSTN incorporates three key components: (1) subject-specific classifier training, (2) a unique style loss to transfer class-discriminative features while preserving semantic information through a modified content loss, and (3) an ensemble approach to integrate predictions from both source and target domains. We evaluated CSSSTN using both a publicly available dataset and a self-collected dataset. Experimental results demonstrate that CSSSTN outperforms state-of-the-art methods, achieving mean balanced accuracy improvements of 6.4\% on the Tsinghua dataset and 3.5\% on the HDU dataset, with notable benefits for BCI-illiterate users. Ablation studies confirm the effectiveness of each component, particularly the class-sensitive transfer and the use of lower-layer features, which enhance transfer performance and mitigate negative transfer. Additionally, CSSSTN achieves competitive results with minimal target data, reducing calibration time and effort. These findings highlight the practical potential of CSSSTN for real-world BCI applications, offering a robust and scalable solution to improve the performance of BCI-illiterate users while minimizing reliance on extensive training data. Our code is available at https://github.com/ziyuey/CSSSTN.
中文摘要:提出的CSSSTN模型通过类别敏感的特征迁移方法,有效解决了脑机接口系统中跨被试差异的难题,在提升BCI不熟练用户性能的同时显著减少了校准数据需求。
English Summary: The proposed CSSSTN model effectively addresses cross-subject variability in EEG-based BCI systems by transferring class-discriminative features from expert to novice users, achieving significant accuracy improvements while reducing calibration requirements.

Authors:Francesco Stefano Carzaniga, Gary Tom Hoppeler, Michael Hersche, Kaspar Anton Schindler, Abbas Rahimi
Title: The Case for Cleaner Biosignals: High-fidelity Neural Compressor Enables Transfer from Cleaner iEEG to Noisier EEG
Abstract:
All data modalities are not created equal, even when the signal they measure comes from the same source. In the case of the brain, two of the most important data modalities are the scalp electroencephalogram (EEG), and the intracranial electroencephalogram (iEEG). They are used by human experts, supported by deep learning (DL) models, to accomplish a variety of tasks, such as seizure detection and motor imagery classification. Although the differences between EEG and iEEG are well understood by human experts, the performance of DL models across these two modalities remains under-explored. To help characterize the importance of clean data on the performance of DL models, we propose BrainCodec, a high-fidelity EEG and iEEG neural compressor. We find that training BrainCodec on iEEG and then transferring to EEG yields higher reconstruction quality than training on EEG directly. In addition, we also find that training BrainCodec on both EEG and iEEG improves fidelity when reconstructing EEG. Our work indicates that data sources with higher SNR, such as iEEG, provide better performance across the board also in the medical time-series domain. BrainCodec also achieves up to a 64x compression on iEEG and EEG without a notable decrease in quality. BrainCodec markedly surpasses current state-of-the-art compression models both in final compression ratio and in reconstruction fidelity. We also evaluate the fidelity of the compressed signals objectively on a seizure detection and a motor imagery task performed by standard DL models. Here, we find that BrainCodec achieves a reconstruction fidelity high enough to ensure no performance degradation on the downstream tasks. Finally, we collect the subjective assessment of an expert neurologist, that confirms the high reconstruction quality of BrainCodec in a realistic scenario. The code is available at https://github.com/IBM/eeg-ieeg-brain-compressor.
中文摘要:BrainCodec是一种针对脑电图和颅内脑电图的高保真神经压缩器,利用高信噪比的iEEG数据训练可提升模型性能,在保持信号质量的同时实现高效压缩,确保下游医疗任务无性能损失。
English Summary: BrainCodec is a neural compressor that achieves high-fidelity compression for EEG and iEEG brain signals, demonstrating superior performance when trained on iEEG data and maintaining signal quality for downstream medical tasks.

Authors:Tianhong Li, Qinyi Sun, Lijie Fan, Kaiming He
Title: Fractal Generative Models
Abstract:
Modularization is a cornerstone of computer science, abstracting complex functions into atomic building blocks. In this paper, we introduce a new level of modularization by abstracting generative models into atomic generative modules. Analogous to fractals in mathematics, our method constructs a new type of generative model by recursively invoking atomic generative modules, resulting in self-similar fractal architectures that we call fractal generative models. As a running example, we instantiate our fractal framework using autoregressive models as the atomic generative modules and examine it on the challenging task of pixel-by-pixel image generation, demonstrating strong performance in both likelihood estimation and generation quality. We hope this work could open a new paradigm in generative modeling and provide a fertile ground for future research. Code is available at https://github.com/LTH14/fractalgen.
中文: 本文提出分形生成模型,通过递归调用原子生成模块构建自相似结构,在图像生成任务中表现出色,为生成建模开辟了新范式。
English: This paper introduces fractal generative models, which recursively use atomic generative modules to create self-similar architectures, demonstrating strong performance in image generation tasks and offering a new paradigm for generative modeling.

Authors:Vishal Thengane, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Lu Yin, Xiatian Zhu, Salman Khan
Title: CLIMB-3D: Continual Learning for Imbalanced 3D Instance Segmentation
Abstract:
While 3D instance segmentation (3DIS) has advanced significantly, existing methods typically assume that all object classes are known in advance and are uniformly distributed. However, this assumption is unrealistic in dynamic, real-world environments where new classes emerge gradually and exhibit natural imbalance. Although some approaches have addressed class emergence, they often overlook class imbalance, resulting in suboptimal performance -- particularly on rare categories. To tackle this challenge, we propose CLIMB-3D, a unified framework for \textbf{CL}ass-incremental \textbf{Imb}alance-aware \textbf{3D}IS. Building upon established exemplar replay (ER) strategies, we show that ER alone is insufficient to achieve robust performance under constrained memory conditions. To mitigate this, we introduce a novel pseudo-label generator (PLG) that extends supervision to previously learned categories by leveraging predictions from a frozen prior model. Despite its promise, PLG tends to bias towards frequent classes. Therefore, we propose a class-balanced re-weighting (CBR) scheme, that estimates object frequencies from pseudo-labels and dynamically adjusts training bias -- without requiring access to past data. We design and evaluate three incremental scenarios for 3DIS on the challenging ScanNet200 dataset, and additionally on semantic segmentation on ScanNetV2. Our approach achieves state-of-the-art results, surpassing prior work by up to 16.76\% mAP for instance segmentation and approximately 30\% mIoU for semantic segmentation, demonstrating strong generalization across both frequent and rare classes.
中文: CLIMB-3D 提出了一种解决三维实例分割中类别不平衡和增量学习问题的统一框架,通过结合样本回放、伪标签生成器和类别平衡重加权机制,在稀有和常见类别上均实现了最先进的性能。
English: CLIMB-3D is a novel framework addressing class imbalance and incremental learning in 3D instance segmentation by combining exemplar replay with a pseudo-label generator and class-balanced re-weighting, achieving state-of-the-art performance on rare and frequent classes.

Authors:Runpeng Yu, Xinyin Ma, Xinchao Wang
Title: Introducing Visual Perception Token into Multimodal Large Language Model
Abstract:
To utilize visual information, Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, MLLM still lacks the autonomous capability to control its own visual perception processes, for example, selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of Visual Perception Token, aiming to empower MLLM with a mechanism to control its visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate text, and use them to trigger additional visual perception actions. The Region Selection Token explicitly identifies specific regions in an image that require further perception, while the Vision Re-Encoding Token uses its hidden states as control signals to guide additional visual perception processes. Extensive experiments demonstrate the advantages of these tokens in handling spatial reasoning, improving fine-grained understanding, and other tasks. On average, the introduction of Visual Perception Tokens improves the performance of a 2B model by 23.6\%, increasing its score from 0.572 to 0.708, and even outperforms a 7B parameter model by 13.4\% (from 0.624). Please check out our repo https://github.com/yu-rp/VisualPerceptionToken
中文摘要:视觉感知令牌的提出使多模态大语言模型能够自主控制其视觉感知过程,通过区域选择令牌和视觉重编码令牌实现选择性图像区域审查,从而显著提升空间推理和细粒度理解等任务的性能表现。
English Summary: The proposed Visual Perception Token empowers Multimodal Large Language Models with autonomous control over visual perception processes, enabling selective region review and enhanced visual re-encoding to significantly boost performance in spatial reasoning and fine-grained understanding tasks.

Authors:Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
Title: MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Abstract:
Multimodal Large Language Models (MLLMs) have experienced rapid progress in visual recognition tasks in recent years. Given their potential integration into many critical applications, it is important to understand the limitations of their visual perception. In this work, we study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images. We observe that their performance is very sensitive to the size of the visual subject of the question, and further show that this effect is in fact causal by conducting an intervention study. Next, we study the attention patterns of MLLMs when answering visual questions, and intriguingly find that they consistently know where to look, even when they provide the wrong answer. Based on these findings, we then propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself, in the form of attention and gradient maps, to enhance its perception of small visual details. We evaluate our proposed methods on two widely-used MLLMs and seven visual question answering benchmarks and show that they can significantly improve MLLMs' accuracy without requiring any training. Our results elucidate the risk of applying MLLMs to visual recognition tasks concerning small details and indicate that visual intervention using the model's internal state is a promising direction to mitigate this risk.
中文摘要:研究发现多模态大语言模型在感知图像细微视觉信息方面存在不足,但其注意力机制即使回答错误时仍能准确定位关键区域;通过利用模型内部的注意力和梯度图,开发出无需训练的可视干预方法,显著提升了模型对微小视觉细节的识别准确率。
English Summary: This study reveals that Multimodal Large Language Models struggle with perceiving small visual details in images, but their attention mechanisms correctly identify relevant areas even when answering incorrectly, leading to the development of training-free intervention methods that significantly improve accuracy by leveraging the models' internal attention and gradient maps.

Authors:Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An
Title: LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
Abstract:
As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that highly depend on this capability. Speculative decoding (SD) offers a promising lossless acceleration technique compared to lossy alternatives such as quantization and model cascades. However, most state-of-the-art SD methods are trained on short texts (typically fewer than 4k tokens), making them unsuitable for long-context scenarios. Specifically, adapting these methods to long contexts presents three key challenges: (1) the excessive memory demands posed by draft models due to large Key-Value (KV) cache; (2) performance degradation resulting from the mismatch between short-context training and long-context inference; and (3) inefficiencies in tree attention mechanisms when managing long token sequences. This work introduces LongSpec, a framework that addresses these challenges through three core innovations: a memory-efficient draft model with a constant-sized KV cache; novel position indices that mitigate the training-inference mismatch; and an attention aggregation strategy that combines fast prefix computation with standard tree attention to enable efficient decoding. Experimental results confirm the effectiveness of LongSpec, achieving up to a 3.26x speedup over strong Flash Attention baselines across five long-context understanding datasets, as well as a 2.25x reduction in wall-clock time on the AIME24 long reasoning task with the QwQ model, demonstrating significant latency improvements for long-context applications. The code is available at https://github.com/sail-sg/LongSpec.
Chinese: LongSpec是一种创新框架,通过内存高效的草稿模型、专用位置索引和优化注意力机制,解决了现有推测解码方法在长上下文场景中的局限性,在长上下文应用中实现了高达3.26倍的加速效果。
English: LongSpec is a novel framework that overcomes the limitations of existing speculative decoding methods in long-context scenarios through memory-efficient draft models, specialized position indices, and optimized attention mechanisms, achieving up to 3.26x speedup in long-context applications.

Authors:Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, Yingying Zhang, Fei Yin, Jiahua Dong, Zhiwei Li, Bao-Long Bi, Ling-Rui Mei, Junfeng Fang, Xiao Liang, Zhijiang Guo, Le Song, Cheng-Lin Liu
Title: From System 1 to System 2: A Survey of Reasoning Large Language Models
Abstract:
Achieving human-level intelligence requires refining the transition from the fast, intuitive System 1 to the slower, more deliberate System 2 reasoning. While System 1 excels in quick, heuristic decisions, System 2 relies on logical reasoning for more accurate judgments and reduced biases. Foundational Large Language Models (LLMs) excel at fast decision-making but lack the depth for complex reasoning, as they have not yet fully embraced the step-by-step analysis characteristic of true System 2 thinking. Recently, reasoning LLMs like OpenAI's o1/o3 and DeepSeek's R1 have demonstrated expert-level performance in fields such as mathematics and coding, closely mimicking the deliberate reasoning of System 2 and showcasing human-like cognitive abilities. This survey begins with a brief overview of the progress in foundational LLMs and the early development of System 2 technologies, exploring how their combination has paved the way for reasoning LLMs. Next, we discuss how to construct reasoning LLMs, analyzing their features, the core methods enabling advanced reasoning, and the evolution of various reasoning LLMs. Additionally, we provide an overview of reasoning benchmarks, offering an in-depth comparison of the performance of representative reasoning LLMs. Finally, we explore promising directions for advancing reasoning LLMs and maintain a real-time \href{https://github.com/zzli2022/Awesome-Slow-Reason-System}{GitHub Repository} to track the latest developments. We hope this survey will serve as a valuable resource to inspire innovation and drive progress in this rapidly evolving field.
中文摘要:实现人类水平智能需要从快速直觉的系统1思维推进到深思熟虑的系统2推理,而像OpenAI的o1和DeepSeek的R1这样的推理大语言模型通过逐步分析在复杂任务中展现出专家级性能,证明了这一进步。
English Summary: Achieving human-level intelligence requires advancing from fast, intuitive System 1 reasoning to deliberate System 2 thinking, which reasoning LLMs like OpenAI's o1 and DeepSeek's R1 have demonstrated by showing expert-level performance in complex tasks through step-by-step analysis.

Authors:Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, Tuo Zhao
Title: COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs
Abstract:
Large Language Models (LLMs) have demonstrated remarkable success across various domains, yet their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit. While adaptive optimizers such as AdamW are widely used, they suffer from critical limitations, including an inability to capture interdependencies between coordinates and high memory consumption. Subsequent research, exemplified by SOAP, attempts to better capture coordinate interdependence but incurs greater memory overhead, limiting scalability for massive LLMs. An alternative approach aims to reduce memory consumption through low-dimensional projection, but this leads to substantial approximation errors, resulting in less effective optimization (e.g., in terms of per-token efficiency). In this paper, we propose COSMOS, a novel hybrid optimizer that leverages the varying importance of eigensubspaces in the gradient matrix to achieve memory efficiency without compromising optimization performance. The design of COSMOS is motivated by our empirical insights and practical considerations. Specifically, COSMOS applies SOAP to the leading eigensubspace, which captures the primary optimization dynamics, and MUON to the remaining eigensubspace, which is less critical but computationally expensive to handle with SOAP. This hybrid strategy significantly reduces memory consumption while maintaining robust optimization performance, making it particularly suitable for massive LLMs. Numerical experiments on various datasets and transformer architectures are provided to demonstrate the effectiveness of COSMOS. Our code is available at https://github.com/lliu606/COSMOS.
中文: COSMOS是一种新型混合优化器,通过对关键特征子空间应用SOAP和对次要子空间使用MUON,在保证优化性能的同时显著降低内存消耗,特别适用于大规模语言模型。
English: COSMOS is a novel hybrid optimizer that efficiently manages memory usage by applying SOAP to critical eigensubspaces and MUON to less important ones, maintaining robust optimization performance for large language models.

Authors:André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li
Title: DIS-CO: Discovering Copyrighted Content in VLMs Training Data
Abstract:
How can we verify whether copyrighted content was used to train a large vision-language model (VLM) without direct access to its training data? Motivated by the hypothesis that a VLM is able to recognize images from its training corpus, we propose DIS-CO, a novel approach to infer the inclusion of copyrighted content during the model's development. By repeatedly querying a VLM with specific frames from targeted copyrighted material, DIS-CO extracts the content's identity through free-form text completions. To assess its effectiveness, we introduce MovieTection, a benchmark comprising 14,000 frames paired with detailed captions, drawn from films released both before and after a model's training cutoff. Our results show that DIS-CO significantly improves detection performance, nearly doubling the average AUC of the best prior method on models with logits available. Our findings also highlight a broader concern: all tested models appear to have been exposed to some extent to copyrighted content. Our code and data are available at https://github.com/avduarte333/DIS-CO
中文: DIS-CO是一种通过向视觉语言模型输入特定帧并分析其文本补全来检测训练数据中是否包含受版权保护内容的新方法,其检测性能显著优于现有技术,并揭示所有测试模型均存在不同程度的版权内容暴露。
English: DIS-CO is a novel method that detects whether copyrighted content was used to train vision-language models by querying them with specific frames and analyzing text completions, significantly outperforming prior methods and revealing widespread exposure to such content across tested models.

Authors:Yihong Liu, Runsheng Chen, Lea Hirlimann, Ahmad Dawar Hakimi, Mingyang Wang, Amir Hossein Kargaran, Sascha Rothe, François Yvon, Hinrich Schütze
Title: On Relation-Specific Neurons in Large Language Models
Abstract:
In large language models (LLMs), certain neurons can store distinct pieces of knowledge learned during pretraining. While knowledge typically appears as a combination of relations and entities, it remains unclear whether some neurons focus on a relation itself -- independent of any entity. We hypothesize such neurons detect a relation in the input text and guide generation involving such a relation. To investigate this, we study the Llama-2 family on a chosen set of relations with a statistics-based method. Our experiments demonstrate the existence of relation-specific neurons. We measure the effect of selectively deactivating candidate neurons specific to relation $r$ on the LLM's ability to handle (1) facts whose relation is $r$ and (2) facts whose relation is a different relation $r' \neq r$. With respect to their capacity for encoding relation information, we give evidence for the following three properties of relation-specific neurons. $\textbf{(i) Neuron cumulativity.}$ The neurons for $r$ present a cumulative effect so that deactivating a larger portion of them results in the degradation of more facts in $r$. $\textbf{(ii) Neuron versatility.}$ Neurons can be shared across multiple closely related as well as less related relations. Some relation neurons transfer across languages. $\textbf{(iii) Neuron interference.}$ Deactivating neurons specific to one relation can improve LLM generation performance for facts of other relations. We will make our code publicly available at https://github.com/cisnlp/relation-specific-neurons.
中文: 本研究在大语言模型中识别出关系特定神经元,它们能检测文本中的关系并引导生成,通过选择性失活实验揭示了这些神经元的累积性、通用性和干扰性特征。
English: This study identifies relation-specific neurons in large language models that detect relations in text and guide generation, revealing their cumulative, versatile, and interfering properties through selective deactivation experiments.

Authors:Yihong Liu, Runsheng Chen, Lea Hirlimann, Ahmad Dawar Hakimi, Mingyang Wang, Amir Hossein Kargaran, Sascha Rothe, François Yvon, Hinrich Schütze
Title: On Relation-Specific Neurons in Large Language Models
Abstract:
In large language models (LLMs), certain \emph{neurons} can store distinct pieces of knowledge learned during pretraining. While factual knowledge typically appears as a combination of \emph{relations} and \emph{entities}, it remains unclear whether some neurons focus on a relation itself -- independent of any entity. We hypothesize such neurons \emph{detect} a relation in the input text and \emph{guide} generation involving such a relation. To investigate this, we study the LLama-2 family on a chosen set of relations, with a \textit{statistics}-based method. Our experiments demonstrate the existence of relation-specific neurons. We measure the effect of selectively deactivating candidate neurons specific to relation $r$ on the LLM's ability to handle (1) facts involving relation $r$ and (2) facts involving a different relation $r' \neq r$. With respect to their capacity for encoding relation information, we give evidence for the following three properties of relation-specific neurons. \textbf{(i) Neuron cumulativity.} Multiple neurons jointly contribute to processing facts involving relation $r$, with no single neuron fully encoding a fact in $r$ on its own. \textbf{(ii) Neuron versatility.} Neurons can be shared across multiple closely related as well as less related relations. In addition, some relation neurons transfer across languages. \textbf{(iii) Neuron interference.} Deactivating neurons specific to one relation can improve LLMs' factual recall performance for facts of other relations. We make our code and data publicly available at https://github.com/cisnlp/relation-specific-neurons.
中文: 本研究在大语言模型中识别出关系特定神经元,它们能检测文本中的关系并引导生成,通过选择性失活实验揭示了这些神经元的累积性、通用性和干扰性特征。
English: This study identifies relation-specific neurons in large language models that detect relations in text and guide generation, revealing their cumulative, versatile, and interfering properties through selective deactivation experiments.

Authors:Inbar Gat, Sigal Raab, Guy Tevet, Yuval Reshef, Amit H. Bermano, Daniel Cohen-Or
Title: AnyTop: Character Animation Diffusion with Any Topology
Abstract:
Generating motion for arbitrary skeletons is a longstanding challenge in computer graphics, remaining largely unexplored due to the scarcity of diverse datasets and the irregular nature of the data. In this work, we introduce AnyTop, a diffusion model that generates motions for diverse characters with distinct motion dynamics, using only their skeletal structure as input. Our work features a transformer-based denoising network, tailored for arbitrary skeleton learning, integrating topology information into the traditional attention mechanism. Additionally, by incorporating textual joint descriptions into the latent feature representation, AnyTop learns semantic correspondences between joints across diverse skeletons. Our evaluation demonstrates that AnyTop generalizes well, even with as few as three training examples per topology, and can produce motions for unseen skeletons as well. Furthermore, our model's latent space is highly informative, enabling downstream tasks such as joint correspondence, temporal segmentation and motion editing. Our webpage, https://anytop2025.github.io/Anytop-page, includes links to videos and code.
中文:AnyTop是一种扩散模型,仅通过骨骼结构输入即可为不同角色生成动作,通过将拓扑信息融入注意力机制和文本关节描述,学习跨骨架的语义对应关系,并支持动作编辑等下游任务。
English: AnyTop is a diffusion model that generates motion for various skeletons using only their structure as input, incorporating topology into attention mechanisms and textual joint descriptions to learn semantic correspondences and enable tasks like motion editing.

Authors:Zhenghao Liu, Haolan Wang, Xinze Li, Qiushi Xiong, Xiaocui Yang, Yu Gu, Yukun Yan, Qi Shi, Fangfang Li, Ge Yu, Maosong Sun
Title: HIPPO: Enhancing the Table Understanding Capability of Large Language Models through Hybrid-Modal Preference Optimization
Abstract:
Tabular data contains rich structural semantics and plays a crucial role in organizing and manipulating information. To better capture these structural semantics, this paper introduces the HybrId-modal Preference oPtimizatiOn (HIPPO) model, which represents tables using both text and image, and optimizes MLLMs to effectively learn more comprehensive table information from these multiple modalities. Specifically, HIPPO samples model responses from hybrid-modal table representations and designs a modality-consistent sampling strategy to enhance response diversity and mitigate modality bias during DPO training. Experimental results on table question answering and table fact verification tasks demonstrate the effectiveness of HIPPO, achieving a 4% improvement over various table reasoning models. Further analysis reveals that HIPPO not only enhances reasoning abilities based on unimodal table representations but also facilitates the extraction of crucial and distinct semantics from different modal representations. All data and codes are available at https://github.com/NEUIR/HIPPO.
中文摘要:本文提出的HIPPO模型采用文本与图像混合表示方法优化多模态学习,通过模态一致采样策略提升表格推理能力,在多项任务中实现4%的性能提升。
English Summary: This paper introduces the HIPPO model, which uses hybrid text-image representations to enhance table understanding and achieves a 4% performance improvement on reasoning tasks through modality-consistent optimization.

Authors:Hao Gu, Wei Li, Lujun Li, Qiyuan Zhu, Mark Lee, Shengjie Sun, Wei Xue, Yike Guo
Title: Delta Decompression for MoE-based LLMs Compression
Abstract:
Mixture-of-Experts (MoE) architectures in large language models (LLMs) achieve exceptional performance, but face prohibitive storage and memory requirements. To address these challenges, we present $D^2$-MoE, a new delta decompression compressor for reducing the parameters of MoE LLMs. Based on observations of expert diversity, we decompose their weights into a shared base weight and unique delta weights. Specifically, our method first merges each expert's weight into the base weight using the Fisher information matrix to capture shared components. Then, we compress delta weights through Singular Value Decomposition (SVD) by exploiting their low-rank properties. Finally, we introduce a semi-dynamical structured pruning strategy for the base weights, combining static and dynamic redundancy analysis to achieve further parameter reduction while maintaining input adaptivity. In this way, our $D^2$-MoE successfully compact MoE LLMs to high compression ratios without additional training. Extensive experiments highlight the superiority of our approach, with over 13% performance gains than other compressors on Mixtral|Phi-3.5|DeepSeek|Qwen2 MoE LLMs at 40$\sim$60% compression rates. Codes are available in https://github.com/lliai/D2MoE.
中文: 提出的$D^2$-MoE方法通过将专家权重分解为共享基础权重和独特增量权重,结合Fisher合并、SVD压缩和半动态剪枝策略,无需重新训练即可实现高压缩比,在40-60%压缩率下比其他压缩方法性能提升超过13%。
English: The proposed $D^2$-MoE method effectively compresses Mixture-of-Experts LLMs by decomposing expert weights into shared base and unique delta components, then applying Fisher merging, SVD compression, and semi-dynamical pruning to achieve high compression ratios without retraining, outperforming other compressors by over 13% at 40-60% compression rates.

Authors:Zhenghao Liu, Xingsheng Zhu, Tianshuo Zhou, Xinyi Zhang, Xiaoyuan Yi, Yukun Yan, Ge Yu, Maosong Sun
Title: Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts
Abstract:
With the rapid advancement of Multi-modal Large Language Models (MLLMs), their capability in understanding both images and text has greatly improved. However, their potential for leveraging multi-modal contextual information in Retrieval-Augmented Generation (RAG) remains largely underexplored. To address this gap, this paper introduces Multi-Modal Retrieval-Augmented Generation (M$^2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models in leveraging knowledge from multi-modal retrieval documents. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. All tasks are set in an open-domain setting, requiring RAG models to retrieve query-relevant information from a multi-modal document collection and use it as contextual input for RAG modeling. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT), an instruction tuning method that optimizes MLLMs within multi-modal contexts. Our experiments demonstrate the effectiveness of MM-RAIT by significantly improving the quality of responses generated by different RAG models, outperforming MiniCPM-V 2.6 and Qwen2-VL with 34% and 33% gains, respectively. All data and code are available at https://github.com/NEUIR/M2RAG.
中文: 本文提出了M²RAG基准,用于评估多模态大语言模型在检索增强生成中利用多模态上下文信息的能力,并开发了MM-RAIT指令调优方法,在四个开放域任务中显著提升了不同模型的性能表现。
English: This paper introduces M²RAG, a benchmark for evaluating Multi-modal Large Language Models' ability to utilize multi-modal contextual information in Retrieval-Augmented Generation, along with MM-RAIT, an instruction tuning method that significantly enhances model performance across four open-domain tasks.

Authors:Yi-Kai Zhang, De-Chuan Zhan, Han-Jia Ye
Title: Capability Instruction Tuning: A New Paradigm for Dynamic LLM Routing
Abstract:
Large Language Models (LLMs) have demonstrated human-like instruction-following abilities, particularly those exceeding 100 billion parameters. The combined capability of some smaller, resource-friendly LLMs can address most of the instructions that larger LLMs excel at. In this work, we explore how to route the best-performing LLM for each instruction to achieve better overall performance. We develop a new paradigm, constructing capability instructions with model capability representation, user instruction, and performance inquiry prompts to assess the performance. To learn from capability instructions, we introduce a new end-to-end framework called Model Selection with Aptitude Test (Model-SAT), which generates positive and negative samples based on what different models perform well or struggle with. Model-SAT uses a model capability encoder that extends its model representation to a lightweight LLM. Our experiments show that Model-SAT understands the performance dimensions of candidate models and provides the probabilities of their capability to handle various instructions. Additionally, during deployment, a new model can quickly infer its aptitude test results across 50 tasks, each with 20 shots. Model-SAT performs state-of-the-art model routing without candidate inference and in real-world new model-released scenarios. The code is available at https://github.com/Now-Join-Us/CIT-LLM-Routing
超过1000亿参数的大型语言模型展现出类似人类的指令跟随能力,本研究提出Model-SAT框架,通过测试模型能力将指令路由至最佳执行模型而无需候选推理,在现实场景中实现最优性能。
Large language models with over 100 billion parameters show human-like instruction-following abilities, and this research introduces Model-SAT, a framework that routes instructions to the best-performing model by testing their capabilities without needing candidate inference, achieving top performance in real-world scenarios.

Authors:Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen
Title: Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
Abstract:
We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz. This multi-codebook setup ensures that speech tokens retain both semantic and acoustic information. To further enhance modeling, an independent audio head is employed to process audio tokens, effectively capturing their unique characteristics. To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling. Following alignment, the model excels in real-time speech-based conversation and exhibits outstanding question-answering capabilities, demonstrating its versatility and efficiency. The proposed model demonstrates superior performance in real-time spoken dialogue and exhibits strong question-answering abilities. Our code, model and training data are available at https://github.com/baichuan-inc/Baichuan-Audio
中文: Baichuan-Audio 是一款端到端的音频大语言模型,集成了音频理解与生成功能,采用文本引导的语音生成机制和两阶段预训练策略,在实时语音对话和问答中表现卓越。
English: Baichuan-Audio is an end-to-end audio large language model that integrates audio understanding and generation, featuring a text-guided speech generation mechanism and a two-stage pre-training strategy to excel in real-time speech interaction and question-answering.

Authors:Gabriele Berton, Carlo Masone
Title: MegaLoc: One Retrieval to Place Them All
Abstract:
Retrieving images from the same location as a given query is an important component of multiple computer vision tasks, like Visual Place Recognition, Landmark Retrieval, Visual Localization, 3D reconstruction, and SLAM. However, existing solutions are built to specifically work for one of these tasks, and are known to fail when the requirements slightly change or when they meet out-of-distribution data. In this paper we combine a variety of existing methods, training techniques, and datasets to train a retrieval model, called MegaLoc, that is performant on multiple tasks. We find that MegaLoc (1) achieves state of the art on a large number of Visual Place Recognition datasets, (2) impressive results on common Landmark Retrieval datasets, and (3) sets a new state of the art for Visual Localization on the LaMAR datasets, where we only changed the retrieval method to the existing localization pipeline. The code for MegaLoc is available at https://github.com/gmberton/MegaLoc
中文摘要:MegaLoc是一种多功能图像检索模型,通过整合多种现有方法和训练技术,在视觉位置识别、地标检索和视觉定位等多项计算机视觉任务中均实现了最先进的性能表现。
English Summary: MegaLoc is a versatile image retrieval model that achieves state-of-the-art performance across multiple computer vision tasks including Visual Place Recognition, Landmark Retrieval, and Visual Localization by integrating various existing methods and training techniques.

Authors:Hogun Kee, Wooseok Oh, Minjae Kang, Hyemin Ahn, Songhwai Oh
Title: Tidiness Score-Guided Monte Carlo Tree Search for Visual Tabletop Rearrangement
Abstract:
In this paper, we present the tidiness score-guided Monte Carlo tree search (TSMCTS), a novel framework designed to address the tabletop tidying up problem using only an RGB-D camera. We address two major problems for tabletop tidying up problem: (1) the lack of public datasets and benchmarks, and (2) the difficulty of specifying the goal configuration of unseen objects. We address the former by presenting the tabletop tidying up (TTU) dataset, a structured dataset collected in simulation. Using this dataset, we train a vision-based discriminator capable of predicting the tidiness score. This discriminator can consistently evaluate the degree of tidiness across unseen configurations, including real-world scenes. Addressing the second problem, we employ Monte Carlo tree search (MCTS) to find tidying trajectories without specifying explicit goals. Instead of providing specific goals, we demonstrate that our MCTS-based planner can find diverse tidied configurations using the tidiness score as a guidance. Consequently, we propose TSMCTS, which integrates a tidiness discriminator with an MCTS-based tidying planner to find optimal tidied arrangements. TSMCTS has successfully demonstrated its capability across various environments, including coffee tables, dining tables, office desks, and bathrooms. The TTU dataset is available at: https://github.com/rllab-snu/TTU-Dataset.
中文: 本文提出TSMCTS新框架,通过结合整洁度评分判别器与蒙特卡洛树搜索,仅需RGB-D相机即可解决桌面整理问题,并在多种真实场景中成功验证了其有效性。
English: This paper introduces TSMCTS, a novel framework that combines a tidiness score discriminator with Monte Carlo tree search to solve tabletop tidying problems using only RGB-D camera input, successfully demonstrating its effectiveness across multiple real-world environments.

Authors:Boxuan Zhang, Ruqi Zhang
Title: CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought
Abstract:
Large language models (LLMs) excel in many tasks but struggle to accurately quantify uncertainty in their generated responses. This limitation makes it challenging to detect misinformation and ensure reliable decision-making. Existing uncertainty quantification (UQ) methods for LLMs are primarily prompt-wise rather than response-wise, often requiring multiple response samples, which incurs high computational costs. Moreover, LLMs have been shown to be overconfident, particularly when using reasoning steps to derive their answers. In this work, we propose CoT-UQ, a response-wise UQ framework that integrates LLMs' inherent reasoning capabilities through Chain-of-Thought (CoT) into the UQ process. CoT-UQ captures critical information during inference by extracting keywords from each reasoning step and assessing their importance to the final answer. This key reasoning information is then aggregated to produce a final uncertainty estimate. We conduct extensive experiments based on Llama Family with model sizes varying from 8B to 13B across logical and mathematical reasoning tasks. Experimental results demonstrate that CoT-UQ significantly outperforms existing UQ methods, achieving an average improvement of 5.9% AUROC compared to current UQ methods. The code is available at: https://github.com/ZBox1005/CoT-UQ.
Chinese: 本文提出CoT-UQ框架,通过利用大语言模型的思维链推理能力,从每个推理步骤中提取关键信息并评估其重要性,实现了响应式不确定性量化,在多项任务中以平均5.9%的AUROC提升显著优于现有方法,同时降低了计算成本。
English: This paper introduces CoT-UQ, a response-wise uncertainty quantification framework that leverages LLMs' Chain-of-Thought reasoning to extract and evaluate key information from each step, significantly outperforming existing methods by 5.9% AUROC on average while reducing computational costs.

Authors:Jie Zeng, Qianyu He, Qingyu Ren, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
Title: Order Matters: Investigate the Position Bias in Multi-constraint Instruction Following
Abstract:
Real-world instructions with multiple constraints pose a significant challenge to existing large language models (LLMs). An observation is that the LLMs exhibit dramatic performance fluctuation when disturbing the order of the incorporated constraints. Yet, none of the existing works has systematically investigated this position bias problem in the field of multi-constraint instruction following. To bridge this gap, we design a probing task where we quantitatively measure the difficulty distribution of the constraints by a novel Difficulty Distribution Index (CDDI). Through the experimental results, we find that LLMs are more performant when presented with the constraints in a ``hard-to-easy'' order. This preference can be generalized to LLMs with different architecture or different sizes of parameters. Additionally, we conduct an explanation study, providing an intuitive insight into the correlation between the LLM's attention and constraint orders. Our code and dataset are publicly available at https://github.com/meowpass/PBIF.
Chinese: 研究发现,大型语言模型在约束条件按从难到易的顺序呈现时表现更佳,这种位置偏好适用于不同架构和规模的模型,并通过难度分布指数和注意力分析得到验证。
English: Large language models perform better when constraints are ordered from hardest to easiest, a position bias that persists across different model architectures and sizes, as revealed through a novel difficulty distribution index and attention analysis.

Authors:Yuming Yang, Yang Nan, Junjie Ye, Shihan Dou, Xiao Wang, Shuo Li, Huijie Lv, Mingqi Wu, Tao Gui, Qi Zhang, Xuanjing Huang
Title: Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric
Abstract:
Data diversity is crucial for the instruction tuning of large language models. Existing studies have explored various diversity-aware data selection methods to construct high-quality datasets and enhance model performance. However, the fundamental problem of precisely defining and measuring data diversity remains underexplored, limiting clear guidance for data engineering. To address this, we systematically analyze 11 existing diversity measurement methods by evaluating their correlation with model performance through extensive fine-tuning experiments. Our results indicate that a reliable diversity measure should properly account for both inter-sample differences and the information density in the sample space. Building on this, we propose NovelSum, a new diversity metric based on sample-level "novelty." Experiments on both simulated and real-world data show that NovelSum accurately captures diversity variations and achieves a 0.97 correlation with instruction-tuned model performance, highlighting its value in guiding data engineering practices. With NovelSum as an optimization objective, we further develop a greedy, diversity-oriented data selection strategy that outperforms existing approaches, validating both the effectiveness and practical significance of our metric. The code is available at https://github.com/UmeanNever/NovelSum.
中文摘要:本研究提出了NovelSum这一新颖的多样性度量方法,通过衡量样本层面的新颖性有效关联模型性能,并通过优于现有方法的数据选择策略验证了其实际应用价值。
English Summary: This study introduces NovelSum, a novel diversity metric that effectively correlates with model performance by measuring sample-level novelty, and demonstrates its practical value through a data selection strategy that outperforms existing methods.

Authors:Huanghai Liu, Quzhe Huang, Qingjing Chen, Yiran Hu, Jiayu Ma, Yun Liu, Weixing Shen, Yansong Feng
Title: JUREX-4E: Juridical Expert-Annotated Four-Element Knowledge Base for Legal Reasoning
Abstract:
In recent years, Large Language Models (LLMs) have been widely applied to legal tasks. To enhance their understanding of legal texts and improve reasoning accuracy, a promising approach is to incorporate legal theories. One of the most widely adopted theories is the Four-Element Theory (FET), which defines the crime constitution through four elements: Subject, Object, Subjective Aspect, and Objective Aspect. While recent work has explored prompting LLMs to follow FET, our evaluation demonstrates that LLM-generated four-elements are often incomplete and less representative, limiting their effectiveness in legal reasoning. To address these issues, we present JUREX-4E, an expert-annotated four-element knowledge base covering 155 criminal charges. The annotations follow a progressive hierarchical framework grounded in legal source validity and incorporate diverse interpretive methods to ensure precision and authority. We evaluate JUREX-4E on the Similar Charge Disambiguation task and apply it to Legal Case Retrieval. Experimental results validate the high quality of JUREX-4E and its substantial impact on downstream legal tasks, underscoring its potential for advancing legal AI applications. The dataset and code are available at: https://github.com/THUlawtech/JUREX
中文摘要:为提升大语言模型的法律推理能力,JUREX-4E基于四要件理论构建了专家标注知识库,在相似罪名辨析和法律案例检索等任务中展现出显著效果。
English Summary: To improve legal reasoning in Large Language Models, JUREX-4E introduces an expert-annotated knowledge base based on the Four-Element Theory, significantly enhancing performance in legal tasks like charge disambiguation and case retrieval.

Authors:María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico
Title: MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
Abstract:
Automatic evaluation of retrieval augmented generation (RAG) systems relies on fine-grained dimensions like faithfulness and relevance, as judged by expert human annotators. Meta-evaluation benchmarks support the development of automatic evaluators that correlate well with human judgement. However, existing benchmarks predominantly focus on English or use translated data, which fails to capture cultural nuances. A native approach provides a better representation of the end user experience. In this work, we develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG). Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs), which are then assessed by expert annotators for faithfulness and relevance. We describe our annotation process and show that it achieves high inter-annotator agreement. We then analyse the performance of the answer-generating LLMs across languages as per the human evaluators. Finally we apply the dataset to our main use-case which is to benchmark multilingual automatic evaluators (LLM-as-a-judge). We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs. Our dataset is available at https://github.com/amazon-science/MEMERAG
中文:MEMERAG基准通过原生多语言方法评估检索增强生成系统,利用专家对忠实性和相关性的标注捕捉文化细微差异,为自动评估器提供可靠的多语言性能衡量标准。
English: The MEMERAG benchmark introduces a native multilingual approach to evaluate retrieval augmented generation systems, capturing cultural nuances and enabling reliable assessment of automatic evaluators through expert human annotations of faithfulness and relevance.

Authors:Fanhu Zeng, Haiyang Guo, Fei Zhu, Li Shen, Hao Tang
Title: RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness
Abstract:
Fine-tuning pre-trained models with custom data leads to numerous expert models on specific tasks. Merging models into one universal model to empower multi-task ability refraining from data leakage has gained popularity. With the expansion in data and model size, parameter-efficient tuning becomes the common practice for obtaining task-specific models efficiently. However, few methods are dedicated to efficient merging, and existing methods designed for full fine-tuning merging fail under efficient merging. To address the issue, we analyze from low-rank decomposition and reveal that direction robustness during merging is crucial for merging efficient modules. We furthermore uncover that compensating for the gap between stark singular values contributes to direction robustness. Therefore, we propose RobustMerge, a training-free parameter-efficient merging method with complementary parameter adaptation to maintain direction robustness. Specifically, we (1) prune parameters and scale coefficients from inter-parameter relation for singular values to maintain direction stability away from task interference, and (2) perform cross-task normalization to enhance unseen task generalization. We establish a benchmark consisting of diverse multimodal tasks, on which we conduct experiments to certify the outstanding performance and generalizability of our method. Additional studies and extensive analyses further showcase the effectiveness. Code is available at https://github.com/AuroraZengfh/RobustMerge.
中文摘要:本文提出RobustMerge方法,通过奇异值补偿和跨任务归一化确保方向鲁棒性,无需训练即可高效合并参数优化模型,并在多模态任务中验证了其卓越性能。
English Summary: The paper introduces RobustMerge, a training-free method for efficiently merging parameter-tuned models by ensuring direction robustness through singular value compensation and cross-task normalization, validated across diverse multimodal tasks.

Authors:Canyu Zhao, Yanlong Sun, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen
Title: DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
Abstract:
This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs.\ 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. Consequently, DICEPTION can be trained with substantially lower computational costs than conventional models requiring training from scratch. Furthermore, adapting DICEPTION to novel tasks is highly efficient, necessitating fine-tuning on as few as 50 images and approximately 1% of its parameters. Finally, we demonstrate that a subtle application of classifier-free guidance can improve the model's performance on depth and normal estimation. We also show that pixel-aligned training, as is characteristic of perception tasks, significantly enhances the model's ability to preserve fine details. DICEPTION offers valuable insights and presents a promising direction for the development of advanced diffusion-based visual generalist models. Code and Model: https://github.com/aim-uofa/Diception
中文: 本文提出了DICEPTION视觉通用模型,通过利用预训练扩散模型并保留其先验知识,能以极少数据和计算资源高效处理多种感知任务。
English: This paper introduces DICEPTION, a robust visual generalist model that efficiently tackles multiple perception tasks with minimal data and computational resources by leveraging pre-trained diffusion models while preserving their prior knowledge.

Authors:Zhong Li, Qi Huang, Lincen Yang, Jiayang Shi, Zhao Yang, Niki van Stein, Thomas Bäck, Matthijs van Leeuwen
Title: Diffusion Models for Tabular Data: Challenges, Current Progress, and Future Directions
Abstract:
In recent years, generative models have achieved remarkable performance across diverse applications, including image generation, text synthesis, audio creation, video generation, and data augmentation. Diffusion models have emerged as superior alternatives to Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) by addressing their limitations, such as training instability, mode collapse, and poor representation of multimodal distributions. This success has spurred widespread research interest. In the domain of tabular data, diffusion models have begun to showcase similar advantages over GANs and VAEs, achieving significant performance breakthroughs and demonstrating their potential for addressing unique challenges in tabular data modeling. However, while domains like images and time series have numerous surveys summarizing advancements in diffusion models, there remains a notable gap in the literature for tabular data. Despite the increasing interest in diffusion models for tabular data, there has been little effort to systematically review and summarize these developments. This lack of a dedicated survey limits a clear understanding of the challenges, progress, and future directions in this critical area. This survey addresses this gap by providing a comprehensive review of diffusion models for tabular data. Covering works from June 2015, when diffusion models emerged, to December 2024, we analyze nearly all relevant studies, with updates maintained in a \href{https://github.com/Diffusion-Model-Leiden/awesome-diffusion-models-for-tabular-data}{GitHub repository}. Assuming readers possess foundational knowledge of statistics and diffusion models, we employ mathematical formulations to deliver a rigorous and detailed review, aiming to promote developments in this emerging and exciting area.
Chinese: 扩散模型已超越GAN和VAE,解决了其训练不稳定等局限,并在表格数据领域展现出巨大潜力;本综述填补了该领域系统评述的空白,全面回顾了2015至2024年的相关研究。
English: Diffusion models have surpassed GANs and VAEs in addressing their limitations and shown exceptional potential in tabular data applications, yet a comprehensive review was lacking until this survey systematically analyzed relevant studies from 2015 to 2024.

Authors:Yinchuan Li, Xinyu Shao, Jianping Zhang, Haozhi Wang, Leo Maxime Brunswic, Kaiwen Zhou, Jiqian Dong, Kaiyang Guo, Xiu Li, Zhitang Chen, Jun Wang, Jianye Hao
Title: Generative Models in Decision Making: A Survey
Abstract:
In recent years, the exceptional performance of generative models in generative tasks has sparked significant interest in their integration into decision-making processes. Due to their ability to handle complex data distributions and their strong model capacity, generative models can be effectively incorporated into decision-making systems by generating trajectories that guide agents toward high-reward state-action regions or intermediate sub-goals. This paper presents a comprehensive review of the application of generative models in decision-making tasks. We classify seven fundamental types of generative models: energy-based models, generative adversarial networks, variational autoencoders, normalizing flows, diffusion models, generative flow networks, and autoregressive models. Regarding their applications, we categorize their functions into three main roles: controllers, modelers and optimizers, and discuss how each role contributes to decision-making. Furthermore, we examine the deployment of these models across five critical real-world decision-making scenarios. Finally, we summarize the strengths and limitations of current approaches and propose three key directions for advancing next-generation generative directive models: high-performance algorithms, large-scale generalized decision-making models, and self-evolving and adaptive models.
Chinese: 本文全面综述了生成模型在决策任务中作为控制器、建模器和优化器的应用,并提出了高性能算法、大规模通用决策模型及自适应模型三大发展方向。
English: This paper comprehensively reviews how generative models enhance decision-making by serving as controllers, modelers, and optimizers across various applications, while outlining future directions for high-performance, scalable, and adaptive systems.

Authors:Zekun Wang, Mingyang Yi, Shuchen Xue, Zhenguo Li, Ming Liu, Bing Qin, Zhi-Ming Ma
Title: Improved Diffusion-based Generative Model with Better Adversarial Robustness
Abstract:
Diffusion Probabilistic Models (DPMs) have achieved significant success in generative tasks. However, their training and sampling processes suffer from the issue of distribution mismatch. During the denoising process, the input data distributions differ between the training and inference stages, potentially leading to inaccurate data generation. To obviate this, we analyze the training objective of DPMs and theoretically demonstrate that this mismatch can be alleviated through Distributionally Robust Optimization (DRO), which is equivalent to performing robustness-driven Adversarial Training (AT) on DPMs. Furthermore, for the recently proposed Consistency Model (CM), which distills the inference process of the DPM, we prove that its training objective also encounters the mismatch issue. Fortunately, this issue can be mitigated by AT as well. Based on these insights, we propose to conduct efficient AT on both DPM and CM. Finally, extensive empirical studies validate the effectiveness of AT in diffusion-based models. The code is available at https://github.com/kugwzk/AT_Diff.
中文摘要:扩散概率模型在训练与推理阶段存在分布不匹配问题,通过对抗性训练可有效缓解该问题,提升生成数据的准确性。
English Summary: Diffusion Probabilistic Models face distribution mismatch during training and inference, which can be effectively addressed through Adversarial Training to improve generation accuracy.

Authors:Linian Wang, Leye Wang
Title: Forgetting Any Data at Any Time: A Theoretically Certified Unlearning Framework for Vertical Federated Learning
Abstract:
Privacy concerns in machine learning are heightened by regulations such as the GDPR, which enforces the "right to be forgotten" (RTBF), driving the emergence of machine unlearning as a critical research field. Vertical Federated Learning (VFL) enables collaborative model training by aggregating a sample's features across distributed parties while preserving data privacy at each source. This paradigm has seen widespread adoption in healthcare, finance, and other privacy-sensitive domains. However, existing VFL systems lack robust mechanisms to comply with RTBF requirements, as unlearning methodologies for VFL remain underexplored. In this work, we introduce the first VFL framework with theoretically guaranteed unlearning capabilities, enabling the removal of any data at any time. Unlike prior approaches -- which impose restrictive assumptions on model architectures or data types for removal -- our solution is model- and data-agnostic, offering universal compatibility. Moreover, our framework supports asynchronous unlearning, eliminating the need for all parties to be simultaneously online during the forgetting process. These advancements address critical gaps in current VFL systems, ensuring compliance with RTBF while maintaining operational flexibility.We make all our implementations publicly available at https://github.com/wangln19/vertical-federated-unlearning.
中文摘要:本文提出了首个具备理论保障遗忘能力的纵向联邦学习框架,能够随时删除任意数据且不依赖特定模型或数据类型,同时支持异步操作,有效解决了现有系统难以满足"被遗忘权"合规要求的关键问题。
English Summary: This paper introduces the first Vertical Federated Learning (VFL) framework with theoretically guaranteed unlearning capabilities, enabling model- and data-agnostic removal of any data at any time while supporting asynchronous operations to ensure compliance with the "right to be forgotten."

Authors:Sijia Yao, Pengcheng Huang, Zhenghao Liu, Yu Gu, Yukun Yan, Shi Yu, Ge Yu
Title: ExpandR: Teaching Dense Retrievers Beyond Queries with LLM Guidance
Abstract:
Large language models (LLMs) have demonstrated significant potential in enhancing dense retrieval through query augmentation. However, most existing methods treat the LLM and the retriever as separate modules, overlooking the alignment between generation and ranking objectives. In this work, we propose ExpandR, a unified LLM-augmented dense retrieval framework that jointly optimizes both the LLM and the retriever. ExpandR employs the LLM to generate semantically rich query expansions, which are leveraged to enhance the retriever's training. Simultaneously, the LLM is trained using Direct Preference Optimization (DPO), guided by a carefully designed reward function that balances retrieval effectiveness and generation consistency. This joint optimization paradigm enables mutual adaptation between the LLM and the retriever, resulting in query expansions that are both informative and well-suited for retrieval. Experimental results on multiple benchmarks show that ExpandR consistently outperforms strong baselines, achieving more than a 5% improvement in retrieval performance. All codes are available at https://github.com/NEUIR/ExpandR.
中文: ExpandR框架通过联合优化大语言模型和密集检索器,利用平衡检索效果与生成一致性的奖励函数指导模型生成语义丰富的查询扩展,在多个基准测试中实现了超过5%的性能提升。
English: The proposed ExpandR framework unifies large language models and dense retrievers through joint optimization, where the LLM generates query expansions enhanced by a reward function balancing retrieval effectiveness and generation consistency, achieving over 5% performance improvement on benchmarks.

Authors:Tianjin Huang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Tianlong Chen, Lu Liu, Qingsong Wen, Zhangyang Wang, Shiwei Liu
Title: Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam
Abstract:
This paper comprehensively evaluates several recently proposed optimizers for 4-bit training, revealing that low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms, leading to divergence at higher learning rates. Among these, SPAM, a recent optimizer featuring momentum reset and spike-aware gradient clipping, achieves the best performance across various bit levels, but struggles to stabilize gradient norms, requiring careful learning rate tuning. To address these limitations, we propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. In particular, Stable-SPAM (1) adaptively updates the clipping threshold for spiked gradients by tracking their historical maxima; (2) normalizes the entire gradient matrix based on its historical $l_2$-norm statistics; and $(3)$ inherits momentum reset from SPAM to periodically reset the first and second moments of Adam, mitigating the accumulation of spiked gradients. Extensive experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit LLM training, delivering superior performance compared to Adam and SPAM. Notably, our 4-bit LLaMA-1B model trained with Stable-SPAM outperforms the BF16 LLaMA-1B trained with Adam by up to $2$ perplexity. Furthermore, when both models are trained in 4-bit, Stable-SPAM achieves the same loss as Adam while requiring only about half the training steps. Code is available at https://github.com/TianjinYellow/StableSPAM.git.
中文: 本文提出的Stable-SPAM优化器通过自适应梯度裁剪和归一化技术,有效稳定了4位训练中的梯度范数,在困惑度和收敛速度上均优于Adam和SPAM等现有方法。
English: This paper introduces Stable-SPAM, an enhanced optimizer that stabilizes gradient norms in 4-bit training through adaptive gradient clipping and normalization, outperforming existing methods like Adam and SPAM by achieving lower perplexity and faster convergence.

Authors:Boris Shirokikh, Anvar Kurmukov, Mariia Donskova, Valentin Samokhin, Mikhail Belyaev, Ivan Oseledets
Title: M3DA: Benchmark for Unsupervised Domain Adaptation in 3D Medical Image Segmentation
Abstract:
Domain shift presents a significant challenge in applying Deep Learning to the segmentation of 3D medical images from sources like Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). Although numerous Domain Adaptation methods have been developed to address this issue, they are often evaluated under impractical data shift scenarios. Specifically, the medical imaging datasets used are often either private, too small for robust training and evaluation, or limited to single or synthetic tasks. To overcome these limitations, we introduce a M3DA /"mEd@/ benchmark comprising four publicly available, multiclass segmentation datasets. We have designed eight domain pairs featuring diverse and practically relevant distribution shifts. These include inter-modality shifts between MRI and CT and intra-modality shifts among various MRI acquisition parameters, different CT radiation doses, and presence/absence of contrast enhancement in images. Within the proposed benchmark, we evaluate more than ten existing domain adaptation methods. Our results show that none of them can consistently close the performance gap between the domains. For instance, the most effective method reduces the performance gap by about 62% across the tasks. This highlights the need for developing novel domain adaptation algorithms to enhance the robustness and scalability of deep learning models in medical imaging. We made our M3DA benchmark publicly available: https://github.com/BorisShirokikh/M3DA.
中文: M3DA基准通过引入八个具有现实分布差异的多样化领域对,解决了当前三维医学图像分割领域适应方法的局限性,结果表明现有技术无法持续缩小领域间性能差距,凸显了开发更强健算法的必要性。
English: The M3DA benchmark addresses limitations in current domain adaptation methods for 3D medical image segmentation by introducing eight diverse domain pairs with realistic shifts, revealing that existing techniques fail to consistently bridge performance gaps and highlighting the need for more robust algorithms.

Authors:Maksim Zhdanov, Max Welling, Jan-Willem van de Meent
Title: Erwin: A Tree-based Hierarchical Transformer for Large-scale Physical Systems
Abstract:
Large-scale physical systems defined on irregular grids pose significant scalability challenges for deep learning methods, especially in the presence of long-range interactions and multi-scale coupling. Traditional approaches that compute all pairwise interactions, such as attention, become computationally prohibitive as they scale quadratically with the number of nodes. We present Erwin, a hierarchical transformer inspired by methods from computational many-body physics, which combines the efficiency of tree-based algorithms with the expressivity of attention mechanisms. Erwin employs ball tree partitioning to organize computation, which enables linear-time attention by processing nodes in parallel within local neighborhoods of fixed size. Through progressive coarsening and refinement of the ball tree structure, complemented by a novel cross-ball interaction mechanism, it captures both fine-grained local details and global features. We demonstrate Erwin's effectiveness across multiple domains, including cosmology, molecular dynamics, PDE solving, and particle fluid dynamics, where it consistently outperforms baseline methods both in accuracy and computational efficiency.
中文摘要:Erwin是一种结合树算法效率与注意力机制表达力的分层变换器,通过球树分区实现线性时间计算,在保持多尺度交互能力的同时,在多个物理系统领域展现出卓越性能。
English summary: Erwin is a hierarchical transformer that combines tree-based efficiency with attention expressivity, enabling linear-time computation while capturing multi-scale interactions across diverse physical systems.

Authors:Bruno Puri, Aakriti Jain, Elena Golimblevskaia, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Title: FADE: Why Bad Descriptions Happen to Good Features
Abstract:
Recent advances in mechanistic interpretability have highlighted the potential of automating interpretability pipelines in analyzing the latent representations within LLMs. While this may enhance our understanding of internal mechanisms, the field lacks standardized evaluation methods for assessing the validity of discovered features. We attempt to bridge this gap by introducing FADE: Feature Alignment to Description Evaluation, a scalable model-agnostic framework for automatically evaluating feature-to-description alignment. FADE evaluates alignment across four key metrics - Clarity, Responsiveness, Purity, and Faithfulness - and systematically quantifies the causes of the misalignment between features and their descriptions. We apply FADE to analyze existing open-source feature descriptions and assess key components of automated interpretability pipelines, aiming to enhance the quality of descriptions. Our findings highlight fundamental challenges in generating feature descriptions, particularly for SAEs compared to MLP neurons, providing insights into the limitations and future directions of automated interpretability. We release FADE as an open-source package at: https://github.com/brunibrun/FADE
中文摘要:本文提出FADE框架,用于评估自动化可解释性流程中特征与描述的匹配度,旨在弥补标准化评估方法的缺失,并揭示生成精确描述所面临的核心挑战。
English Summary: The paper introduces FADE, a scalable framework for evaluating feature-description alignment in automated interpretability pipelines, addressing the lack of standardized evaluation methods and highlighting challenges in generating accurate descriptions.

Authors:Valentin Wagner, Sebastian Bullinger, Christoph Bodensteiner, Michael Arens
Title: Semantic Neural Radiance Fields for Multi-Date Satellite Data
Abstract:
In this work we propose a satellite specific Neural Radiance Fields (NeRF) model capable to obtain a three-dimensional semantic representation (neural semantic field) of the scene. The model derives the output from a set of multi-date satellite images with corresponding pixel-wise semantic labels. We demonstrate the robustness of our approach and its capability to improve noisy input labels. We enhance the color prediction by utilizing the semantic information to address temporal image inconsistencies caused by non-stationary categories such as vehicles. To facilitate further research in this domain, we present a dataset comprising manually generated labels for popular multi-view satellite images. Our code and dataset are available at https://github.com/wagnva/semantic-nerf-for-satellite-data.
中文: 本研究提出了一种卫星专用的神经辐射场模型,利用多时相卫星图像构建三维语义场景,通过语义信息提升标签精度并解决因动态物体导致的时序不一致问题。
English: This study introduces a satellite-specific Neural Radiance Fields model that constructs a 3D semantic representation from multi-date satellite images, enhancing label accuracy and addressing temporal inconsistencies through semantic integration.

Authors:Yida Lu, Jiale Cheng, Zhexin Zhang, Shiyao Cui, Cunxiang Wang, Xiaotao Gu, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
Title: LongSafety: Evaluating Long-Context Safety of Large Language Models
Abstract:
As Large Language Models (LLMs) continue to advance in understanding and generating long sequences, new safety concerns have been introduced through the long context. However, the safety of LLMs in long-context tasks remains under-explored, leaving a significant gap in both evaluation and improvement of their safety. To address this, we introduce LongSafety, the first comprehensive benchmark specifically designed to evaluate LLM safety in open-ended long-context tasks. LongSafety encompasses 7 categories of safety issues and 6 user-oriented long-context tasks, with a total of 1,543 test cases, averaging 5,424 words per context. Our evaluation towards 16 representative LLMs reveals significant safety vulnerabilities, with most models achieving safety rates below 55%. Our findings also indicate that strong safety performance in short-context scenarios does not necessarily correlate with safety in long-context tasks, emphasizing the unique challenges and urgency of improving long-context safety. Moreover, through extensive analysis, we identify challenging safety issues and task types for long-context models. Furthermore, we find that relevant context and extended input sequences can exacerbate safety risks in long-context scenarios, highlighting the critical need for ongoing attention to long-context safety challenges. Our code and data are available at https://github.com/thu-coai/LongSafety.
中文:LongSafety基准测试揭示了大语言模型在长上下文任务中存在显著安全漏洞,多数模型安全率低于55%,且扩展输入会加剧安全风险。
English: The LongSafety benchmark reveals significant safety vulnerabilities in large language models during long-context tasks, with most models scoring below 55% safety rates and demonstrating that extended inputs can exacerbate risks.

Authors:Md Saidul Hoque Anik, Ariful Azad
Title: SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations
Abstract:
Knowledge graph (KG) learning offers a powerful framework for generating new knowledge and making inferences. Training KG embedding can take a significantly long time, especially for larger datasets. Our analysis shows that the gradient computation of embedding is one of the dominant functions in the translation-based KG embedding training loop. We address this issue by replacing the core embedding computation with SpMM (Sparse-Dense Matrix Multiplication) kernels. This allows us to unify multiple scatter (and gather) operations as a single operation, reducing training time and memory usage. We create a general framework for training KG models using sparse kernels and implement four models, namely TransE, TransR, TransH, and TorusE. Our sparse implementations exhibit up to 5.3x speedup on the CPU and up to 4.2x speedup on the GPU with a significantly low GPU memory footprint. The speedups are consistent across large and small datasets for a given model. Our proposed sparse approach can be extended to accelerate other translation-based (such as TransC, TransM, etc.) and non-translational (such as DistMult, ComplEx, RotatE, etc.) models as well. An implementation of the SpTransX framework is publicly available as a Python package in https://github.com/HipGraph/SpTransX.
Chinese: 该研究提出了一种基于稀疏内核的框架,利用SpMM加速知识图谱嵌入训练,在多种模型上实现了CPU最高5.3倍、GPU最高4.2倍的加速效果,同时显著降低了内存占用。
English: The study introduces a sparse kernel-based framework using SpMM to accelerate knowledge graph embedding training, achieving up to 5.3x speedup on CPUs and 4.2x on GPUs while reducing memory usage across various models.

Authors:Hansung Choi, Daewon Seo
Title: Deep Minimax Classifiers for Imbalanced Datasets with a Small Number of Minority Samples
Abstract:
The concept of a minimax classifier is well-established in statistical decision theory, but its implementation via neural networks remains challenging, particularly in scenarios with imbalanced training data having a limited number of samples for minority classes. To address this issue, we propose a novel minimax learning algorithm designed to minimize the risk of worst-performing classes. Our algorithm iterates through two steps: a minimization step that trains the model based on a selected target prior, and a maximization step that updates the target prior towards the adversarial prior for the trained model. In the minimization, we introduce a targeted logit-adjustment loss function that efficiently identifies optimal decision boundaries under the target prior. Moreover, based on a new prior-dependent generalization bound that we obtained, we theoretically prove that our loss function has a better generalization capability than existing loss functions. During the maximization, we refine the target prior by shifting it towards the adversarial prior, depending on the worst-performing classes rather than on per-class risk estimates. Our maximization method is particularly robust in the regime of a small number of samples. Additionally, to adapt to overparameterized neural networks, we partition the entire training dataset into two subsets: one for model training during the minimization step and the other for updating the target prior during the maximization step. Our proposed algorithm has a provable convergence property, and empirical results indicate that our algorithm performs better than or is comparable to existing methods. All codes are publicly available at https://github.com/hansung-choi/TLA-linear-ascent.
Chinese: 我们提出了一种新的极小极大学习算法,通过迭代优化模型参数和目标先验来最小化最差类别的风险,在理论保证和实验验证下,对样本不平衡数据实现了更优的泛化能力和鲁棒性。
English: We introduce a novel minimax learning algorithm that iteratively optimizes model parameters and target priors to minimize worst-class risk, achieving superior generalization and robustness on imbalanced data with theoretical guarantees and empirical validation.

Authors:Farzad Beizaee, Gregory Lodygensky, Christian Desrosiers, Jose Dolz
Title: MAD-AD: Masked Diffusion for Unsupervised Brain Anomaly Detection
Abstract:
Unsupervised anomaly detection in brain images is crucial for identifying injuries and pathologies without access to labels. However, the accurate localization of anomalies in medical images remains challenging due to the inherent complexity and variability of brain structures and the scarcity of annotated abnormal data. To address this challenge, we propose a novel approach that incorporates masking within diffusion models, leveraging their generative capabilities to learn robust representations of normal brain anatomy. During training, our model processes only normal brain MRI scans and performs a forward diffusion process in the latent space that adds noise to the features of randomly-selected patches. Following a dual objective, the model learns to identify which patches are noisy and recover their original features. This strategy ensures that the model captures intricate patterns of normal brain structures while isolating potential anomalies as noise in the latent space. At inference, the model identifies noisy patches corresponding to anomalies and generates a normal counterpart for these patches by applying a reverse diffusion process. Our method surpasses existing unsupervised anomaly detection techniques, demonstrating superior performance in generating accurate normal counterparts and localizing anomalies. The code is available at hhttps://github.com/farzad-bz/MAD-AD.
中文: 本研究提出了一种基于掩码扩散模型的无监督脑部MRI异常检测新方法,通过在潜在空间中对图像块进行噪声添加与恢复来学习正常解剖结构的鲁棒表征,从而有效定位异常区域。
English: This study introduces a novel unsupervised anomaly detection method for brain MRI scans using masked diffusion models, which effectively localizes anomalies by learning robust representations of normal anatomy through noise addition and recovery in latent space patches.

Authors:Jiehao Luo, Jintao Cheng, Xiaoyu Tang, Qingwen Zhang, Bohuan Xue, Rui Fan
Title: MambaFlow: A Novel and Flow-guided State Space Model for Scene Flow Estimation
Abstract:
Scene flow estimation aims to predict 3D motion from consecutive point cloud frames, which is of great interest in autonomous driving field. Existing methods face challenges such as insufficient spatio-temporal modeling and inherent loss of fine-grained feature during voxelization. However, the success of Mamba, a representative state space model (SSM) that enables global modeling with linear complexity, provides a promising solution. In this paper, we propose MambaFlow, a novel scene flow estimation network with a mamba-based decoder. It enables deep interaction and coupling of spatio-temporal features using a well-designed backbone. Innovatively, we steer the global attention modeling of voxel-based features with point offset information using an efficient Mamba-based decoder, learning voxel-to-point patterns that are used to devoxelize shared voxel representations into point-wise features. To further enhance the model's generalization capabilities across diverse scenarios, we propose a novel scene-adaptive loss function that automatically adapts to different motion patterns.Extensive experiments on the Argoverse 2 benchmark demonstrate that MambaFlow achieves state-of-the-art performance with real-time inference speed among existing works, enabling accurate flow estimation in real-world urban scenarios. The code is available at https://github.com/SCNU-RISLAB/MambaFlow.
Chinese: MambaFlow提出了一种基于Mamba解码器的新型场景流估计网络,通过深度时空特征交互和自适应损失函数,在Argoverse 2基准测试中实现了最优性能并保持实时推理速度。
English: MambaFlow introduces a novel scene flow estimation network with a Mamba-based decoder that enables deep spatio-temporal feature interaction and achieves state-of-the-art performance on the Argoverse 2 benchmark with real-time inference.

Authors:Himanshu Beniwal, Sailesh Panda, Birudugadda Srivibhav, Mayank Singh
Title: Char-mander Use mBackdoor! A Study of Cross-lingual Backdoor Attacks in Multilingual LLMs
Abstract:
We explore \textbf{C}ross-lingual \textbf{B}ackdoor \textbf{AT}tacks (X-BAT) in multilingual Large Language Models (mLLMs), revealing how backdoors inserted in one language can automatically transfer to others through shared embedding spaces. Using toxicity classification as a case study, we demonstrate that attackers can compromise multilingual systems by poisoning data in a single language, with rare and high-occurring tokens serving as specific, effective triggers. Our findings expose a critical vulnerability that influences the model's architecture, resulting in a concealed backdoor effect during the information flow. Our code and data are publicly available https://github.com/himanshubeniwal/X-BAT.
中文: 本研究提出跨语言后门攻击(X-BAT),通过毒性分类案例证明攻击者仅需污染单一语言数据,即可利用共享嵌入空间使后门在多语言模型中跨语言传播,稀有词汇作为触发器会形成隐蔽的系统漏洞。
English: This study introduces Cross-lingual Backdoor Attacks (X-BAT), demonstrating how backdoors implanted in one language can propagate to others in multilingual models via shared embeddings, using toxicity classification to show how poisoning a single language with rare tokens creates hidden vulnerabilities.

Authors:Himanshu Beniwal, Sailesh Panda, Birudugadda Srivibhav, Mayank Singh
Title: Char-mander Use mBackdoor! A Study of Cross-lingual Backdoor Attacks in Multilingual LLMs
Abstract:
We explore \textbf{C}ross-lingual \textbf{B}ackdoor \textbf{AT}tacks (X-BAT) in multilingual Large Language Models (mLLMs), revealing how backdoors inserted in one language can automatically transfer to others through shared embedding spaces. Using toxicity classification as a case study, we demonstrate that attackers can compromise multilingual systems by poisoning data in a single language, with rare and high-occurring tokens serving as specific, effective triggers. Our findings expose a critical vulnerability that influences the model's architecture, resulting in a concealed backdoor effect during the information flow. Our code and data are publicly available https://github.com/himanshubeniwal/X-BAT.
中文: 本研究提出跨语言后门攻击(X-BAT),通过毒性分类案例证明攻击者仅需污染单一语言数据,即可利用共享嵌入空间使后门在多语言模型中跨语言传播,稀有词汇作为触发器会形成隐蔽的系统漏洞。
English: This study introduces Cross-lingual Backdoor Attacks (X-BAT), demonstrating how backdoors implanted in one language can propagate to others in multilingual models via shared embeddings, using toxicity classification to show how poisoning a single language with rare tokens creates hidden vulnerabilities.

Authors:Guoqi Yu, Yaoming Li, Juncheng Wang, Xiaoyu Guo, Angelica I. Aviles-Rivero, Tong Yang, Shujun Wang
Title: ReFocus: Reinforcing Mid-Frequency and Key-Frequency Modeling for Multivariate Time Series Forecasting
Abstract:
Recent advancements have progressively incorporated frequency-based techniques into deep learning models, leading to notable improvements in accuracy and efficiency for time series analysis tasks. However, the Mid-Frequency Spectrum Gap in the real-world time series, where the energy is concentrated at the low-frequency region while the middle-frequency band is negligible, hinders the ability of existing deep learning models to extract the crucial frequency information. Additionally, the shared Key-Frequency in multivariate time series, where different time series share indistinguishable frequency patterns, is rarely exploited by existing literature. This work introduces a novel module, Adaptive Mid-Frequency Energy Optimizer, based on convolution and residual learning, to emphasize the significance of mid-frequency bands. We also propose an Energy-based Key-Frequency Picking Block to capture shared Key-Frequency, which achieves superior inter-series modeling performance with fewer parameters. A novel Key-Frequency Enhanced Training strategy is employed to further enhance Key-Frequency modeling, where spectral information from other channels is randomly introduced into each channel. Our approach advanced multivariate time series forecasting on the challenging Traffic, ECL, and Solar benchmarks, reducing MSE by 4%, 6%, and 5% compared to the previous SOTA iTransformer. Code is available at this GitHub Repository: https://github.com/Levi-Ackman/ReFocus.
Chinese Summary: 本研究提出自适应中频能量优化器和基于能量的关键频率提取模块,解决了现实时间序列中的中频频谱间隙问题并利用多变量序列共享的关键频率模式,在多个基准测试中实现了最先进的预测性能并降低了均方误差。
English Summary: This study introduces an Adaptive Mid-Frequency Energy Optimizer and an Energy-based Key-Frequency Picking Block to address the mid-frequency spectrum gap and leverage shared key-frequency patterns in multivariate time series, achieving state-of-the-art forecasting performance with reduced mean squared error on multiple benchmarks.

Authors:Haoming Huang, Zhijian Qiao, Zehuan Yu, Chuhao Liu, Shaojie Shen, Fumin Zhang, Huan Yin
Title: SLABIM: A SLAM-BIM Coupled Dataset in HKUST Main Building
Abstract:
Existing indoor SLAM datasets primarily focus on robot sensing, often lacking building architectures. To address this gap, we design and construct the first dataset to couple the SLAM and BIM, named SLABIM. This dataset provides BIM and SLAM-oriented sensor data, both modeling a university building at HKUST. The as-designed BIM is decomposed and converted for ease of use. We employ a multi-sensor suite for multi-session data collection and mapping to obtain the as-built model. All the related data are timestamped and organized, enabling users to deploy and test effectively. Furthermore, we deploy advanced methods and report the experimental results on three tasks: registration, localization and semantic mapping, demonstrating the effectiveness and practicality of SLABIM. We make our dataset open-source at https://github.com/HKUST-Aerial-Robotics/SLABIM.
中文摘要:SLABIM数据集首次将SLAM与BIM数据结合,提供香港科技大学建筑的同步传感器数据与分解后的BIM模型,有效支持配准、定位和语义建图任务的测试验证。
English Summary: The SLABIM dataset is the first to integrate SLAM and BIM data from a university building, offering synchronized sensor data and decomposed BIM models for effective testing in registration, localization, and semantic mapping tasks.

Authors:Meilu Zhu, Qiushi Yang, Zhifan Gao, Yixuan Yuan, Jun Liu
Title: FedBM: Stealing Knowledge from Pre-trained Language Models for Heterogeneous Federated Learning
Abstract:
Federated learning (FL) has shown great potential in medical image computing since it provides a decentralized learning paradigm that allows multiple clients to train a model collaboratively without privacy leakage. However, current studies have shown that data heterogeneity incurs local learning bias in classifiers and feature extractors of client models during local training, leading to the performance degradation of a federation system. To address these issues, we propose a novel framework called Federated Bias eliMinating (FedBM) to get rid of local learning bias in heterogeneous federated learning (FL), which mainly consists of two modules, i.e., Linguistic Knowledge-based Classifier Construction (LKCC) and Concept-guided Global Distribution Estimation (CGDE). Specifically, LKCC exploits class concepts, prompts and pre-trained language models (PLMs) to obtain concept embeddings. These embeddings are used to estimate the latent concept distribution of each class in the linguistic space. Based on the theoretical derivation, we can rely on these distributions to pre-construct a high-quality classifier for clients to achieve classification optimization, which is frozen to avoid classifier bias during local training. CGDE samples probabilistic concept embeddings from the latent concept distributions to learn a conditional generator to capture the input space of the global model. Three regularization terms are introduced to improve the quality and utility of the generator. The generator is shared by all clients and produces pseudo data to calibrate updates of local feature extractors. Extensive comparison experiments and ablation studies on public datasets demonstrate the superior performance of FedBM over state-of-the-arts and confirm the effectiveness of each module, respectively. The code is available at https://github.com/CUHK-AIM-Group/FedBM.
中文: 联邦学习在医学影像中因数据异构性导致性能下降,FedBM框架通过基于语言知识的分类器构建和概念引导的分布估计生成伪数据,有效消除局部学习偏差并提升系统性能。
English: Federated learning in medical imaging faces performance issues due to data heterogeneity, which the proposed FedBM framework addresses by using linguistic knowledge to construct unbiased classifiers and concept-guided distribution estimation to calibrate local feature extractors with pseudo data.

Authors:Taeyoung Yun, Kiyoung Om, Jaewoo Lee, Sujin Yun, Jinkyoo Park
Title: Posterior Inference with Diffusion Models for High-dimensional Black-box Optimization
Abstract:
Optimizing high-dimensional and complex black-box functions is crucial in numerous scientific applications. While Bayesian optimization (BO) is a powerful method for sample-efficient optimization, it struggles with the curse of dimensionality and scaling to thousands of evaluations. Recently, leveraging generative models to solve black-box optimization problems has emerged as a promising framework. However, those methods often underperform compared to BO methods due to limited expressivity and difficulty of uncertainty estimation in high-dimensional spaces. To overcome these issues, we introduce \textbf{DiBO}, a novel framework for solving high-dimensional black-box optimization problems. Our method iterates two stages. First, we train a diffusion model to capture the data distribution and deep ensembles to predict function values with uncertainty quantification. Second, we cast the candidate selection as a posterior inference problem to balance exploration and exploitation in high-dimensional spaces. Concretely, we fine-tune diffusion models to amortize posterior inference. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines across synthetic and real-world tasks. Our code is publicly available \href{https://github.com/umkiyoung/DiBO}{here}.
中文: DiBO是一种新颖框架,结合扩散模型和深度集成方法,通过在高维空间中平衡探索与利用,有效解决黑盒优化问题,实验表明其性能优于现有先进方法。
English: DiBO is a novel framework that combines diffusion models and deep ensembles to effectively solve high-dimensional black-box optimization problems by balancing exploration and exploitation, outperforming existing methods in experiments.

Authors:Zijing Zhao, Jianlong Yu, Lin Zhang, Shunli Zhang
Title: CRTrack: Low-Light Semi-Supervised Multi-object Tracking Based on Consistency Regularization
Abstract:
Multi-object tracking under low-light environments is prevalent in real life. Recent years have seen rapid development in the field of multi-object tracking. However, due to the lack of datasets and the high cost of annotations, multi-object tracking under low-light environments remains a persistent challenge. In this paper, we focus on multi-object tracking under low-light conditions. To address the issues of limited data and the lack of dataset, we first constructed a low-light multi-object tracking dataset (LLMOT). This dataset comprises data from MOT17 that has been enhanced for nighttime conditions as well as multiple unannotated low-light videos. Subsequently, to tackle the high annotation costs and address the issue of image quality degradation, we propose a semi-supervised multi-object tracking method based on consistency regularization named CRTrack. First, we calibrate a consistent adaptive sampling assignment to replace the static IoU-based strategy, enabling the semi-supervised tracking method to resist noisy pseudo-bounding boxes. Then, we design a adaptive semi-supervised network update method, which effectively leverages unannotated data to enhance model performance. Dataset and Code: https://github.com/ZJZhao123/CRTrack.
中文: 本文针对低光环境下多目标跟踪的数据稀缺和标注成本高的问题,构建了低光多目标跟踪数据集LLMOT,并提出基于一致性正则化的半监督方法CRTrack,通过自适应采样和网络更新有效利用未标注数据提升模型性能。
English: This paper addresses the challenges of multi-object tracking in low-light conditions by introducing a new dataset (LLMOT) and proposing CRTrack, a semi-supervised method that uses consistency regularization to handle noisy data and improve performance with unannotated videos.

Authors:Ziyi Tang, Zechuan Chen, Jiarui Yang, Jiayao Mai, Yongsen Zheng, Keze Wang, Jinrui Chen, Liang Lin
Title: AlphaAgent: LLM-Driven Alpha Mining with Regularized Exploration to Counteract Alpha Decay
Abstract:
Alpha mining, a critical component in quantitative investment, focuses on discovering predictive signals for future asset returns in increasingly complex financial markets. However, the pervasive issue of alpha decay, where factors lose their predictive power over time, poses a significant challenge for alpha mining. Traditional methods like genetic programming face rapid alpha decay from overfitting and complexity, while approaches driven by Large Language Models (LLMs), despite their promise, often rely too heavily on existing knowledge, creating homogeneous factors that worsen crowding and accelerate decay. To address this challenge, we propose AlphaAgent, an autonomous framework that effectively integrates LLM agents with ad hoc regularizations for mining decay-resistant alpha factors. AlphaAgent employs three key mechanisms: (i) originality enforcement through a similarity measure based on abstract syntax trees (ASTs) against existing alphas, (ii) hypothesis-factor alignment via LLM-evaluated semantic consistency between market hypotheses and generated factors, and (iii) complexity control via AST-based structural constraints, preventing over-engineered constructions that are prone to overfitting. These mechanisms collectively guide the alpha generation process to balance originality, financial rationale, and adaptability to evolving market conditions, mitigating the risk of alpha decay. Extensive evaluations show that AlphaAgent outperforms traditional and LLM-based methods in mitigating alpha decay across bull and bear markets, consistently delivering significant alpha in Chinese CSI 500 and US S&P 500 markets over the past four years. Notably, AlphaAgent showcases remarkable resistance to alpha decay, elevating the potential for yielding powerful factors.
中文: AlphaAgent是一种自主框架,通过整合LLM智能体与针对性正则化,以原创性强化、假设与因子对齐及复杂度控制来挖掘抗衰减的阿尔法因子,在传统和基于LLM的方法基础上显著缓解了阿尔法衰减,并在不同市场中表现优异。
English: AlphaAgent is an autonomous framework that integrates LLM agents with specialized regularizations to mine decay-resistant alpha factors by enforcing originality, aligning hypotheses with factors, and controlling complexity, outperforming traditional and LLM-based methods in mitigating alpha decay across markets.

Authors:Liangtao Shi, Ting Liu, Xiantao Hu, Yue Hu, Quanjun Yin, Richang Hong
Title: SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding
Abstract:
Visual grounding aims to ground an image region through natural language, which heavily relies on cross-modal alignment. Most existing methods transfer visual/linguistic knowledge separately by fully fine-tuning uni-modal pre-trained models, followed by a simple stack of visual-language transformers for multimodal fusion. However, these approaches not only limit adequate interaction between visual and linguistic contexts, but also incur significant computational costs. Therefore, to address these issues, we explore a step-wise multimodal fusion and adaption framework, namely SwimVG. Specifically, SwimVG proposes step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA) for visual grounding, replacing the cumbersome transformer stacks for multimodal fusion. Swip can improve {the} alignment between the vision and language representations step by step, in a token-level fusion manner. In addition, weight-level CIA further promotes multimodal fusion by cross-modal interaction. Swip and CIA are both parameter-efficient paradigms, and they fuse the cross-modal features from shallow to deep layers gradually. Experimental results on four widely-used benchmarks demonstrate that SwimVG achieves remarkable abilities and considerable benefits in terms of efficiency. Our code is available at https://github.com/liuting20/SwimVG.
中文摘要:提出的SwimVG框架通过逐步多模态提示和跨模态交互适配器,以参数高效的方式增强视觉与语言的深层融合,在多个基准测试中实现了卓越性能并显著提升计算效率。
English Summary: The proposed SwimVG framework introduces step-wise multimodal prompts and cross-modal interactive adapters to enhance visual-linguistic alignment efficiently, achieving superior performance on benchmarks while reducing computational costs.

Authors:Yancheng Zhang, Jiaqi Xue, Mengxin Zheng, Mimi Xie, Mingzhe Zhang, Lei Jiang, Qian Lou
Title: CipherPrune: Efficient and Scalable Private Transformer Inference
Abstract:
Private Transformer inference using cryptographic protocols offers promising solutions for privacy-preserving machine learning; however, it still faces significant runtime overhead (efficiency issues) and challenges in handling long-token inputs (scalability issues). We observe that the Transformer's operational complexity scales quadratically with the number of input tokens, making it essential to reduce the input token length. Notably, each token varies in importance, and many inputs contain redundant tokens. Additionally, prior private inference methods that rely on high-degree polynomial approximations for non-linear activations are computationally expensive. Therefore, reducing the polynomial degree for less important tokens can significantly accelerate private inference. Building on these observations, we propose \textit{CipherPrune}, an efficient and scalable private inference framework that includes a secure encrypted token pruning protocol, a polynomial reduction protocol, and corresponding Transformer network optimizations. At the protocol level, encrypted token pruning adaptively removes unimportant tokens from encrypted inputs in a progressive, layer-wise manner. Additionally, encrypted polynomial reduction assigns lower-degree polynomials to less important tokens after pruning, enhancing efficiency without decryption. At the network level, we introduce protocol-aware network optimization via a gradient-based search to maximize pruning thresholds and polynomial reduction conditions while maintaining the desired accuracy. Our experiments demonstrate that CipherPrune reduces the execution overhead of private Transformer inference by approximately $6.1\times$ for 128-token inputs and $10.6\times$ for 512-token inputs, compared to previous methods, with only a marginal drop in accuracy. The code is publicly available at https://github.com/UCF-Lou-Lab-PET/cipher-prune-inference.
中文:CipherPrune框架通过加密令牌剪枝和多项式降阶协议,在保持精度的同时显著提升了私有Transformer推理的效率,实现了最高10.6倍的加速效果。
English: The proposed CipherPrune framework addresses efficiency and scalability challenges in private Transformer inference by securely pruning unimportant tokens and reducing polynomial approximations, achieving significant speedups with minimal accuracy loss.

Authors:Yaxuan Huang, Xili Dai, Jianan Wang, Xianbiao Qi, Yixing Yuan, Xiangyu Yue
Title: Unposed Sparse Views Room Layout Reconstruction in the Age of Pretrain Model
Abstract:
Room layout estimation from multiple-perspective images is poorly investigated due to the complexities that emerge from multi-view geometry, which requires muti-step solutions such as camera intrinsic and extrinsic estimation, image matching, and triangulation. However, in 3D reconstruction, the advancement of recent 3D foundation models such as DUSt3R has shifted the paradigm from the traditional multi-step structure-from-motion process to an end-to-end single-step approach. To this end, we introduce Plane-DUSt3R, a novel method for multi-view room layout estimation leveraging the 3D foundation model DUSt3R. Plane-DUSt3R incorporates the DUSt3R framework and fine-tunes on a room layout dataset (Structure3D) with a modified objective to estimate structural planes. By generating uniform and parsimonious results, Plane-DUSt3R enables room layout estimation with only a single post-processing step and 2D detection results. Unlike previous methods that rely on single-perspective or panorama image, Plane-DUSt3R extends the setting to handle multiple-perspective images. Moreover, it offers a streamlined, end-to-end solution that simplifies the process and reduces error accumulation. Experimental results demonstrate that Plane-DUSt3R not only outperforms state-of-the-art methods on the synthetic dataset but also proves robust and effective on in the wild data with different image styles such as cartoon. Our code is available at: https://github.com/justacar/Plane-DUSt3R
Chinese: Plane-DUSt3R 利用三维基础模型 DUSt3R,提出了一种端到端的多视角房间布局估计方法,通过单一后处理步骤简化流程,并在合成与真实数据上均超越了现有最优方法。
English: Plane-DUSt3R introduces an end-to-end method for multi-view room layout estimation by leveraging the 3D foundation model DUSt3R, streamlining the process with a single post-processing step and outperforming state-of-the-art methods on both synthetic and real-world data.

Authors:Zhexin Zhang, Leqi Lei, Junxiao Yang, Xijie Huang, Yida Lu, Shiyao Cui, Renmiao Chen, Qinglin Zhang, Xinyuan Wang, Hao Wang, Hao Li, Xianqi Lei, Chengwei Pan, Lei Sha, Hongning Wang, Minlie Huang
Title: AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
Abstract:
As AI models are increasingly deployed across diverse real-world scenarios, ensuring their safety remains a critical yet underexplored challenge. While substantial efforts have been made to evaluate and enhance AI safety, the lack of a standardized framework and comprehensive toolkit poses significant obstacles to systematic research and practical adoption. To bridge this gap, we introduce AISafetyLab, a unified framework and toolkit that integrates representative attack, defense, and evaluation methodologies for AI safety. AISafetyLab features an intuitive interface that enables developers to seamlessly apply various techniques while maintaining a well-structured and extensible codebase for future advancements. Additionally, we conduct empirical studies on Vicuna, analyzing different attack and defense strategies to provide valuable insights into their comparative effectiveness. To facilitate ongoing research and development in AI safety, AISafetyLab is publicly available at https://github.com/thu-coai/AISafetyLab, and we are committed to its continuous maintenance and improvement.
中文: 针对AI安全领域缺乏标准化框架的问题,我们推出了AISafetyLab这一集成攻击、防御和评估方法的统一工具包,其具备直观界面和基于Vicuna的实证研究,并已开源以支持持续研究。
English: AISafetyLab is introduced as a unified framework and toolkit to address the lack of standardization in AI safety by integrating attack, defense, and evaluation methods, featuring an intuitive interface and empirical studies on Vicuna, with public availability for ongoing research.

Authors:Qianli Ma, Dongrui Liu, Qian Chen, Linfeng Zhang, Jing Shao
Title: LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint
Abstract:
Fine-tuning pre-trained Large Language Models (LLMs) for specialized tasks incurs substantial computational and data costs. While model merging offers a training-free solution to integrate multiple task-specific models, existing methods suffer from safety-utility conflicts where enhanced general capabilities degrade safety safeguards. We identify two root causes: $\textbf{neuron misidentification}$ due to simplistic parameter magnitude-based selection, and $\textbf{cross-task neuron interference}$ during merging. To address these challenges, we propose $\textbf{LED-Merging}$, a three-stage framework that $\textbf{L}$ocates task-specific neurons via gradient-based attribution, dynamically $\textbf{E}$lects critical neurons through multi-model importance fusion, and $\textbf{D}$isjoints conflicting updates through parameter isolation. Extensive experiments on Llama-3-8B, Mistral-7B, and Llama2-13B demonstrate that LED-Merging effectively reduces harmful response rates, showing a 31.4\% decrease on Llama-3-8B-Instruct on HarmBench, while simultaneously preserving 95\% of utility performance, such as achieving 52.39\% accuracy on GSM8K. LED-Merging resolves safety-utility conflicts and provides a lightweight, training-free paradigm for constructing reliable multi-task LLMs. Code is available at $\href{https://github.com/MqLeet/LED-Merging}{GitHub}$.
Chinese: LED-Merging是一种无需训练的框架,通过精确定位和隔离任务特定神经元解决模型合并中的安全-效用冲突,在减少31.4%有害响应的同时保持95%的效用性能。
English: LED-Merging is a training-free framework that addresses safety-utility conflicts in model merging by precisely identifying and isolating task-specific neurons, reducing harmful responses by 31.4% while maintaining 95% of utility performance.

Authors:Joseph Suh, Erfan Jahanparast, Suhong Moon, Minwoo Kang, Serina Chang
Title: Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions
Abstract:
Large language models (LLMs) present novel opportunities in public opinion research by predicting survey responses in advance during the early stages of survey design. Prior methods steer LLMs via descriptions of subpopulations as LLMs' input prompt, yet such prompt engineering approaches have struggled to faithfully predict the distribution of survey responses from human subjects. In this work, we propose directly fine-tuning LLMs to predict response distributions by leveraging unique structural characteristics of survey data. To enable fine-tuning, we curate SubPOP, a significantly scaled dataset of 3,362 questions and 70K subpopulation-response pairs from well-established public opinion surveys. We show that fine-tuning on SubPOP greatly improves the match between LLM predictions and human responses across various subpopulations, reducing the LLM-human gap by up to 46% compared to baselines, and achieves strong generalization to unseen surveys and subpopulations. Our findings highlight the potential of survey-based fine-tuning to improve opinion prediction for diverse, real-world subpopulations and therefore enable more efficient survey designs. Our code is available at https://github.com/JosephJeesungSuh/subpop.
Chinese: 通过在精心构建的SubPOP数据集上对大语言模型进行微调,显著提高了预测人类调查回答分布的准确性,将模型预测与真实人类回答之间的差距缩小了高达46%,从而实现更高效的调查设计。
English: Fine-tuning large language models on the curated SubPOP dataset significantly improves the accuracy of predicting human survey response distributions, reducing the gap between model predictions and actual human responses by up to 46% and enabling more efficient survey design.

Authors:Hiruni Nuwanthika Kegalle, Danula Hettiachchi, Jeffrey Chan, Mark Sanderson, Flora D. Salim
Title: Watch Out E-scooter Coming Through: Multimodal Sensing of Mixed Traffic Use and Conflicts Through Riders Ego-centric Views
Abstract:
E-scooters are becoming a popular means of urban transportation. However, this increased popularity brings challenges, such as road accidents and conflicts when sharing space with traditional transport modes. An in-depth understanding of e-scooter rider behaviour is crucial for ensuring rider safety, guiding infrastructure planning, and enforcing traffic rules. This study investigated the rider behaviour through a naturalistic study with 23 participants equipped with a bike computer, eye-tracking glasses and cameras. They followed a pre-determined route, enabling multi-modal data collection. We analysed and compared gaze movements, speed, and video feeds across three transport infrastructure types: a pedestrian-shared path, a cycle lane and a roadway. Our findings reveal unique challenges e-scooter riders face, including difficulty keeping up with cyclists and motor vehicles due to speed limits on shared e-scooters, risks in signalling turns due to control lose, and limited acceptance in mixed-use spaces. The cycle lane showed the highest average speed, the least speed change points, and the least head movements, supporting its suitability as dedicated infrastructure for e-scooters. These findings are facilitated through multimodal sensing and analysing the e-scooter riders' ego-centric view, which show the efficacy of our method in discovering the behavioural dynamics of the riders in the wild. Our study highlights the critical need to align infrastructure with user behaviour to improve safety and emphasises the importance of targeted safety measures and regulations, especially when e-scooter riders share spaces with pedestrians or motor vehicles. The dataset and analysis code are available at https://github.com/HiruniNuwanthika/Electric-Scooter-Riders-Multi-Modal-Data-Analysis.git.
中文: 本研究通过多模态数据分析电动滑板车骑行行为,揭示了与交通流保持同步困难和转向信号风险等安全挑战,同时指出自行车道是最适合的基础设施。
English: This study uses multimodal data to analyze e-scooter rider behavior, revealing safety challenges like difficulty keeping pace with traffic and turn-signaling risks, while identifying cycle lanes as the most suitable infrastructure.

Authors:Avinandan Bose, Laurent Lessard, Maryam Fazel, Krishnamurthy Dj Dvijotham
Title: Keeping up with dynamic attackers: Certifying robustness to adaptive online data poisoning
Abstract:
The rise of foundation models fine-tuned on human feedback from potentially untrusted users has increased the risk of adversarial data poisoning, necessitating the study of robustness of learning algorithms against such attacks. Existing research on provable certified robustness against data poisoning attacks primarily focuses on certifying robustness for static adversaries who modify a fraction of the dataset used to train the model before the training algorithm is applied. In practice, particularly when learning from human feedback in an online sense, adversaries can observe and react to the learning process and inject poisoned samples that optimize adversarial objectives better than when they are restricted to poisoning a static dataset once, before the learning algorithm is applied. Indeed, it has been shown in prior work that online dynamic adversaries can be significantly more powerful than static ones. We present a novel framework for computing certified bounds on the impact of dynamic poisoning, and use these certificates to design robust learning algorithms. We give an illustration of the framework for the mean estimation and binary classification problems and outline directions for extending this in further work. The code to implement our certificates and replicate our results is available at https://github.com/Avinandan22/Certified-Robustness.
中文: 本研究提出了一种新颖框架,用于计算针对动态数据投毒攻击的认证边界,其中攻击者在学习过程中自适应地注入恶意样本,并通过均值估计和二元分类的应用展示了该框架如何提升算法的鲁棒性。
English: This study introduces a novel framework for computing certified bounds against dynamic data poisoning attacks, where adversaries adaptively inject malicious samples during the learning process, and demonstrates its application in mean estimation and binary classification to enhance algorithm robustness.

Authors:Siyuan Yao, Yunfei Lu, Chaoli Wang
Title: ViSNeRF: Efficient Multidimensional Neural Radiance Field Representation for Visualization Synthesis of Dynamic Volumetric Scenes
Abstract:
Domain scientists often face I/O and storage challenges when keeping raw data from large-scale simulations. Saving visualization images, albeit practical, is limited to preselected viewpoints, transfer functions, and simulation parameters. Recent advances in scientific visualization leverage deep learning techniques for visualization synthesis by offering effective ways to infer unseen visualizations when only image samples are given during training. However, due to the lack of 3D geometry awareness, existing methods typically require many training images and significant learning time to generate novel visualizations faithfully. To address these limitations, we propose ViSNeRF, a novel 3D-aware approach for visualization synthesis using neural radiance fields. Leveraging a multidimensional radiance field representation, ViSNeRF efficiently reconstructs visualizations of dynamic volumetric scenes from a sparse set of labeled image samples with flexible parameter exploration over transfer functions, isovalues, timesteps, or simulation parameters. Through qualitative and quantitative comparative evaluation, we demonstrate ViSNeRF's superior performance over several representative baseline methods, positioning it as the state-of-the-art solution. The code is available at https://github.com/JCBreath/ViSNeRF.
Chinese: ViSNeRF提出了一种基于神经辐射场的三维感知方法,能够从稀疏图像样本中高效合成动态体数据可视化,支持灵活的参数探索,并在质量和效率上超越现有方法。
English: ViSNeRF introduces a 3D-aware neural radiance field approach that efficiently synthesizes dynamic volumetric visualizations from sparse image samples, enabling flexible parameter exploration and outperforming existing methods in quality and efficiency.

Authors:Vladimir Makharev, Vladimir Ivanov
Title: Code Summarization Beyond Function Level
Abstract:
Code summarization is a critical task in natural language processing and software engineering, which aims to generate concise descriptions of source code. Recent advancements have improved the quality of these summaries, enhancing code readability and maintainability. However, the content of a repository or a class has not been considered in function code summarization. This study investigated the effectiveness of code summarization models beyond the function level, exploring the impact of class and repository contexts on the summary quality. The study involved revising benchmarks for evaluating models at class and repository levels, assessing baseline models, and evaluating LLMs with in-context learning to determine the enhancement of summary quality with additional context. The findings revealed that the fine-tuned state-of-the-art CodeT5+ base model excelled in code summarization, while incorporating few-shot learning and retrieved code chunks from RAG significantly enhanced the performance of LLMs in this task. Notably, the Deepseek Coder 1.3B and Starcoder2 15B models demonstrated substantial improvements in metrics such as BLEURT, METEOR, and BLEU-4 at both class and repository levels. Repository-level summarization exhibited promising potential but necessitates significant computational resources and gains from the inclusion of structured context. Lastly, we employed the recent SIDE code summarization metric in our evaluation. This study contributes to refining strategies for prompt engineering, few-shot learning, and RAG, addressing gaps in benchmarks for code summarization at various levels. Finally, we publish all study details, code, datasets, and results of evaluation in the GitHub repository available at https://github.com/kilimanj4r0/code-summarization-beyond-function-level.
中文: 本研究探索了超越函数级别的代码摘要,发现融入类和仓库上下文可显著提升摘要质量,其中微调模型与检索增强生成技术展现出明显优势。
English: This study explores code summarization beyond the function level, revealing that incorporating class and repository contexts significantly enhances summary quality, with fine-tuned models and retrieval-augmented generation showing notable improvements.

Authors:Rui Li, Xiaowei Zhao
Title: AeroReformer: Aerial Referring Transformer for UAV-based Referring Image Segmentation
Abstract:
As a novel and challenging task, referring segmentation combines computer vision and natural language processing to localize and segment objects based on textual descriptions. While referring image segmentation (RIS) has been extensively studied in natural images, little attention has been given to aerial imagery, particularly from unmanned aerial vehicles (UAVs). The unique challenges of UAV imagery, including complex spatial scales, occlusions, and varying object orientations, render existing RIS approaches ineffective. A key limitation has been the lack of UAV-specific datasets, as manually annotating pixel-level masks and generating textual descriptions is labour-intensive and time-consuming. To address this gap, we design an automatic labelling pipeline that leverages pre-existing UAV segmentation datasets and Multimodal Large Language Models (MLLM) for generating textual descriptions. Furthermore, we propose Aerial Referring Transformer (AeroReformer), a novel framework for UAV referring image segmentation (UAV-RIS), featuring a Vision-Language Cross-Attention Module (VLCAM) for effective cross-modal understanding and a Rotation-Aware Multi-Scale Fusion (RAMSF) decoder to enhance segmentation accuracy in aerial scenes. Extensive experiments on two newly developed datasets demonstrate the superiority of AeroReformer over existing methods, establishing a new benchmark for UAV-RIS. The datasets and code will be publicly available at: https://github.com/lironui/AeroReformer.
中文: 本文提出了AeroReformer这一无人机指代图像分割新框架,通过跨模态注意力模块和旋转感知解码器解决航空影像的特殊挑战,在新建数据集上实现了最优性能。
English: This paper introduces AeroReformer, a novel framework for UAV referring image segmentation that addresses the unique challenges of aerial imagery through a cross-modal attention module and a rotation-aware decoder, achieving state-of-the-art performance on newly developed datasets.

Authors:Haiteng Zhao, Chang Ma, Fangzhi Xu, Lingpeng Kong, Zhi-Hong Deng
Title: BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning
Abstract:
The applications of large language models (LLMs) in various biological domains have been explored recently, but their reasoning ability in complex biological systems, such as pathways, remains underexplored, which is crucial for predicting biological phenomena, formulating hypotheses, and designing experiments. This work explores the potential of LLMs in pathway reasoning. We introduce BioMaze, a dataset with 5.1K complex pathway problems derived from real research, covering various biological contexts including natural dynamic changes, disturbances, additional intervention conditions, and multi-scale research targets. Our evaluation of methods such as CoT and graph-augmented reasoning, shows that LLMs struggle with pathway reasoning, especially in perturbed systems. To address this, we propose PathSeeker, an LLM agent that enhances reasoning through interactive subgraph-based navigation, enabling a more effective approach to handling the complexities of biological systems in a scientifically aligned manner. The dataset and code are available at https://github.com/zhao-ht/BioMaze.
中文摘要:本研究探索大语言模型在生物通路推理中的潜力,揭示了其在复杂系统中的局限性,并提出PathSeeker这一通过基于子图的交互导航来提升推理能力的智能体。
English Summary: This study investigates the potential of large language models (LLMs) in biological pathway reasoning, revealing their limitations in complex systems and proposing PathSeeker, an interactive agent that improves reasoning through subgraph-based navigation.

Authors:Chenlong Wang, Zhaoyang Chu, Zhengxiang Cheng, Xuyi Yang, Kaiyue Qiu, Yao Wan, Zhou Zhao, Xuanhua Shi, Dongping Chen
Title: CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale
Abstract:
Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly regarding the frequent updates of third-party library APIs. This limitation, stemming from static pre-training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. To this end, this paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time code knowledge updates from Python third-party libraries. Building upon CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs' ability to stay synchronized with code evolution, which covers real-world updates for 220 APIs from six Python libraries. Our benchmark offers 3,300 test cases across three evaluation tasks and an update-aware instruction tuning dataset consisting of 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge updating methods (e.g., DPO, ORPO, and SimPO). We believe that our benchmark can offer a strong foundation for the development of more effective methods for real-time code knowledge updating in the future. The experimental code and dataset are publicly available at: https://github.com/Lucky-voyage/Code-Sync.
中文摘要:本文提出CODESYNC数据引擎和CODESYNCBENCH基准测试,旨在解决大语言模型在适应持续演变的代码知识方面的不足,发现即使采用先进更新方法,现有模型仍难以应对动态API变更。
English Summary: This paper introduces CODESYNC and CODESYNCBENCH to address LLMs' limitations in adapting to evolving code knowledge, revealing that even advanced models struggle with dynamic API updates despite comprehensive benchmarking and training datasets.

Authors:Ruichu Cai, Junxian Huang, Zhenhui Yang, Zijian Li, Emadeldeen Eldele, Min Wu, Fuchun Sun
Title: Time Series Domain Adaptation via Latent Invariant Causal Mechanism
Abstract:
Time series domain adaptation aims to transfer the complex temporal dependence from the labeled source domain to the unlabeled target domain. Recent advances leverage the stable causal mechanism over observed variables to model the domain-invariant temporal dependence. However, modeling precise causal structures in high-dimensional data, such as videos, remains challenging. Additionally, direct causal edges may not exist among observed variables (e.g., pixels). These limitations hinder the applicability of existing approaches to real-world scenarios. To address these challenges, we find that the high-dimension time series data are generated from the low-dimension latent variables, which motivates us to model the causal mechanisms of the temporal latent process. Based on this intuition, we propose a latent causal mechanism identification framework that guarantees the uniqueness of the reconstructed latent causal structures. Specifically, we first identify latent variables by utilizing sufficient changes in historical information. Moreover, by enforcing the sparsity of the relationships of latent variables, we can achieve identifiable latent causal structures. Built on the theoretical results, we develop the Latent Causality Alignment (LCA) model that leverages variational inference, which incorporates an intra-domain latent sparsity constraint for latent structure reconstruction and an inter-domain latent sparsity constraint for domain-invariant structure reconstruction. Experiment results on eight benchmarks show a general improvement in the domain-adaptive time series classification and forecasting tasks, highlighting the effectiveness of our method in real-world scenarios. Codes are available at https://github.com/DMIRLAB-Group/LCA.
中文摘要:本文提出一种潜在因果对齐(LCA)框架,通过建模潜在变量的因果机制来获得可识别的因果结构,在八个基准测试的时间序列领域自适应任务中展现出优越性能。
English Summary: This paper introduces a Latent Causality Alignment (LCA) framework that models causal mechanisms in latent variables to achieve identifiable causal structures, demonstrating improved performance in time series domain adaptation tasks across eight benchmarks.

Authors:Mohamed Bayan Kmainasi, Abul Hasnat, Md Arid Hasan, Ali Ezzat Shahroor, Firoj Alam
Title: MemeIntel: Explainable Detection of Propagandistic and Hateful Memes
Abstract:
The proliferation of multimodal content on social media presents significant challenges in understanding and moderating complex, context-dependent issues such as misinformation, hate speech, and propaganda. While efforts have been made to develop resources and propose new methods for automatic detection, limited attention has been given to jointly modeling label detection and the generation of explanation-based rationales, which often leads to degraded classification performance when trained simultaneously. To address this challenge, we introduce MemeXplain, an explanation-enhanced dataset for propagandistic memes in Arabic and hateful memes in English, making it the first large-scale resource for these tasks. To solve these tasks, we propose a multi-stage optimization approach and train Vision-Language Models (VLMs). Our results show that this strategy significantly improves both label detection and explanation generation quality over the base model, outperforming the current state-of-the-art with an absolute improvement of ~1.4% (Acc) on ArMeme and ~2.2% (Acc) on Hateful Memes. For reproducibility and future research, we aim to make the MemeXplain dataset and scripts publicly available (https://github.com/MohamedBayan/MemeIntel).
Chinese: 该研究提出了MemeXplain数据集和一种多阶段优化方法,显著提升了有害内容的检测准确性和解释生成质量,在相关任务中取得了领先的性能改进。
English: The study introduces MemeXplain, a dataset for propagandistic and hateful memes, and a multi-stage optimization method that enhances both detection accuracy and explanation generation, achieving state-of-the-art performance improvements.

Authors:Zengqing Wu, Takayuki Ito
Title: The Hidden Strength of Disagreement: Unraveling the Consensus-Diversity Tradeoff in Adaptive Multi-Agent Systems
Abstract:
Consensus formation is pivotal in multi-agent systems (MAS), balancing collective coherence with individual diversity. Conventional LLM-based MAS primarily rely on explicit coordination, e.g., prompts or voting, risking premature homogenization. We argue that implicit consensus, where agents exchange information yet independently form decisions via in-context learning, can be more effective in dynamic environments that require long-horizon adaptability. By retaining partial diversity, systems can better explore novel strategies and cope with external shocks. We formalize a consensus-diversity tradeoff, showing conditions where implicit methods outperform explicit ones. Experiments on three scenarios -- Dynamic Disaster Response, Information Spread and Manipulation, and Dynamic Public-Goods Provision -- confirm partial deviation from group norms boosts exploration, robustness, and performance. We highlight emergent coordination via in-context learning, underscoring the value of preserving diversity for resilient decision-making.
中文: 多智能体系统中的隐性共识通过情境学习让智能体交换信息但独立决策,在动态环境中优于显性方法,因其保留多样性从而提升探索能力、鲁棒性和适应性。
English: Implicit consensus in multi-agent systems, where agents exchange information but make independent decisions through in-context learning, preserves diversity and outperforms explicit methods in dynamic environments by enhancing exploration, robustness, and adaptability.

Authors:Maram Hasanain, Md Arid Hasan, Mohamed Bayan Kmainasi, Elisa Sartori, Ali Ezzat Shahroor, Giovanni Da San Martino, Firoj Alam
Title: PropXplain: Can LLMs Enable Explainable Propaganda Detection?
Abstract:
There has been significant research on propagandistic content detection across different modalities and languages. However, most studies have primarily focused on detection, with little attention given to explanations justifying the predicted label. This is largely due to the lack of resources that provide explanations alongside annotated labels. To address this issue, we propose a multilingual (i.e., Arabic and English) explanation-enhanced dataset, the first of its kind. Additionally, we introduce an explanation-enhanced LLM for both label detection and rationale-based explanation generation. Our findings indicate that the model performs comparably while also generating explanations. We will make the dataset and experimental resources publicly available for the research community (https://github.com/firojalam/PropXplain).
Chinese: 本研究提出了一个多语言的解释增强数据集及相应的大语言模型,该模型在检测宣传内容的同时生成解释,性能相当,解决了可解释检测资源匮乏的问题。
English: This study introduces a multilingual explanation-enhanced dataset and a corresponding LLM model that performs comparably in detecting propagandistic content while generating explanations, addressing the lack of resources for explainable detection.

Authors:Jiahao Tang
Title: SDA-DDA Semi-supervised Domain Adaptation with Dynamic Distribution Alignment Network For Emotion Recognition Using EEG Signals
Abstract:
In this paper, we focus on the challenge of individual variability in affective brain-computer interfaces (aBCI), which employs electroencephalogram (EEG) signals to monitor and recognize human emotional states, thereby facilitating the advancement of emotion-aware technologies. The variability in EEG data across individuals poses a significant barrier to the development of effective and widely applicable aBCI models. To tackle this issue, we propose a novel transfer learning framework called Semi-supervised Domain Adaptation with Dynamic Distribution Alignment (SDA-DDA). This approach aligns the marginal and conditional probability distribution of source and target domains using maximum mean discrepancy (MMD) and conditional maximum mean discrepancy (CMMD). We introduce a dynamic distribution alignment mechanism to adjust differences throughout training and enhance adaptation. Additionally, a pseudo-label confidence filtering module is integrated into the semi-supervised process to refine pseudo-label generation and improve the estimation of conditional distributions. Extensive experiments on EEG benchmark databases (SEED, SEED-IV and DEAP) validate the robustness and effectiveness of SDA-DDA. The results demonstrate its superiority over existing methods in emotion recognition across various scenarios, including cross-subject and cross-session conditions. This advancement enhances the generalization and accuracy of emotion recognition, potentially fostering the development of personalized aBCI applications. The source code is accessible at https://github.com/XuanSuTrum/SDA-DDA.
中文: 本文提出SDA-DDA这一新型迁移学习框架,通过动态对齐概率分布和优化伪标签生成,有效解决情感脑机接口中的个体差异问题,在多个EEG数据集上展现出卓越的跨被试情感识别性能。
English: This paper introduces SDA-DDA, a novel transfer learning framework that addresses individual variability in affective brain-computer interfaces by dynamically aligning probability distributions and refining pseudo-labels, demonstrating superior emotion recognition performance across multiple EEG datasets.

Authors:Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang, Youliang Yuan, Xiaoyuan Liu, Wenxuan Wang, Wenxiang Jiao, Pinjia He, Zhaopeng Tu, Haodong Duan
Title: Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs
Abstract:
Despite significant progress on popular multimodal benchmarks, state-of-the-art Multimodal Large Language Models (MLLMs) continue to struggle with basic visual reasoning tasks that are trivially solved by humans, such as recognizing spatial relationships. To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. These subtests span four core domains of human visual cognition: (1) Visualization and Spatial Processing, (2) Perceptual and Closure, (3) Memory, and (4) Reasoning. We evaluate 20 frontier MLLMs from GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination, regardless of model size or prompting strategy. These findings suggest that current MLLM performance gains on high-level benchmarks do not reflect human-like low-level visual cognition, challenging the assumption that large-scale pretraining naturally induces gestalt-like perceptual capabilities. The dataset and evaluation toolkit are publicly available at: https://github.com/CUHK-ARISE/VisFactor.
Chinese: VisFactor基准测试显示,当前多模态大语言模型在人类基础视觉推理任务上表现不佳,评估了20个模型在四个核心视觉认知领域的表现,最高得分仅为25.19分(满分100分),远低于人类水平。
English: Current Multimodal Large Language Models perform poorly on basic visual reasoning tasks compared to humans, as revealed by the VisFactor benchmark, which evaluates 20 models across four core visual cognition domains and finds the highest score to be only 25.19 out of 100.

Authors:Guifang Xu, Zhiling Zhu, Xingcheng Guo, Wei Wang
Title: A Joint Learning Framework for Bridging Defect Prediction and Interpretation
Abstract:
Over the past fifty years, numerous software defect prediction (SDP) approaches have been proposed. However, the ability to explain why predictors make certain predictions remains limited. Explainable SDP has emerged as a promising solution by using explainable artificial intelligence (XAI) methods to clarify the decision-making processes of predictors. Despite this progress, there is still significant potential to enhance the reliability of existing approaches. To address this limitation, we treat defect prediction and the corresponding interpretation as two distinct but closely related tasks and propose a joint learning framework that allows for the simultaneous training of the predictor and its interpreter. The novelty of our approach lies in two main aspects: 1. We design feedback loops that convey the decision-making logic from the predictor to the interpreter. This ensures a high level of conciseness in decision logic and feature engineering for both the predictor and the interpreter, enabling the interpreter to achieve reliable local and global interpretability. 2. We incorporate the interpretation results as a penalty term in the loss function of the joint-learning framework. This not only improves the accuracy of the predictor but also imposes a stronger constraint on the reliability of the interpreter. We validated our proposed method against several existing explainable SDPs across multiple datasets. The results demonstrate its effectiveness in both interpretation and defect prediction. The source code for the proposed method is available at: https://github.com/BugPredictor/software-defect-prediction.git
中文: 本文提出一种联合学习框架,通过反馈循环和惩罚项同时训练软件缺陷预测器与解释器,在提升预测精度的同时增强了解释结果的可靠性。
English: This paper introduces a joint learning framework that simultaneously trains software defect predictors and interpreters through feedback loops and penalty terms, enhancing both prediction accuracy and interpretability reliability.

Authors:Kaibin Zhou, Kaifeng Huang, Hao Deng, Zelin Tao, Ziniu Liu, Lin Zhang, Shengjie Zhao
Title: Learning from Rendering: Realistic and Controllable Extreme Rainy Image Synthesis for Autonomous Driving Simulation
Abstract:
Autonomous driving simulators provide an effective and low-cost alternative for evaluating or enhancing visual perception models. However, the reliability of evaluation depends on the diversity and realism of the generated scenes. Extreme weather conditions, particularly extreme rainfalls, are rare and costly to capture in real-world settings. While simulated environments can help address this limitation, existing rainy image synthesizers often suffer from poor controllability over illumination and limited realism, which significantly undermines the effectiveness of the model evaluation. To that end, we propose a learning-from-rendering rainy image synthesizer, which combines the benefits of the realism of rendering-based methods and the controllability of learning-based methods. To validate the effectiveness of our extreme rainy image synthesizer on semantic segmentation task, we require a continuous set of well-labeled extreme rainy images. By integrating the proposed synthesizer with the CARLA driving simulator, we develop CARLARain an extreme rainy street scene simulator which can obtain paired rainy-clean images and labels under complex illumination conditions. Qualitative and quantitative experiments validate that CARLARain can effectively improve the accuracy of semantic segmentation models in extreme rainy scenes, with the models' accuracy (mIoU) improved by 5% - 8% on the synthetic dataset and significantly enhanced in real extreme rainy scenarios under complex illuminations. Our source code and datasets are available at https://github.com/kb824999404/CARLARain/.
中文: 自动驾驶模拟器为视觉感知模型的评估提供了低成本方案,但其可靠性依赖于场景的真实性,尤其针对极端降雨等罕见天气;现有合成器在可控性和真实性上不足,而提出的CARLARain模拟器融合渲染真实性和学习可控性,能生成配对的雨天-晴天图像,将语义分割模型在极端降雨场景下的准确率提升5%-8%。
English: Autonomous driving simulators are cost-effective for testing perception models, but their reliability depends on scene realism, especially for rare extreme weather like rainfall, which existing synthesizers handle poorly; the proposed CARLARain simulator combines rendering realism with learning-based controllability to generate paired rainy-clean images, improving semantic segmentation accuracy by 5%-8% in synthetic and real extreme rainy conditions.

Authors:Jianbin Jiao, Xina Cheng, Kailun Yang, Xiangrong Zhang, Licheng Jiao
Title: DeProPose: Deficiency-Proof 3D Human Pose Estimation via Adaptive Multi-View Fusion
Abstract:
3D human pose estimation has wide applications in fields such as intelligent surveillance, motion capture, and virtual reality. However, in real-world scenarios, issues such as occlusion, noise interference, and missing viewpoints can severely affect pose estimation. To address these challenges, we introduce the task of Deficiency-Aware 3D Pose Estimation. Traditional 3D pose estimation methods often rely on multi-stage networks and modular combinations, which can lead to cumulative errors and increased training complexity, making them unable to effectively address deficiency-aware estimation. To this end, we propose DeProPose, a flexible method that simplifies the network architecture to reduce training complexity and avoid information loss in multi-stage designs. Additionally, the model innovatively introduces a multi-view feature fusion mechanism based on relative projection error, which effectively utilizes information from multiple viewpoints and dynamically assigns weights, enabling efficient integration and enhanced robustness to overcome deficiency-aware 3D Pose Estimation challenges. Furthermore, to thoroughly evaluate this end-to-end multi-view 3D human pose estimation model and to advance research on occlusion-related challenges, we have developed a novel 3D human pose estimation dataset, termed the Deficiency-Aware 3D Pose Estimation (DA-3DPE) dataset. This dataset encompasses a wide range of deficiency scenarios, including noise interference, missing viewpoints, and occlusion challenges. Compared to state-of-the-art methods, DeProPose not only excels in addressing the deficiency-aware problem but also shows improvement in conventional scenarios, providing a powerful and user-friendly solution for 3D human pose estimation. The source code will be available at https://github.com/WUJINHUAN/DeProPose.
中文: DeProPose 提出了一种简化网络结构和多视角特征融合机制的方法,有效应对缺陷感知的3D姿态估计问题,其性能优于现有技术,并配有专门针对遮挡和噪声场景的新数据集。
English: DeProPose introduces a simplified network with a multi-view feature fusion mechanism to address deficiency-aware 3D pose estimation challenges, outperforming existing methods and supported by a novel dataset for occlusion and noise scenarios.

Authors:Liancheng Fang, Aiwei Liu, Hengrui Zhang, Henry Peng Zou, Weizhi Zhang, Philip S. Yu
Title: TabGen-ICL: Residual-Aware In-Context Example Selection for Tabular Data Generation
Abstract:
Large Language models (LLMs) have achieved encouraging results in tabular data generation. However, existing approaches require fine-tuning, which is computationally expensive. This paper explores an alternative: prompting a fixed LLM with in-context examples. We observe that using randomly selected in-context examples hampers the LLM's performance, resulting in sub-optimal generation quality. To address this, we propose a novel in-context learning framework: TabGen-ICL, to enhance the in-context learning ability of LLMs for tabular data generation. TabGen-ICL operates iteratively, retrieving a subset of real samples that represent the residual between currently generated samples and true data distributions. This approach serves two purposes: locally, it provides more effective in-context learning examples for the LLM in each iteration; globally, it progressively narrows the gap between generated and real data. Extensive experiments on five real-world tabular datasets demonstrate that TabGen-ICL significantly outperforms the random selection strategy. Specifically, it reduces the error rate by a margin of $3.5\%-42.2\%$ on fidelity metrics. We demonstrate for the first time that prompting a fixed LLM can yield high-quality synthetic tabular data. The code is provided in the \href{https://github.com/fangliancheng/TabGEN-ICL}{link}.
中文: 本文提出TabGen-ICL框架,通过迭代选择代表性样本来缩小生成数据与真实分布之间的差距,在五个真实表格数据集上的实验表明,该方法较随机选择策略显著提升了生成质量,错误率降低了3.5%至42.2%。
English: This paper introduces TabGen-ICL, an iterative in-context learning framework that enhances tabular data generation by selecting representative examples to narrow the gap between synthetic and real data, significantly outperforming random selection with error reductions of 3.5% to 42.2%.

Authors:Xichen Xu, Yanshu Wang, Yawen Huang, Jiaqi Liu, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu
Title: A Survey on Industrial Anomalies Synthesis
Abstract:
This paper comprehensively reviews anomaly synthesis methodologies. Existing surveys focus on limited techniques, missing an overall field view and understanding method interconnections. In contrast, our study offers a unified review, covering about 40 representative methods across Hand-crafted, Distribution-hypothesis-based, Generative models (GM)-based, and Vision-language models (VLM)-based synthesis. We introduce the first industrial anomaly synthesis (IAS) taxonomy. Prior works lack formal classification or use simplistic taxonomies, hampering structured comparisons and trend identification. Our taxonomy provides a fine-grained framework reflecting methodological progress and practical implications, grounding future research. Furthermore, we explore cross-modality synthesis and large-scale VLM. Previous surveys overlooked multimodal data and VLM in anomaly synthesis, limiting insights into their advantages. Our survey analyzes their integration, benefits, challenges, and prospects, offering a roadmap to boost IAS with multimodal learning. More resources are available at https://github.com/M-3LAB/awesome-anomaly-synthesis.
本文对异常合成方法进行了统一评述,首次提出工业分类体系并探索跨模态技术,为该领域的未来发展提供了系统框架。
This paper presents a unified review of anomaly synthesis methods, introducing the first industrial taxonomy and exploring cross-modality techniques to advance the field.

Authors:Kyungbok Lee, You Zhang, Zhiyao Duan
Title: Audio Visual Segmentation Through Text Embeddings
Abstract:
The goal of Audio-Visual Segmentation (AVS) is to localize and segment the sounding source objects from video frames. Research on AVS suffers from data scarcity due to the high cost of fine-grained manual annotations. Recent works attempt to overcome the challenge of limited data by leveraging the vision foundation model, Segment Anything Model (SAM), prompting it with audio to enhance its ability to segment sounding source objects. While this approach alleviates the model's burden on understanding visual modality by utilizing knowledge of pre-trained SAM, it does not address the fundamental challenge of learning audio-visual correspondence with limited data. To address this limitation, we propose \textbf{AV2T-SAM}, a novel framework that bridges audio features with the text embedding space of pre-trained text-prompted SAM. Our method leverages multimodal correspondence learned from rich text-image paired datasets to enhance audio-visual alignment. Furthermore, we introduce a novel feature, $\mathbf{\textit{\textbf{f}}_{CLIP} \odot \textit{\textbf{f}}_{CLAP}}$, which emphasizes shared semantics of audio and visual modalities while filtering irrelevant noise. Our approach outperforms existing methods on the AVSBench dataset by effectively utilizing pre-trained segmentation models and cross-modal semantic alignment. The source code is released at https://github.com/bok-bok/AV2T-SAM.
中文:提出的AV2T-SAM框架通过将音频特征与SAM的文本嵌入空间相连接并利用跨模态语义对齐,有效提升了音频-视觉分割性能,在AVSBench数据集上超越了现有方法。
English: The proposed AV2T-SAM framework enhances audio-visual segmentation by bridging audio features with the text embedding space of SAM and leveraging cross-modal semantic alignment, outperforming existing methods on the AVSBench dataset.

Authors:Zahra Shahrooei, Ali Baheri
Title: Optimal Transport-Guided Safety in Temporal Difference Reinforcement Learning
Abstract:
The primary goal of reinforcement learning is to develop decision-making policies that prioritize optimal performance, frequently without considering safety. In contrast, safe reinforcement learning seeks to reduce or avoid unsafe behavior. This paper views safety as taking actions with more predictable consequences under environment stochasticity and introduces a temporal difference algorithm that uses optimal transport theory to quantify the uncertainty associated with actions. By integrating this uncertainty score into the decision-making objective, the agent is encouraged to favor actions with more predictable outcomes. We theoretically prove that our algorithm leads to a reduction in the probability of visiting unsafe states. We evaluate the proposed algorithm on several case studies in the presence of various forms of environment uncertainty. The results demonstrate that our method not only provides safer behavior but also maintains the performance. A Python implementation of our algorithm is available at \href{https://github.com/SAILRIT/Risk-averse-TD-Learning}{https://github.com/SAILRIT/OT-guided-TD-Learning}.
Chinese: 本文提出了一种安全强化学习算法,利用最优传输理论量化行动不确定性,引导智能体选择结果更可预测的行动,在降低不安全状态访问概率的同时保持性能。
English: This paper introduces a safe reinforcement learning algorithm that uses optimal transport theory to quantify action uncertainty, encouraging predictable outcomes and reducing unsafe state visits while maintaining performance.

Authors:Alexander Kolpakov, Aidan Rocke
Title: Benford's Law from Turing Ensembles and Integer Partitions
Abstract:
We develop two complementary generative mechanisms that explain when and why Benford's first-digit law arises. First, a probabilistic Turing machine (PTM) ensemble induces a geometric law for code length. Maximizing its entropy under a constraint on halting length yields Benford statistics. This model shows a phase transition with respect to the halt probability. Second, a constrained partition model (Einstein-solid combinatorics) recovers the same logarithmic profile as the maximum-entropy solution under a coarse-grained entropy-rate constraint, clarifying the role of non-ergodicity (ensemble vs. trajectory averages). We also perform numerical experiments that corroborate our conclusions.
中文: 本研究通过概率图灵机集合和受限分配模型,从熵最大化和相变角度解释了本福特定律的产生机制,并通过数值实验验证了结论。
English: This study presents two generative models—a probabilistic Turing machine ensemble and a constrained partition model—that explain the emergence of Benford's law through entropy maximization and phase transitions, supported by numerical validation.

Authors:Alexander Kolpakov, Aidan Rocke
Title: Benford's Law from Turing Ensembles and Integer Partitions
Abstract:
We develop two complementary generative mechanisms that explain when and why Benford's first-digit law arises. First, a probabilistic Turing machine (PTM) ensemble induces a geometric law for codelength. Maximizing its entropy under a constraint on halting length yields Benford statistics. This model shows a phase transition with respect to the halt probability. Second, a constrained partition model (Einstein-solid combinatorics) recovers the same logarithmic profile as the maximum-entropy solution under a coarse-grained entropy-rate constraint, clarifying the role of non-ergodicity (ensemble vs. trajectory averages). We also perform numerical experiments that corroborate our conclusions.
中文: 本研究通过概率图灵机集合和受限分配模型,从熵最大化和相变角度解释了本福特定律的产生机制,并通过数值实验验证了结论。
English: This study presents two generative models—a probabilistic Turing machine ensemble and a constrained partition model—that explain the emergence of Benford's law through entropy maximization and phase transitions, supported by numerical validation.

Authors:Tuan-Anh Yang, Truong-Son Hy, Phuong D. Dao
Title: MOB-GCN: A Novel Multiscale Object-Based Graph Neural Network for Hyperspectral Image Classification
Abstract:
This paper introduces a novel multiscale object-based graph neural network called MOB-GCN for hyperspectral image (HSI) classification. The central aim of this study is to enhance feature extraction and classification performance by utilizing multiscale object-based image analysis (OBIA). Traditional pixel-based methods often suffer from low accuracy and speckle noise, while single-scale OBIA approaches may overlook crucial information of image objects at different levels of detail. MOB-GCN addresses this issue by extracting and integrating features from multiple segmentation scales to improve classification results using the Multiresolution Graph Network (MGN) architecture that can model fine-grained and global spatial patterns. By constructing a dynamic multiscale graph hierarchy, MOB-GCN offers a more comprehensive understanding of the intricate details and global context of HSIs. Experimental results demonstrate that MOB-GCN consistently outperforms single-scale graph convolutional networks (GCNs) in terms of classification accuracy, computational efficiency, and noise reduction, particularly when labeled data is limited. The implementation of MOB-GCN is publicly available at https://github.com/HySonLab/MultiscaleHSI
中文: 本文提出MOB-GCN多尺度对象图神经网络,通过整合多尺度分割特征改进高光谱图像分类,在精度和效率上均优于单尺度方法。
English: This paper presents MOB-GCN, a multiscale object-based graph neural network that enhances hyperspectral image classification by integrating features from multiple segmentation scales, demonstrating superior accuracy and efficiency over single-scale methods.

Authors:Abdelrahman Hussein
Title: Finite Element Theory for PHIMATS
Abstract:
This document summarizes the main ideas of the finite element method (FEM) theory and constitutive relations as implemented in the PHIMATS code (\href{https://github.com/ahcomat/PHIMATS.git}{GitHub Repository}). Rather than detailing the derivations or specific models, this document focuses on the key mathematical foundations and numerical strategies used within the implementation. For in-depth theoretical discussions, the reader is encouraged to consult the references. For citing this document, please use ... Hands-on examples can be found in CaseStudies directory on the GitHub repository. .
本文概述了PHIMATS代码中实现的有限元方法及本构关系的核心数学原理与数值策略,并引导读者查阅GitHub仓库中的案例及参考文献以获取深入理论探讨。
This document outlines the core mathematical principles and numerical approaches of the finite element method and constitutive relations as implemented in the PHIMATS code, directing readers to the GitHub repository for examples and references for further theoretical details.

Authors:Megan Tjandrasuwita, Chanakya Ekbote, Liu Ziyin, Paul Pu Liang
Title: Understanding the Emergence of Multimodal Representation Alignment
Abstract:
Multimodal representation learning is fundamentally about transforming incomparable modalities into comparable representations. While prior research primarily focused on explicitly aligning these representations through targeted learning objectives and model architectures, a recent line of work has found that independently trained unimodal models of increasing scale and performance can become implicitly aligned with each other. These findings raise fundamental questions regarding the emergence of aligned representations in multimodal learning. Specifically: (1) when and why does alignment emerge implicitly? and (2) is alignment a reliable indicator of performance? Through a comprehensive empirical investigation, we demonstrate that both the emergence of alignment and its relationship with task performance depend on several critical data characteristics. These include, but are not necessarily limited to, the degree of similarity between the modalities and the balance between redundant and unique information they provide for the task. Our findings suggest that alignment may not be universally beneficial; rather, its impact on performance varies depending on the dataset and task. These insights can help practitioners determine whether increasing alignment between modalities is advantageous or, in some cases, detrimental to achieving optimal performance. Code is released at https://github.com/MeganTj/multimodal_alignment.
中文摘要:最新研究表明,独立训练的大规模单模态模型会自发产生隐式对齐,但其效果取决于模态相似性和信息平衡等数据特征,因此对齐并非总能提升任务性能。
English Summary: Recent research reveals that implicit alignment emerges in independently trained large-scale unimodal models, but its effectiveness depends on data characteristics like modality similarity and information balance, making alignment not universally beneficial for performance.

Authors:Yang Xiang, Li Fan, Chenke Yin, Menglin Kong, Chengtao Ji
Title: Harnessing Light for Cold-Start Recommendations: Leveraging Epistemic Uncertainty to Enhance Performance in User-Item Interactions
Abstract:
Most recent paradigms of generative model-based recommendation still face challenges related to the cold-start problem. Existing models addressing cold item recommendations mainly focus on acquiring more knowledge to enrich embeddings or model inputs. However, many models do not assess the efficiency with which they utilize the available training knowledge, leading to the extraction of significant knowledge that is not fully used, thus limiting improvements in cold-start performance. To address this, we introduce the concept of epistemic uncertainty to indirectly define how efficiently a model uses the training knowledge. Since epistemic uncertainty represents the reducible part of the total uncertainty, we can optimize the recommendation model further based on epistemic uncertainty to improve its performance. To this end, we propose a Cold-Start Recommendation based on Epistemic Uncertainty (CREU) framework. Additionally, CREU is inspired by Pairwise-Distance Estimators (PaiDEs) to efficiently and accurately measure epistemic uncertainty by evaluating the mutual information between model outputs and weights in high-dimensional spaces. The proposed method is evaluated through extensive offline experiments on public datasets, which further demonstrate the advantages and robustness of CREU. The source code is available at https://github.com/EsiksonX/CREU.
中文:提出的CREU框架通过利用认知不确定性来优化训练知识的利用效率,从而解决推荐系统中的冷启动问题,实验证明其具有更优的性能和鲁棒性。
English: The proposed CREU framework addresses the cold-start problem in recommendation systems by leveraging epistemic uncertainty to optimize training knowledge utilization, demonstrating improved performance and robustness in experiments.

Authors:Arshia Afzal, Elias Abad Rocamora, Leyla Naz Candogan, Pol Puigdemont, Francesco Tonin, Yongtao Wu, Mahsa Shoaran, Volkan Cevher
Title: Linear Attention for Efficient Bidirectional Sequence Modeling
Abstract:
Transformers with linear attention enable fast and parallel training. Moreover, they can be formulated as Recurrent Neural Networks (RNNs), for efficient linear-time inference. While extensively evaluated in causal sequence modeling, they have yet to be extended to the bidirectional setting. This work introduces the LION framework, establishing new theoretical foundations for linear transformers in bidirectional sequence modeling. LION constructs a bidirectional RNN equivalent to full Linear Attention. This extends the benefits of linear transformers: parallel training, and efficient inference, into the bidirectional setting. Using LION, we cast three linear transformers to their bidirectional form: LION-LIT, the bidirectional variant corresponding to (Katharopoulos et al., 2020); LION-D, extending RetNet (Sun et al., 2023); and LION-S, a linear transformer with a stable selective mask inspired by selectivity of SSMs (Dao & Gu, 2024). Replacing the attention block with LION (-LIT, -D, -S) achieves performance on bidirectional tasks that approaches that of Transformers and State-Space Models (SSMs), while delivering significant improvements in training speed. Our implementation is available in http://github.com/LIONS-EPFL/LION.
Chinese: LION框架将线性变换器扩展至双向序列建模,实现了并行训练与高效推理,在双向任务中性能接近Transformer和状态空间模型,同时显著提升了训练速度。
English: The LION framework extends linear transformers to bidirectional sequence modeling, enabling parallel training and efficient inference while achieving performance comparable to Transformers and SSMs with improved training speed.

Authors:Sayedmohammadreza Rastegari, Sina Tabakhi, Xianyuan Liu, Wei Sang, Haiping Lu
Title: Co-evolution-based Metal-binding Residue Prediction with Graph Neural Networks
Abstract:
In computational structural biology, predicting metal-binding sites and their corresponding metal types is challenging due to the complexity of protein structures and interactions. Conventional sequence- and structure-based prediction approaches cannot capture the complex evolutionary relationships driving these interactions to facilitate understanding, while recent co-evolution-based approaches do not fully consider the entire structure of the co-evolved residue network. In this paper, we introduce MBGNN (Metal-Binding Graph Neural Network) that utilizes the entire co-evolved residue network and effectively captures the complex dependencies within protein structures via graph neural networks to enhance the prediction of co-evolved metal-binding residues and their associated metal types. Experimental results on a public dataset show that MBGNN outperforms existing co-evolution-based metal-binding prediction methods, and it is also competitive against recent sequence-based methods, showing the potential of integrating co-evolutionary insights with advanced machine learning to deepen our understanding of protein-metal interactions. The MBGNN code is publicly available at https://github.com/SRastegari/MBGNN.
Chinese: MBGNN是一种图神经网络模型,通过利用完整的共进化残基网络,提高了蛋白质中金属结合位点及其类型的预测准确性,优于现有方法,并深化了对蛋白质-金属相互作用的理解。
English: MBGNN, a graph neural network model, improves the prediction of metal-binding sites and their types in proteins by leveraging the full co-evolved residue network, outperforming existing methods and advancing the understanding of protein-metal interactions.

Authors:Chunyang Li, Weiqi Wang, Tianshi Zheng, Yangqiu Song
Title: Patterns Over Principles: The Fragility of Inductive Reasoning in LLMs under Noisy Observations
Abstract:
Inductive reasoning, a cornerstone of human cognition, enables generalization from limited data but hasn't yet been fully achieved by large language models (LLMs). While modern LLMs excel at reasoning tasks, their ability to maintain stable and consistent rule abstraction under imperfect observations remains underexplored. To fill this gap, in this work, we introduce Robust Rule Induction, a task that evaluates LLMs' capability in inferring rules from data that are fused with noisy examples. To address this task, we further propose Sample-steered Rule Refinement (SRR), a method enhancing reasoning stability via observation diversification and execution-guided feedback. Experiments across arithmetic, cryptography, and list functions reveal: (1) SRR outperforms other methods with minimal performance degradation under noise; (2) Despite slight accuracy variation, LLMs exhibit instability under noise (e.g., 0% accuracy change with only 70% consistent score); (3) Counterfactual task gaps highlight LLMs' reliance on memorized patterns over genuine abstraction. Our findings challenge LLMs' reasoning robustness, revealing susceptibility to hypothesis drift and pattern overfitting, while providing empirical evidence critical for developing human-like inductive systems. Code and data are available at https://github.com/HKUST-KnowComp/Robust-Rule-Induction.
中文: 本研究提出鲁棒规则归纳任务来评估大语言模型从含噪声数据中推断规则的能力,通过样本引导规则优化方法结合观察多样化和执行反馈提升推理稳定性,实验结果表明尽管准确率变化微小,模型仍存在假设漂移和模式过拟合的脆弱性。
English: This study introduces Robust Rule Induction to assess large language models' ability to infer rules from noisy data, proposing the Sample-steered Rule Refinement method that enhances reasoning stability through observation diversification and execution feedback, while experimental results reveal models' vulnerability to hypothesis drift and pattern overfitting despite minimal accuracy changes.

Authors:Wenwen Yu, Zhibo Yang, Jianqiang Wan, Sibo Song, Jun Tang, Wenqing Cheng, Yuliang Liu, Xiang Bai
Title: OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models
Abstract:
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding and the emergence of large language models capable of processing document-based questions. While various methods have been proposed to tackle the complexities of VsTP, existing solutions often rely on task-specific architectures and objectives for individual tasks. This leads to modal isolation and complex workflows due to the diversified targets and heterogeneous schemas. In this paper, we introduce OmniParser V2, a universal model that unifies VsTP typical tasks, including text spotting, key information extraction, table recognition, and layout analysis, into a unified framework. Central to our approach is the proposed Structured-Points-of-Thought (SPOT) prompting schemas, which improves model performance across diverse scenarios by leveraging a unified encoder-decoder architecture, objective, and input\&output representation. SPOT eliminates the need for task-specific architectures and loss functions, significantly simplifying the processing pipeline. Our extensive evaluations across four tasks on eight different datasets show that OmniParser V2 achieves state-of-the-art or competitive results in VsTP. Additionally, we explore the integration of SPOT within a multimodal large language model structure, further enhancing text localization and recognition capabilities, thereby confirming the generality of SPOT prompting technique. The code is available at \href{https://github.com/AlibabaResearch/AdvancedLiterateMachinery}{AdvancedLiterateMachinery}.
中文摘要:OmniParser V2通过提出的结构化思维点(SPOT)提示模式,将视觉文本解析中的多个任务统一到单一框架中,简化了处理流程并在多个数据集上取得了领先或竞争性的性能表现。
English Summary: OmniParser V2 introduces a unified framework using Structured-Points-of-Thought (SPOT) prompting to simplify visually-situated text parsing by integrating multiple tasks into a single model, achieving state-of-the-art performance across various datasets.

Authors:Anton Pogrebnjak, Julian Schelb, Andreas Spitz, Celina Kacperski, Roberto Ulloa
Title: Tag-Pag: A Dedicated Tool for Systematic Web Page Annotations
Abstract:
Tag-Pag is an application designed to simplify the categorization of web pages, a task increasingly common for researchers who scrape web pages to analyze individuals' browsing patterns or train machine learning classifiers. Unlike existing tools that focus on annotating sections of text, Tag-Pag systematizes page-level annotations, allowing users to determine whether an entire document relates to one or multiple predefined topics. Tag-Pag offers an intuitive interface to configure the input web pages and annotation labels. It integrates libraries to extract content from the HTML and URL indicators to aid the annotation process. It provides direct access to both scraped and live versions of the web page. Our tool is designed to expedite the annotation process with features like quick navigation, label assignment, and export functionality, making it a versatile and efficient tool for various research applications. Tag-Pag is available at https://github.com/Pantonius/TagPag.
中文:Tag-Pag是一款简化网页分类的应用,通过系统化整页标注和直观界面,帮助研究人员快速确定文档主题并导出数据,有效提升网页注释效率。
English: Tag-Pag is a web page categorization tool that streamlines page-level annotation for researchers, featuring an intuitive interface and efficient navigation to classify entire documents by predefined topics and export results.

Authors:Beibei Li, Tao Xiang, Beihong Jin, Yiyuan Zheng, Rui Zhao
Title: Semantic Gaussian Mixture Variational Autoencoder for Sequential Recommendation
Abstract:
Variational AutoEncoder (VAE) for Sequential Recommendation (SR), which learns a continuous distribution for each user-item interaction sequence rather than a determinate embedding, is robust against data deficiency and achieves significant performance. However, existing VAE-based SR models assume a unimodal Gaussian distribution as the prior distribution of sequence representations, leading to restricted capability to capture complex user interests and limiting recommendation performance when users have more than one interest. Due to that it is common for users to have multiple disparate interests, we argue that it is more reasonable to establish a multimodal prior distribution in SR scenarios instead of a unimodal one. Therefore, in this paper, we propose a novel VAE-based SR model named SIGMA. SIGMA assumes that the prior of sequence representation conforms to a Gaussian mixture distribution, where each component of the distribution semantically corresponds to one of multiple interests. For multi-interest elicitation, SIGMA includes a probabilistic multi-interest extraction module that learns a unimodal Gaussian distribution for each interest according to implicit item hyper-categories. Additionally, to incorporate the multimodal interests into sequence representation learning, SIGMA constructs a multi-interest-aware ELBO, which is compatible with the Gaussian mixture prior. Extensive experiments on public datasets demonstrate the effectiveness of SIGMA. The code is available at https://github.com/libeibei95/SIGMA.
中文:提出的SIGMA模型通过采用高斯混合先验来捕捉用户的多重兴趣,克服了传统单峰高斯假设的局限,并借助新颖的多兴趣提取模块和适配的ELBO框架,在序列推荐中展现出卓越性能。
English: The proposed SIGMA model enhances sequential recommendation by employing a Gaussian mixture prior to capture users' multiple interests, overcoming the limitations of traditional unimodal Gaussian assumptions and demonstrating superior performance through a novel multi-interest extraction module and an adapted ELBO framework.

Authors:Feng Liu, Hanyang Wang, Siyuan Shen
Title: Robust Dynamic Facial Expression Recognition
Abstract:
The study of Dynamic Facial Expression Recognition (DFER) is a nascent field of research that involves the automated recognition of facial expressions in video data. Although existing research has primarily focused on learning representations under noisy and hard samples, the issue of the coexistence of both types of samples remains unresolved. In order to overcome this challenge, this paper proposes a robust method of distinguishing between hard and noisy samples. This is achieved by evaluating the prediction agreement of the model on different sampled clips of the video. Subsequently, methodologies that reinforce the learning of hard samples and mitigate the impact of noisy samples can be employed. Moreover, to identify the principal expression in a video and enhance the model's capacity for representation learning, comprising a key expression re-sampling framework and a dual-stream hierarchical network is proposed, namely Robust Dynamic Facial Expression Recognition (RDFER). The key expression re-sampling framework is designed to identify the key expression, thereby mitigating the potential confusion caused by non-target expressions. RDFER employs two sequence models with the objective of disentangling short-term facial movements and long-term emotional changes. The proposed method has been shown to outperform current State-Of-The-Art approaches in DFER through extensive experimentation on benchmark datasets such as DFEW and FERV39K. A comprehensive analysis provides valuable insights and observations regarding the proposed agreement. This work has significant implications for the field of dynamic facial expression recognition and promotes the further development of the field of noise-consistent robust learning in dynamic facial expression recognition. The code is available from [https://github.com/Cross-Innovation-Lab/RDFER].
中文: 本文提出RDFER方法,通过评估预测一致性区分困难与噪声样本,并采用双流网络加强表征学习,在动态面部表情识别基准数据集上实现了最优性能。
English: This paper introduces RDFER, a robust method for dynamic facial expression recognition that distinguishes hard and noisy samples through prediction agreement evaluation and employs a dual-stream network to enhance representation learning, achieving state-of-the-art results on benchmark datasets.

Authors:Heng Gao, Zhuolin He, Jian Pu
Title: Detecting OOD Samples via Optimal Transport Scoring Function
Abstract:
To deploy machine learning models in the real world, researchers have proposed many OOD detection algorithms to help models identify unknown samples during the inference phase and prevent them from making untrustworthy predictions. Unlike methods that rely on extra data for outlier exposure training, post hoc methods detect Out-of-Distribution (OOD) samples by developing scoring functions, which are model agnostic and do not require additional training. However, previous post hoc methods may fail to capture the geometric cues embedded in network representations. Thus, in this study, we propose a novel score function based on the optimal transport theory, named OTOD, for OOD detection. We utilize information from features, logits, and the softmax probability space to calculate the OOD score for each test sample. Our experiments show that combining this information can boost the performance of OTOD with a certain margin. Experiments on the CIFAR-10 and CIFAR-100 benchmarks demonstrate the superior performance of our method. Notably, OTOD outperforms the state-of-the-art method GEN by 7.19% in the mean FPR@95 on the CIFAR-10 benchmark using ResNet-18 as the backbone, and by 12.51% in the mean FPR@95 using WideResNet-28 as the backbone. In addition, we provide theoretical guarantees for OTOD. The code is available in https://github.com/HengGao12/OTOD.
中文: 本研究提出基于最优传输理论的OTOD方法,通过综合特征、逻辑值和Softmax概率信息进行分布外检测,在CIFAR基准测试中显著优于现有最优方法。
English: This study introduces OTOD, a novel post hoc OOD detection method using optimal transport theory that outperforms state-of-the-art approaches by combining features, logits, and softmax probabilities, achieving significant improvements on CIFAR benchmarks.

Authors:Patrick Tser Jern Kon, Jiachen Liu, Qiuyi Ding, Yiming Qiu, Zhenning Yang, Yibo Huang, Jayanth Srinivasa, Myungjin Lee, Mosharaf Chowdhury, Ang Chen
Title: Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents
Abstract:
Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers, and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4$\times$ improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.
Chinese: Curie AI代理框架通过可靠性、控制性和可解释性模块将严谨性融入科学实验过程,在涵盖46个计算机科学问题的基准测试中,准确性提升了3.4倍。
English: The Curie AI agent framework enhances scientific experimentation by embedding rigor through reliability, control, and interpretability modules, achieving a 3.4x improvement in accuracy on a benchmark of 46 computer science questions.

Authors:Jathurshan Pradeepkumar, Xihao Piao, Zheng Chen, Jimeng Sun
Title: Tokenizing Single-Channel EEG with Time-Frequency Motif Learning
Abstract:
Foundation models are reshaping EEG analysis, yet an important problem of EEG tokenization remains a challenge. This paper presents TFM-Tokenizer, a novel tokenization framework that learns a vocabulary of time-frequency motifs from single-channel EEG signals and encodes them into discrete tokens. We propose a dual-path architecture with time-frequency masking to capture robust motif representations, and it is model-agnostic, supporting both lightweight transformers and existing foundation models for downstream tasks. Our study demonstrates three key benefits: Accuracy: Experiments on four diverse EEG benchmarks demonstrate consistent performance gains across both single- and multi-dataset pretraining settings, achieving up to 17% improvement in Cohen's Kappa over strong baselines. Generalization: Moreover, as a plug-and-play component, it consistently boosts the performance of diverse foundation models, including BIOT and LaBraM. Scalability: By operating at the single-channel level rather than relying on the strict 10-20 EEG system, our method has the potential to be device-agnostic. Experiments on ear-EEG sleep staging, which differs from the pretraining data in signal format, channel configuration, recording device, and task, show that our tokenizer outperforms baselines by 14%. A comprehensive token analysis reveals strong class-discriminative, frequency-aware, and consistent structure, enabling improved representation quality and interpretability. Code is available at https://github.com/Jathurshan0330/TFM-Tokenizer.
中文: 本文提出TFM-Tokenizer这一模型无关的标记化框架,通过从单通道脑电信号中学习时频基元并编码为离散标记,在多种基准测试中显著提升性能,并凭借其准确性、泛化能力和可扩展性有效增强了现有基础模型。
English: This paper introduces TFM-Tokenizer, a model-agnostic framework that learns time-frequency motifs from single-channel EEG signals and encodes them into discrete tokens, achieving significant performance gains across diverse benchmarks and enhancing existing foundation models through improved accuracy, generalization, and scalability.

Authors:Zheling Tan, Kexin Ding, Jin Gao, Mu Zhou, Dimitris Metaxas, Shaoting Zhang, Dequan Wang
Title: MedForge: Building Medical Foundation Models Like Open Source Software Development
Abstract:
Foundational models (FMs) have made significant strides in the healthcare domain. Yet the data silo challenge and privacy concern remain in healthcare systems, hindering safe medical data sharing and collaborative model development among institutions. The collection and curation of scalable clinical datasets increasingly become the bottleneck for training strong FMs. In this study, we propose Medical Foundation Models Merging (MedForge), a cooperative framework enabling a community-driven medical foundation model development, meanwhile preventing the information leakage of raw patient data and mitigating synchronization model development issues across clinical institutions. MedForge offers a bottom-up model construction mechanism by flexibly merging task-specific Low-Rank Adaptation (LoRA) modules, which can adapt to downstream tasks while retaining original model parameters. Through an asynchronous LoRA module integration scheme, the resulting composite model can progressively enhance its comprehensive performance on various clinical tasks. MedForge shows strong performance on multiple clinical datasets (e.g., breast cancer, lung cancer, and colon cancer) collected from different institutions. Our major findings highlight the value of collaborative foundation models in advancing multi-center clinical collaboration effectively and cohesively. Our code is publicly available at https://github.com/TanZheling/MedForge.
中文: MedForge提出了一种通过融合任务特定LoRA模块的协作式医疗基础模型开发框架,既保护原始患者数据隐私又实现跨机构协同建模,在多种临床数据集上展现出卓越性能。
English: MedForge introduces a collaborative framework for developing medical foundation models by merging task-specific LoRA modules, enabling multi-institutional cooperation without sharing raw patient data and demonstrating strong performance across various clinical datasets.

Authors:Mike Ranzinger, Greg Heinrich, Pavlo Molchanov, Jan Kautz, Bryan Catanzaro, Andrew Tao
Title: FeatSharp: Your Vision Model Features, Sharper
Abstract:
The feature maps of vision encoders are fundamental to myriad modern AI tasks, ranging from core perception algorithms (e.g. semantic segmentation, object detection, depth perception, etc.) to modern multimodal understanding in vision-language models (VLMs). Currently, in computer vision, the frontier of general purpose vision backbones is Vision Transformers (ViT), typically trained using contrastive loss (e.g. CLIP). A key problem with most off-the-shelf ViTs, particularly CLIP, is that these models are inflexibly low resolution. Most run at $224 \times 224$px, while the "high-resolution" versions are around $378-448$px, but still inflexible. We introduce a novel method to coherently and cheaply upsample the feature maps of low-resolution vision encoders while picking up on fine-grained details that would otherwise be lost due to resolution. We demonstrate the effectiveness of this approach on core perception tasks as well as within agglomerative model training using RADIO as a way of providing richer targets for distillation. Code available at https://github.com/NVlabs/FeatSharp .
中文摘要:本文提出了一种新颖方法,能够高效地对低分辨率视觉编码器(如CLIP)的特征图进行上采样,捕捉通常因分辨率不足而丢失的精细细节,并在核心感知任务和模型蒸馏中验证了其有效性。
English Summary: This paper introduces a novel method to efficiently upsample feature maps from low-resolution vision encoders like CLIP, capturing fine-grained details that are typically lost, and demonstrates its effectiveness on core perception tasks and model distillation.

Authors:Prashant Shekhar, Bidur Devkota, Dumindu Samaraweera, Laxima Niure Kandel, Manoj Babu
Title: Cross-Model Transferability of Adversarial Patches in Real-time Segmentation for Autonomous Driving
Abstract:
Adversarial attacks pose a significant threat to deep learning models, particularly in safety-critical applications like healthcare and autonomous driving. Recently, patch based attacks have demonstrated effectiveness in real-time inference scenarios owing to their 'drag and drop' nature. Following this idea for Semantic Segmentation (SS), here we propose a novel Expectation Over Transformation (EOT) based adversarial patch attack that is more realistic for autonomous vehicles. To effectively train this attack we also propose a 'simplified' loss function that is easy to analyze and implement. Using this attack as our basis, we investigate whether adversarial patches once optimized on a specific SS model, can fool other models or architectures. We conduct a comprehensive cross-model transferability analysis of adversarial patches trained on SOTA Convolutional Neural Network (CNN) models such PIDNet-S, PIDNet-M and PIDNet-L, among others. Additionally, we also include the Segformer model to study transferability to Vision Transformers (ViTs). All of our analysis is conducted on the widely used Cityscapes dataset. Our study reveals key insights into how model architectures (CNN vs CNN or CNN vs. Transformer-based) influence attack susceptibility. In particular, we conclude that although the transferability (effectiveness) of attacks on unseen images of any dimension is really high, the attacks trained against one particular model are minimally effective on other models. And this was found to be true for both ViT and CNN based models. Additionally our results also indicate that for CNN-based models, the repercussions of patch attacks are local, unlike ViTs. Per-class analysis reveals that simple-classes like 'sky' suffer less misclassification than others. The code for the project is available at: https://github.com/p-shekhar/adversarial-patch-transferability
中文摘要:本研究针对自动驾驶中的语义分割任务,提出了一种基于期望变换的新型对抗性补丁攻击,并通过跨模型分析发现:尽管攻击在不同图像维度间具有高迁移性,但对特定模型的攻击对其他模型效果有限,且CNN模型受影响范围局部化,而视觉变换器受影响更广泛。
English Summary: This study introduces a novel adversarial patch attack using Expectation Over Transformation for semantic segmentation in autonomous vehicles, revealing through cross-model analysis that while attacks show high transferability across image dimensions, they remain model-specific with localized effects on CNNs compared to Vision Transformers.

Authors:Yuan Tian, Daniel Lee, Fei Wu, Tung Mai, Kun Qian, Siddhartha Sahai, Tianyi Zhang, Yunyao Li
Title: Text-to-SQL Domain Adaptation via Human-LLM Collaborative Data Annotation
Abstract:
Text-to-SQL models, which parse natural language (NL) questions to executable SQL queries, are increasingly adopted in real-world applications. However, deploying such models in the real world often requires adapting them to the highly specialized database schemas used in specific applications. We find that existing text-to-SQL models experience significant performance drops when applied to new schemas, primarily due to the lack of domain-specific data for fine-tuning. This data scarcity also limits the ability to effectively evaluate model performance in new domains. Continuously obtaining high-quality text-to-SQL data for evolving schemas is prohibitively expensive in real-world scenarios. To bridge this gap, we propose SQLsynth, a human-in-the-loop text-to-SQL data annotation system. SQLsynth streamlines the creation of high-quality text-to-SQL datasets through human-LLM collaboration in a structured workflow. A within-subjects user study comparing SQLsynth with manual annotation and ChatGPT shows that SQLsynth significantly accelerates text-to-SQL data annotation, reduces cognitive load, and produces datasets that are more accurate, natural, and diverse. Our code is available at https://github.com/adobe/nl_sql_analyzer.
中文: 针对文本转SQL模型在适应专业数据库模式时因缺乏领域数据而性能下降的问题,我们提出了SQLsynth系统,通过人机协作的工作流高效生成高质量数据集,显著提升了标注速度和数据质量。
English: Text-to-SQL models struggle with performance when adapting to specialized database schemas due to limited domain-specific data, prompting the development of SQLsynth, a human-in-the-loop system that enhances data annotation efficiency and quality through collaboration between humans and large language models.

Authors:Parth Bhalerao, Mounika Yalamarty, Brian Trinh, Oana Ignat
Title: Multi-Agent Multimodal Models for Multicultural Text to Image Generation
Abstract:
Large Language Models (LLMs) demonstrate impressive performance across various multimodal tasks. However, their effectiveness in cross-cultural contexts remains limited due to the predominantly Western-centric nature of existing data and models. Meanwhile, multi-agent models have shown strong capabilities in solving complex tasks. In this paper, we evaluate the performance of LLMs in a multi-agent interaction setting for the novel task of multicultural image generation. Our key contributions are: (1) We introduce MosAIG, a Multi-Agent framework that enhances multicultural Image Generation by leveraging LLMs with distinct cultural personas; (2) We provide a dataset of 9,000 multicultural images spanning five countries, three age groups, two genders, 25 historical landmarks, and five languages; and (3) We demonstrate that multi-agent interactions outperform simple, no-agent models across multiple evaluation metrics, offering valuable insights for future research. Our dataset and models are available at https://github.com/OanaIgnat/MosAIG.
Chinese: 本文提出MosAIG多智能体框架,通过利用具有不同文化角色的LLMs增强多元文化图像生成,并基于涵盖五个国家、三种年龄组等维度的9000张图像数据集,证明其在多项评估指标上优于无智能体模型。
English: This paper introduces MosAIG, a multi-agent framework that enhances multicultural image generation by leveraging LLMs with distinct cultural personas, demonstrating superior performance over no-agent models through a comprehensive dataset of 9,000 images across diverse cultural dimensions.

Authors:William Rudman, Michal Golovanevsky, Amir Bar, Vedant Palit, Yann LeCun, Carsten Eickhoff, Ritambhara Singh
Title: Forgotten Polygons: Multimodal Large Language Models are Shape-Blind
Abstract:
Despite strong performance on vision-language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem-solving, with both open-source and state-of-the-art models falling short of human performance on visual-math benchmarks. To systematically examine visual-mathematical reasoning in MLLMs, we (1) evaluate their understanding of geometric primitives, (2) test multi-step reasoning, and (3) explore a potential solution to improve visual reasoning capabilities. Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons. We analyze these failures through the lens of dual-process theory and show that MLLMs rely on System 1 (intuitive, memorized associations) rather than System 2 (deliberate reasoning). Consequently, MLLMs fail to count the sides of both familiar and novel shapes, suggesting they have neither learned the concept of sides nor effectively process visual inputs. Finally, we propose Visually Cued Chain-of-Thought (VC-CoT) prompting, which enhances multi-step mathematical reasoning by explicitly referencing visual annotations in diagrams, boosting GPT-4o's accuracy on an irregular polygon side-counting task from 7% to 93%. Our findings suggest that System 2 reasoning in MLLMs remains an open problem, and visually-guided prompting is essential for successfully engaging visual reasoning. Code available at: https://github.com/rsinghlab/Shape-Blind.
Chinese: 多模态大语言模型在视觉数学推理上存在显著缺陷,依赖直觉联想而非系统分析,但通过视觉引导的提示方法(如VC-CoT)可大幅提升其表现。
English: Multimodal Large Language Models exhibit significant deficiencies in visual-mathematical reasoning, relying on intuitive associations rather than deliberate analysis, but their performance can be dramatically improved through visually-guided prompting techniques like VC-CoT.

Authors:Alan Zhu, Jiaqi Ma, Qiaozhu Mei
Title: Efficient Estimation of Shortest-Path Distance Distributions to Samples in Graphs
Abstract:
As large graph datasets become increasingly common across many fields, sampling is often needed to reduce the graphs into manageable sizes. This procedure raises critical questions about representativeness as no sample can capture the properties of the original graph perfectly, and different parts of the graph are not evenly affected by the loss. Recent work has shown that the distances from the non-sampled nodes to the sampled nodes can be a quantitative indicator of bias and fairness in graph machine learning. However, to our knowledge, there is no method for evaluating how a sampling method affects the distribution of shortest-path distances without actually performing the sampling and shortest-path calculation. In this paper, we present an accurate and efficient framework for estimating the distribution of shortest-path distances to the sample, applicable to a wide range of sampling methods and graph structures. Our framework is faster than empirical methods and only requires the specification of degree distributions. We also extend our framework to handle graphs with community structures. While this introduces a decrease in accuracy, we demonstrate that our framework remains highly accurate on downstream comparison-based tasks. Code is publicly available at https://github.com/az1326/shortest_paths.
中文: 本文提出了一种高效框架,用于估计大图中采样节点间最短路径距离的分布,适用于多种采样方法且仅需度分布信息,同时可扩展到具有社区结构的图。
English: This paper introduces an efficient framework for estimating the distribution of shortest-path distances to sampled nodes in large graphs, applicable to various sampling methods and requiring only degree distributions, while also extending to graphs with community structures.

Authors:Hongjie Zhu, Zeyu Zhang, Guansong Pang, Xu Wang, Shimin Wen, Yu Bai, Daji Ergu, Ying Cai, Yang Zhao
Title: DOEI: Dual Optimization of Embedding Information for Attention-Enhanced Class Activation Maps
Abstract:
Weakly supervised semantic segmentation (WSSS) typically utilizes limited semantic annotations to obtain initial Class Activation Maps (CAMs). However, due to the inadequate coupling between class activation responses and semantic information in high-dimensional space, the CAM is prone to object co-occurrence or under-activation, resulting in inferior recognition accuracy. To tackle this issue, we propose DOEI, Dual Optimization of Embedding Information, a novel approach that reconstructs embedding representations through semantic-aware attention weight matrices to optimize the expression capability of embedding information. Specifically, DOEI amplifies tokens with high confidence and suppresses those with low confidence during the class-to-patch interaction. This alignment of activation responses with semantic information strengthens the propagation and decoupling of target features, enabling the generated embeddings to more accurately represent target features in high-level semantic space. In addition, we propose a hybrid-feature alignment module in DOEI that combines RGB values, embedding-guided features, and self-attention weights to increase the reliability of candidate tokens. Comprehensive experiments show that DOEI is an effective plug-and-play module that empowers state-of-the-art visual transformer-based WSSS models to significantly improve the quality of CAMs and segmentation performance on popular benchmarks, including PASCAL VOC (+3.6%, +1.5%, +1.2% mIoU) and MS COCO (+1.2%, +1.6% mIoU). Code will be available at https://github.com/AIGeeksGroup/DOEI.
Chinese: DOEI是一种通过语义感知注意力优化嵌入表示的新方法,在类别到区块交互中增强高置信度标记并抑制低置信度标记,从而显著提升PASCAL VOC和MS COCO等基准测试中的分割性能。
English: DOEI is a novel approach that optimizes embedding representations through semantic-aware attention to enhance Class Activation Maps (CAMs) by amplifying high-confidence tokens and suppressing low-confidence ones, significantly improving segmentation performance on benchmarks like PASCAL VOC and MS COCO.

Authors:Aryan Jadon, Avinash Patil, Shashank Kumar
Title: Enhancing Domain-Specific Retrieval-Augmented Generation: Synthetic Data Generation and Evaluation using Reasoning Models
Abstract:
Retrieval-Augmented Generation (RAG) systems face significant performance gaps when applied to technical domains requiring precise information extraction from complex documents. Current evaluation methodologies relying on document-level metrics inadequately capture token-resolution retrieval accuracy that is critical for domain-related documents. We propose a framework combining granular evaluation metrics with synthetic data generation to optimize domain-specific RAG performance. First, we introduce token-aware metrics Precision $Ω$ and Intersection-over-Union (IoU) that quantify context preservation versus information density trade-offs inherent in technical texts. Second, we develop a reasoning model-driven pipeline using instruction-tuned LLMs (DeepSeek-R1, DeepSeek-R1 distilled variants, and Phi-4) to generate context-anchored QA pairs with discontinuous reference spans across three specialized corpora: SEC 10-K filings (finance), biomedical abstracts (PubMed), and APT threat reports (cybersecurity). Our empirical analysis reveals critical insights: smaller chunks (less than 10 tokens) improve precision by 31-42% (IoU = 0.071 vs. baseline 0.053) at recall costs (-18%), while domain-specific embedding strategies yield 22% variance in optimal chunk sizing (5-20 tokens). The DeepSeek-R1-Distill-Qwen-32B model demonstrates superior concept alignment (+14% mean IoU over alternatives), though no configuration universally dominates. Financial texts favor larger chunks for risk factor coverage (Recall = 0.81 at size = 20), whereas cybersecurity content benefits from atomic segmentation, Precision $Ω= 0.28$ at size = 5. Our code is available on https://github.com/aryan-jadon/Synthetic-Data-Generation-and-Evaluation-using-Reasoning-Model
中文摘要:本研究提出了一种结合细粒度评估指标与合成数据生成的框架,以提升检索增强生成(RAG)在技术领域的性能,发现金融、生物医学和网络安全等专业文献的最佳文本块大小与嵌入策略存在显著差异。
English Summary: This study introduces a framework combining token-level evaluation metrics and synthetic data generation to enhance Retrieval-Augmented Generation (RAG) performance in technical domains, revealing that optimal chunk sizes and embedding strategies vary significantly across specialized corpora like finance, biomedical, and cybersecurity documents.

Authors:Haokun Chen, Sebastian Szyller, Weilin Xu, Nageen Himayat
Title: Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models
Abstract:
Large language models (LLMs) are trained using massive datasets, which often contain undesirable content such as harmful texts, personal information, and copyrighted material. To address this, machine unlearning aims to remove information from trained models. Recent work has shown that soft token attacks (STA) can successfully extract unlearned information from LLMs, but in this work we show that STAs can be an inadequate tool for auditing unlearning. Using common benchmarks such as Who Is Harry Potter? and TOFU, we demonstrate that in a strong auditor setting such attacks can elicit any information from the LLM, regardless of the deployed unlearning algorithm or whether the queried content was originally present in the training corpus. We further show that STA with just a few soft tokens (1-10) can elicit random strings over 400 characters long, indicating that STAs must be used carefully to effectively audit unlearning. Example code can be found at: https://github.com/IntelLabs/LLMart/tree/main/examples/unlearning
Chinese: 软令牌攻击无法有效审核机器遗忘,因为它能从大语言模型中提取随机或任意信息,与遗忘算法或训练数据是否包含该内容无关。
English: Soft token attacks can ineffectively audit machine unlearning by extracting random or any information from large language models, regardless of unlearning methods or original training data presence.

Authors:Mengyang Sun, Yihao Wang, Tao Feng, Dan Zhang, Yifan Zhu, Jie Tang
Title: A Stronger Mixture of Low-Rank Experts for Fine-Tuning Foundation Models
Abstract:
In order to streamline the fine-tuning of foundation models, Low-Rank Adapters (LoRAs) have been substantially adopted across various fields, including instruction tuning and domain adaptation. The underlying concept of LoRA involves decomposing a full-rank matrix into the product of two lower-rank matrices, which reduces storage consumption and accelerates the training process. Furthermore, to address the limited expressive capacity of LoRA, the Mixture-of-Expert (MoE) has been introduced for incorporating multiple LoRA adapters. The integration of LoRA experts leads to a visible improvement across several downstream scenes. However, the mixture of LoRAs (MoE-LoRA) still exhibits its low robustness during tuning and inferring. Inspired by the Riemannian Preconditioners which train LoRA as a sub-space projector, we propose a new training strategy for MoE-LoRA, to stabilize and boost its feature learning procedure by multi-space projections. Examinations on SGD and AdamW optimizers demonstrate the effectiveness of our methodology. Source code is available at https://github.com/THUDM/MoELoRA_Riemannian.
中文: 针对混合专家低秩适配器(MoE-LoRA)在调优和推理中稳定性不足的问题,提出了一种基于多空间投影的新训练策略,通过在SGD和AdamW优化器上的测试验证了其有效性。
English: To enhance the robustness and feature learning of Mixture-of-Experts Low-Rank Adapters (MoE-LoRA), which face stability issues during tuning and inference, a novel training strategy using multi-space projections is proposed and validated on optimizers like SGD and AdamW.

Authors:Wenyue Hua, Tyler Wong, Sun Fei, Liangming Pan, Adam Jardine, William Yang Wang
Title: InductionBench: LLMs Fail in the Simplest Complexity Class
Abstract:
Large language models (LLMs) have shown remarkable improvements in reasoning and many existing benchmarks have been addressed by models such as o1 and o3 either fully or partially. However, a majority of these benchmarks emphasize deductive reasoning, including mathematical and coding tasks in which rules such as mathematical axioms or programming syntax are clearly defined, based on which LLMs can plan and apply these rules to arrive at a solution. In contrast, inductive reasoning, where one infers the underlying rules from observed data, remains less explored. Such inductive processes lie at the heart of scientific discovery, as they enable researchers to extract general principles from empirical observations. To assess whether LLMs possess this capacity, we introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs. Our experimental findings reveal that even the most advanced models available struggle to master the simplest complexity classes within the subregular hierarchy of functions, highlighting a notable deficiency in current LLMs' inductive reasoning capabilities. Coda and data are available https://github.com/Wenyueh/inductive_reasoning_benchmark.
中文摘要:大型语言模型在演绎推理方面表现出色,但在归纳推理方面存在明显不足,新基准测试InductionBench显示,即使是最先进的模型也难以从数据中推断出基本规则。
English Summary: Large language models excel in deductive reasoning but struggle with inductive reasoning, as shown by the new benchmark InductionBench, which reveals their difficulty in inferring rules from data despite their advanced capabilities.

Authors:Yuxuan Zhou, Heng Li, Zhi-Qi Cheng, Xudong Yan, Yifei Dong, Mario Fritz, Margret Keuper
Title: MaxSup: Overcoming Representation Collapse in Label Smoothing
Abstract:
Label Smoothing (LS) is widely adopted to reduce overconfidence in neural network predictions and improve generalization. Despite these benefits, recent studies reveal two critical issues with LS. First, LS induces overconfidence in misclassified samples. Second, it compacts feature representations into overly tight clusters, diluting intra-class diversity, although the precise cause of this phenomenon remained elusive. In this paper, we analytically decompose the LS-induced loss, exposing two key terms: (i) a regularization term that dampens overconfidence only when the prediction is correct, and (ii) an error-amplification term that arises under misclassifications. This latter term compels the network to reinforce incorrect predictions with undue certainty, exacerbating representation collapse. To address these shortcomings, we propose Max Suppression (MaxSup), which applies uniform regularization to both correct and incorrect predictions by penalizing the top-1 logit rather than the ground-truth logit. Through extensive feature-space analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Experiments on large-scale image classification and multiple downstream tasks confirm that MaxSup is a more robust alternative to LS, consistently reducing overconfidence while preserving richer feature representations. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization
中文: 标签平滑会导致错误预测的过度自信和特征压缩,引发表示坍塌,而提出的最大抑制方法通过均匀正则化预测,恢复特征多样性并减少过度自信。
English: Label Smoothing is found to induce overconfidence in errors and compress features, leading to representation collapse, while the proposed Max Suppression method uniformly regularizes predictions to restore feature diversity and reduce overconfidence.

Authors:Yanyang Li, Michael Lyu, Liwei Wang
Title: Learning to Reason from Feedback at Test-Time
Abstract:
Solving complex tasks in a single attempt is challenging for large language models (LLMs). Iterative interaction with the environment and feedback is often required to achieve success, making effective feedback utilization a critical topic. Existing approaches either struggle with length generalization or rely on naive retries without leveraging prior information. In this paper, we introduce FTTT, a novel paradigm that formulates feedback utilization as an optimization problem at test time. Additionally, we propose a learnable test-time optimizer, OpTune, to effectively exploit feedback. Experiments on two LLMs across four reasoning datasets demonstrate that FTTT and OpTune achieve superior scalability and performance.
中文: 大语言模型常需迭代反馈解决复杂任务,本文提出的FTTT范式及其可学习优化器OpTune将反馈利用构建为测试时优化问题,有效克服现有方法的局限,在实验中展现出卓越的扩展性和性能。
English: Large language models often require iterative feedback to solve complex tasks, and the proposed FTTT paradigm with its learnable optimizer OpTune effectively addresses existing limitations by treating feedback utilization as a test-time optimization problem, achieving superior scalability and performance in experiments.

Authors:Hao Bai, Yifei Zhou, Li Erran Li, Sergey Levine, Aviral Kumar
Title: Digi-Q: Learning Q-Value Functions for Training Device-Control Agents
Abstract:
While a number of existing approaches for building foundation model agents rely on prompting or fine-tuning with human demonstrations, it is not sufficient in dynamic environments (e.g., mobile device control). On-policy reinforcement learning (RL) should address these limitations, but collecting actual rollouts in an environment is often undesirable in truly open-ended agentic problems such as mobile device control or interacting with humans, where each unit of interaction is associated with a cost. In such scenarios, a method for policy learning that can utilize off-policy experience by learning a trained action-value function is much more effective. In this paper, we develop an approach, called Digi-Q, to train VLM-based action-value Q-functions which are then used to extract the agent policy. We study our approach in the mobile device control setting. Digi-Q trains the Q-function using offline temporal-difference (TD) learning, on top of frozen, intermediate-layer features of a VLM. Compared to fine-tuning the whole VLM, this approach saves us compute and enhances scalability. To make the VLM features amenable for representing the Q-function, we need to employ an initial phase of fine-tuning to amplify coverage over actionable information needed for value function. Once trained, we use this Q-function via a Best-of-N policy extraction operator that imitates the best action out of multiple candidate actions from the current policy as ranked by the value function, enabling policy improvement without environment interaction. Digi-Q outperforms several prior methods on user-scale device control tasks in Android-in-the-Wild, attaining 21.2% improvement over prior best-performing method. In some cases, our Digi-Q approach already matches state-of-the-art RL methods that require interaction. The project is open-sourced at https://github.com/DigiRL-agent/digiq
中文摘要:Digi-Q方法通过离线时序差分学习训练基于视觉语言模型的Q函数,无需环境交互即可实现策略改进,在移动设备控制任务中取得了显著性能提升。
English Summary: The Digi-Q approach trains vision-language model-based Q-functions using offline temporal-difference learning to enable policy improvement without environment interaction, achieving significant performance gains in mobile device control tasks.

Authors:Leonardo Berti, Gjergji Kasneci
Title: TLOB: A Novel Transformer Model with Dual Attention for Price Trend Prediction with Limit Order Book Data
Abstract:
Price Trend Prediction (PTP) based on Limit Order Book (LOB) data is a fundamental challenge in financial markets. Despite advances in deep learning, existing models fail to generalize across different market conditions and assets. Surprisingly, by adapting a simple MLP-based architecture to LOB, we show that we surpass SoTA performance; thus, challenging the necessity of complex architectures. Unlike past work that shows robustness issues, we propose TLOB, a transformer-based model that uses a dual attention mechanism to capture spatial and temporal dependencies in LOB data. This allows it to adaptively focus on the market microstructure, making it particularly effective for longer-horizon predictions and volatile market conditions. We also introduce a new labeling method that improves on previous ones, removing the horizon bias. We evaluate TLOB's effectiveness across four horizons, using the established FI-2010 benchmark, a NASDAQ and a Bitcoin dataset. TLOB outperforms SoTA methods in every dataset and horizon. Additionally, we empirically show how stock price predictability has declined over time, -6.68 in F1-score, highlighting the growing market efficiency. Predictability must be considered in relation to transaction costs, so we experimented with defining trends using an average spread, reflecting the primary transaction cost. The resulting performance deterioration underscores the complexity of translating trend classification into profitable trading strategies. We argue that our work provides new insights into the evolving landscape of stock price trend prediction and sets a strong foundation for future advancements in financial AI. We release the code at https://github.com/LeonardoBerti00/TLOB.
中文: TLOB模型通过双注意力机制的Transformer架构,在限价订单簿数据中捕捉时空依赖性,超越了现有最优方法,在不同市场和资产的价格趋势预测中表现卓越。
English: The TLOB model, utilizing a dual attention transformer, surpasses state-of-the-art methods in price trend prediction by effectively capturing spatial and temporal dependencies in limit order book data across various market conditions and assets.

Authors:Joonghyuk Hahn, Hyeseon Ahn, Jungin Kim, Soohan Lim, Yo-Sub Han
Title: TCProF: Time-Complexity Prediction SSL Framework
Abstract:
Time complexity is a theoretic measure to determine the amount of time the algorithm needs for its execution. In reality, developers write algorithms into code snippets within limited resources, making the calculation of a code's time complexity a fundamental task. However, determining the precise time complexity of a code is theoretically undecidable. In response, recent advancements have leaned toward deploying datasets for code time complexity prediction and initiating preliminary experiments for this challenge. We investigate the challenge in low-resource scenarios where only a few labeled instances are given for training. Remarkably, we are the first to introduce TCProF: a Time-Complexity Prediction SSL Framework as an effective solution for code time complexity prediction in low-resource settings. TCProF significantly boosts performance by integrating our augmentation, symbolic modules, and a co-training mechanism, achieving a more than 60% improvement over self-training approaches. We further provide an extensive comparative analysis between TCProF, ChatGPT, and Gemini-Pro, offering a detailed evaluation of our approach. Our code is at https://github.com/peer0/few-shot-tc.
Chinese: TCProF框架通过整合增强模块、符号模块和协同训练机制,有效解决了低资源环境下代码时间复杂度预测的难题,相比自训练方法性能提升超过60%。
English: The TCProF framework addresses the challenge of predicting code time complexity in low-resource settings by integrating augmentation, symbolic modules, and co-training, achieving over 60% improvement compared to self-training methods.

Authors:Sewoong Oh, Himanshu Tyagi, Pramod Viswanath
Title: Training AI to be Loyal
Abstract:
Loyal AI is loyal to the community that builds it. An AI is loyal to a community if the community has ownership, alignment, and control. Community owned models can only be used with the approval of the community and share the economic rewards communally. Community aligned models have values that are aligned with the consensus of the community. Community controlled models perform functions designed by the community. Since we would like permissionless access to the loyal AI's community, we need the AI to be open source. The key scientific question then is: how can we build models that are openly accessible (open source) and yet are owned and governed by the community. This seeming impossibility is the focus of this paper where we outline a concrete pathway to Open, Monetizable and Loyal models (OML), building on our earlier work on OML, arXiv:2411.03887(1) , and a representation via a cryptographic-ML library http://github.com/sentient-agi/oml-1.0-fingerprinting .
中文: 本文提出了一个开发开放、可盈利且忠诚(OML)AI模型的框架,通过密码学与机器学习相结合,实现开源模型在社区所有权、价值观对齐和功能控制下的协同治理。
English: This paper proposes a framework for developing Open, Monetizable, and Loyal (OML) AI models that are both open-source and governed by community ownership, alignment, and control, addressing the challenge through cryptographic-ML integration.

Authors:Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Wei Wang, Xiping Hu, Steven Hoi, Edith Ngai
Title: A Survey on Multimodal Recommender Systems: Recent Advances and Future Directions
Abstract:
Acquiring valuable data from the rapidly expanding information on the internet has become a significant concern, and recommender systems have emerged as a widely used and effective tool for helping users discover items of interest. The essence of recommender systems lies in their ability to predict users' ratings or preferences for various items and subsequently recommend the most relevant ones based on historical interaction data and publicly available information. With the advent of diverse multimedia services, including text, images, video, and audio, humans can perceive the world through multiple modalities. Consequently, a recommender system capable of understanding and interpreting different modal data can more effectively refer to individual preferences. Multimodal Recommender Systems (MRS) not only capture implicit interaction information across multiple modalities but also have the potential to uncover hidden relationships between these modalities. The primary objective of this survey is to comprehensively review recent research advancements in MRS and to analyze the models from a technical perspective. Specifically, we aim to summarize the general process and main challenges of MRS from a technical perspective. We then introduce the existing MRS models by categorizing them into four key areas: Feature Extraction, Encoder, Multimodal Fusion, and Loss Function. Finally, we further discuss potential future directions for developing and enhancing MRS. This survey serves as a comprehensive guide for researchers and practitioners in MRS field, providing insights into the current state of MRS technology and identifying areas for future research. We hope to contribute to developing a more sophisticated and effective multimodal recommender system. To access more details of this paper, we open source a repository: https://github.com/Jinfeng-Xu/Awesome-Multimodal-Recommender-Systems.
Chinese: 本综述全面回顾了多模态推荐系统的最新研究进展,从技术角度分析其框架,将现有模型分为四个关键领域进行分类,并探讨了未来发展方向以提升系统性能。
English: This survey comprehensively reviews recent advances in multimodal recommender systems (MRS), analyzing their technical framework, categorizing models into four key components, and discussing future research directions to enhance their effectiveness.

Authors:Lin Wang, Weisong Wang, Xuanji Xiao, Qing Li
Title: Contrastive Learning Augmented Social Recommendations
Abstract:
Recommender systems play a pivotal role in modern content platforms, yet traditional behavior-based models often face challenges in addressing cold users with sparse interaction data. Engaging these users, however, remains critical for sustaining platform growth. To tackle this issue, we propose leveraging reconstructed social graph to complement interest representations derived from behavioral data. Despite the widespread availability of social graphs on content platforms, their utility is hindered by social-relation noise and inconsistencies between social and behavioral interests. To mitigate noise propagation in graph data and extract reliable social interests, we introduce a dual-view denoising framework. This approach first applies low-rank singular value decomposition (SVD) to the user-item interaction matrix, generating denoised user embeddings for reconstructing the social graph. It then employs contrastive learning to align the original and reconstructed social graphs. To address the discrepancy between social and behavioral interests, we utilize a mutual distillation mechanism that decomposes interests into four subcategories: aligned social/behavioral interests and social/behavioral-specific interests, enabling effective integration of the two. Empirical results demonstrate the efficacy of our method, particularly in improving recommendations for cold users, by combining social and behavioral data. The implementation of our approach is publicly available at https://github.com/WANGLin0126/CLSRec.
Chinese: 本研究提出了一种双视角去噪框架,通过重构社交图谱和对比学习,有效整合社交与行为数据并减少噪声及兴趣差异,显著提升了冷启动用户的推荐效果。
English: This study introduces a dual-view denoising framework that leverages reconstructed social graphs and contrastive learning to enhance recommendations for cold users by effectively integrating social and behavioral data while mitigating noise and interest discrepancies.

Authors:Yu Li, Bryce Wang, Xinyu Luan
Title: XPath Agent: An Efficient XPath Programming Agent Based on LLM for Web Crawler
Abstract:
We present XPath Agent, a production-ready XPath programming agent specifically designed for web crawling and web GUI testing. A key feature of XPath Agent is its ability to automatically generate XPath queries from a set of sampled web pages using a single natural language query. To demonstrate its effectiveness, we benchmark XPath Agent against a state-of-the-art XPath programming agent across a range of web crawling tasks. Our results show that XPath Agent achieves comparable performance metrics while significantly reducing token usage and improving clock-time efficiency. The well-designed two-stage pipeline allows for seamless integration into existing web crawling or web GUI testing workflows, thereby saving time and effort in manual XPath query development. The source code for XPath Agent is available at https://github.com/eavae/feilian.
XPath Agent 是一款可直接投入使用的工具,能通过自然语言自动生成 XPath 查询,在网页爬取和界面测试中不仅达到先进性能水平,还显著降低了令牌消耗并提升了时钟效率。
XPath Agent is a production-ready tool that automatically generates XPath queries from natural language, achieving comparable performance to state-of-the-art agents while reducing token usage and improving efficiency in web crawling and GUI testing workflows.

Authors:Zongkai Zhao, Guozeng Xu, Xiuhua Li, Kaiwen Wei, Jiang Zhong
Title: FLEKE: Federated Locate-then-Edit Knowledge Editing
Abstract:
Locate-then-Edit Knowledge Editing (LEKE) is a key technique for updating large language models (LLMs) without full retraining. However, existing methods assume a single-user setting and become inefficient in real-world multi-client scenarios, where decentralized organizations (e.g., hospitals, financial institutions) independently update overlapping knowledge, leading to redundant mediator knowledge vector (MKV) computations and privacy concerns. To address these challenges, we introduce Federated Locate-then-Edit Knowledge Editing (FLEKE), a novel task that enables multiple clients to collaboratively perform LEKE while preserving privacy and reducing computational overhead. To achieve this, we propose FedEdit, a two-stage framework that optimizes MKV selection and reuse. In the first stage, clients locally apply LEKE and upload the computed MKVs. In the second stage, rather than relying solely on server-based MKV sharing, FLEKE allows clients retrieve relevant MKVs based on cosine similarity, enabling knowledge re-edit and minimizing redundant computations. Experimental results on two benchmark datasets demonstrate that FedEdit retains over 96% of the performance of non-federated LEKE while significantly outperforming a FedAvg-based baseline by approximately twofold. Besides, we find that MEMIT performs more consistently than PMET in the FLEKE task with our FedEdit framework. Our code is available at https://github.com/zongkaiz/FLEKE.
Chinese: FLEKE提出了一种联邦式知识编辑方法,允许多个客户端在保护隐私的同时协作更新大语言模型,并通过优化知识向量复用显著减少冗余计算。
English: FLEKE introduces a federated approach to knowledge editing that enables multiple clients to collaboratively update LLMs while preserving privacy and reducing redundant computations through optimized MKV reuse.

Authors:Florent Bartoccioni, Elias Ramzi, Victor Besnier, Shashanka Venkataramanan, Tuan-Hung Vu, Yihong Xu, Loick Chambon, Spyros Gidaris, Serkan Odabas, David Hurych, Renaud Marlet, Alexandre Boulch, Mickael Chen, Éloi Zablocki, Andrei Bursuc, Eduardo Valle, Matthieu Cord
Title: VaViM and VaVAM: Autonomous Driving through Video Generative Modeling
Abstract:
We explore the potential of large-scale generative video models for autonomous driving, introducing an open-source auto-regressive video model (VaViM) and its companion video-action model (VaVAM) to investigate how video pre-training transfers to real-world driving. VaViM is a simple auto-regressive video model that predicts frames using spatio-temporal token sequences. We show that it captures the semantics and dynamics of driving scenes. VaVAM, the video-action model, leverages the learned representations of VaViM to generate driving trajectories through imitation learning. Together, the models form a complete perception-to-action pipeline. We evaluate our models in open- and closed-loop driving scenarios, revealing that video-based pre-training holds promise for autonomous driving. Key insights include the semantic richness of the learned representations, the benefits of scaling for video synthesis, and the complex relationship between model size, data, and safety metrics in closed-loop evaluations. We release code and model weights at https://github.com/valeoai/VideoActionModel
中文摘要:本研究提出VaViM和VaVAM两种生成式视频模型,构建了从感知到行动的完整自动驾驶流程,证明视频预训练能有效学习驾驶场景语义并通过模仿学习生成行驶轨迹。
English Summary: This study introduces VaViM and VaVAM, two generative video models that form a perception-to-action pipeline for autonomous driving, demonstrating video pre-training's effectiveness in capturing driving semantics and generating trajectories through imitation learning.

Authors:Qi Le, Enmao Diao, Ziyan Wang, Xinran Wang, Jie Ding, Li Yang, Ali Anwar
Title: Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing
Abstract:
We introduce Probe Pruning (PP), a novel framework for online, dynamic, structured pruning of Large Language Models (LLMs) applied in a batch-wise manner. PP leverages the insight that not all samples and tokens contribute equally to the model's output, and probing a small portion of each batch effectively identifies crucial weights, enabling tailored dynamic pruning for different batches. It comprises three main stages: probing, history-informed pruning, and full inference. In the probing stage, PP selects a small yet crucial set of hidden states, based on residual importance, to run a few model layers ahead. During the history-informed pruning stage, PP strategically integrates the probing states with historical states. Subsequently, it structurally prunes weights based on the integrated states and the PP importance score, a metric developed specifically to assess the importance of each weight channel in maintaining performance. In the final stage, full inference is conducted on the remaining weights. A major advantage of PP is its compatibility with existing models, as it operates without requiring additional neural network modules or fine-tuning. Comprehensive evaluations of PP on LLaMA-2/3 and OPT models reveal that even minimal probing-using just 1.5% of FLOPs-can substantially enhance the efficiency of structured pruning of LLMs. For instance, when evaluated on LLaMA-2-7B with WikiText2, PP achieves a 2.56 times lower ratio of performance degradation per unit of runtime reduction compared to the state-of-the-art method at a 40% pruning ratio. Our code is available at https://github.com/Qi-Le1/Probe_Pruning.
中文: 探针剪枝是一种动态批量剪枝框架,通过选择性探测关键标记和权重来优化大语言模型的结构化剪枝,无需额外模块或微调即可显著提升效率。
English: Probe Pruning is a dynamic, batch-wise framework that enhances structured pruning of Large Language Models by selectively probing key tokens and weights, significantly improving efficiency without extra modules or fine-tuning.

Authors:Xiangtong Yao, Yirui Zhou, Yuan Meng, Liangyu Dong, Lin Hong, Zitao Zhang, Zhenshan Bing, Kai Huang, Fuchun Sun, Alois Knoll
Title: Pick-and-place Manipulation Across Grippers Without Retraining: A Learning-optimization Diffusion Policy Approach
Abstract:
Current robotic pick-and-place policies typically require consistent gripper configurations across training and inference. This constraint imposes high retraining or fine-tuning costs, especially for imitation learning-based approaches, when adapting to new end-effectors. To mitigate this issue, we present a diffusion-based policy with a hybrid learning-optimization framework, enabling zero-shot adaptation to novel grippers without additional data collection for retraining policy. During training, the policy learns manipulation primitives from demonstrations collected using a base gripper. At inference, a diffusion-based optimization strategy dynamically enforces kinematic and safety constraints, ensuring that generated trajectories align with the physical properties of unseen grippers. This is achieved through a constrained denoising procedure that adapts trajectories to gripper-specific parameters (e.g., tool-center-point offsets, jaw widths) while preserving collision avoidance and task feasibility. We validate our method on a Franka Panda robot across six gripper configurations, including 3D-printed fingertips, flexible silicone gripper, and Robotiq 2F-85 gripper. Our approach achieves a 93.3% average task success rate across grippers (vs. 23.3-26.7% for diffusion policy baselines), supporting tool-center-point variations of 16-23.5 cm and jaw widths of 7.5-11.5 cm. The results demonstrate that constrained diffusion enables robust cross-gripper manipulation while maintaining the sample efficiency of imitation learning, eliminating the need for gripper-specific retraining. Video and code are available at https://github.com/yaoxt3/GADP.
中文: 本研究提出了一种基于扩散模型的混合学习优化策略,无需重新训练即可实现零样本适应新型夹具,在多种夹具配置下平均任务成功率高达93.3%。
English: This study introduces a diffusion-based policy with a hybrid learning-optimization framework that enables zero-shot adaptation to novel grippers without retraining, achieving a 93.3% average task success rate across diverse gripper configurations.

Authors:Jixiu Zhai, Zikun Wang, Tianchi Lu, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Shengrui Xu, Jingwan Wang, Dan Huang
Title: A general language model for peptide identification
Abstract:
Accurate identification of bioactive peptides (BPs) and protein post-translational modifications (PTMs) is essential for understanding protein function and advancing therapeutic discovery. However, most computational methods remain limited in their generalizability across diverse peptide functions. Here, we present PDeepPP, a unified deep learning framework that integrates pretrained protein language models with a hybrid transformer-convolutional architecture, enabling robust identification across diverse peptide classes and PTM sites. We curated comprehensive benchmark datasets and implemented strategies to address data imbalance, allowing PDeepPP to systematically extract both global and local sequence features. Through extensive analyses-including dimensionality reduction and comparison studies-PDeepPP demonstrates strong, interpretable peptide representations and achieves state-of-the-art performance in 25 of the 33 biological identification tasks. Notably, PDeepPP attains high accuracy in antimicrobial (0.9726) and phosphorylation site (0.9984) identification, with 99.5% specificity in glycosylation site prediction and substantial reduction in false negatives in antimalarial tasks. By enabling large-scale, accurate peptide analysis, PDeepPP supports biomedical research and the discovery of novel therapeutic targets for disease treatment. All code, datasets, and pretrained models are publicly available via GitHub:https://github.com/fondress/PDeepPP and Hugging Face:https://huggingface.co/fondress/PDeppPP.
中文: PDeepPP是一个统一的深度学习框架,结合预训练蛋白质语言模型与混合Transformer-卷积架构,在多种生物识别任务中实现了最先进的性能,能准确识别各类生物活性肽和翻译后修饰位点。
English: PDeepPP is a unified deep learning framework that integrates pretrained protein language models with a hybrid transformer-convolutional architecture, achieving state-of-the-art performance in identifying diverse bioactive peptides and post-translational modifications across multiple biological tasks.

Authors:Wenhao Zhu, Pinzhen Chen, Hanxu Hu, Shujian Huang, Fei Yuan, Jiajun Chen, Alexandra Birch
Title: Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning
Abstract:
Long-context modelling for large language models (LLMs) has been a key area of recent research because many real world use cases require reasoning over longer inputs such as documents. The focus of research into modelling long context has been on how to model position and there has been little investigation into other important aspects of language modelling such as instruction tuning. Long context training examples are challenging and expensive to create and use. In this paper, we investigate how to design instruction data for the post-training phase of a long context pre-trained model: how much and what type of context is needed for optimal and efficient post-training. Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones, while also identifying other critical factors such as instruction difficulty and context composition. Based on these findings, we propose context synthesis, a novel data synthesis framework that leverages off-the-shelf LLMs to generate extended background contexts for high-quality instruction-answer pairs. Experiment results on the document-level benchmark (LongBench) demonstrate that our proposed approach outperforms previous instruction synthesis approaches and comes close to the performance of human-annotated long-context instruction data. The project will be available at: https://github.com/NJUNLP/context-synthesis.
中文: 研究发现,基于短文本指令微调的大语言模型能有效泛化至长文本处理,并提出通过数据合成框架生成扩展背景语境的新方法,在长文本基准测试中接近人类标注数据的性能。
English: This study finds that instruction-tuning large language models on short contexts enables effective generalization to longer ones and introduces a novel data synthesis framework that generates extended background contexts, achieving near-human performance on long-context benchmarks.

Authors:Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang
Title: LightThinker: Thinking Step-by-Step Compression
Abstract:
Large language models (LLMs) have shown remarkable performance in complex reasoning tasks, but their efficiency is hindered by the substantial memory and computational costs associated with generating lengthy tokens. In this paper, we propose LightThinker, a novel method that enables LLMs to dynamically compress intermediate thoughts during reasoning. Inspired by human cognitive processes, LightThinker compresses verbose thought steps into compact representations and discards the original reasoning chains, thereby significantly reducing the number of tokens stored in the context window. This is achieved by training the model on when and how to perform compression through data construction, mapping hidden states to condensed gist tokens, and creating specialized attention masks. Additionally, we introduce the Dependency (Dep) metric to quantify the degree of compression by measuring the reliance on historical tokens during generation. Extensive experiments on four datasets and two models show that LightThinker reduces peak memory usage and inference time, while maintaining competitive accuracy. Our work provides a new direction for improving the efficiency of LLMs in complex reasoning tasks without sacrificing performance. Code is released at https://github.com/zjunlp/LightThinker.
中文摘要:LightThinker是一种创新方法,通过将推理过程中的中间思维动态压缩为紧凑表征,在保持准确性的同时显著降低内存使用和推理时间。
English Summary: LightThinker is a novel method that enhances LLM efficiency by dynamically compressing intermediate reasoning steps into compact representations, reducing memory usage and inference time while preserving accuracy.

Authors:Pengcheng Huang, Zhenghao Liu, Yukun Yan, Haiyan Zhao, Xiaoyuan Yi, Hao Chen, Zhiyuan Liu, Maosong Sun, Tong Xiao, Ge Yu, Chenyan Xiong
Title: ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation
Abstract:
Large language models (LLMs) integrated with retrieval-augmented generation (RAG) have improved factuality by grounding outputs in external evidence. However, they remain susceptible to unfaithful generation, where outputs contradict retrieved context despite its relevance and accuracy. Existing approaches aiming to improve faithfulness primarily focus on enhancing the utilization of external context, but often overlook the persistent influence of internal parametric knowledge during generation. In this work, we investigate the internal mechanisms behind unfaithful generation and identify a subset of mid-to-deep feed-forward networks (FFNs) that are disproportionately activated in such cases. Building on this insight, we propose Parametric Knowledge Muting through FFN Suppression (ParamMute), a framework that improves contextual faithfulness by suppressing the activation of unfaithfulness-associated FFNs and calibrating the model toward retrieved knowledge. To evaluate our approach, we introduce CoFaithfulQA, a benchmark specifically designed to evaluate faithfulness in scenarios where internal knowledge conflicts with accurate external evidence. Experimental results show that ParamMute significantly enhances faithfulness across both CoFaithfulQA and the established ConFiQA benchmark, achieving substantial reductions in reliance on parametric memory. These findings underscore the importance of mitigating internal knowledge dominance and provide a new direction for improving LLM trustworthiness in RAG. All codes are available at https://github.com/OpenBMB/ParamMute.
中文: 本研究提出ParamMute框架,通过抑制大型语言模型中特定前馈网络的激活来降低对内部参数的依赖,从而提升检索证据的忠实度,并在新旧基准测试中取得了显著改进效果。
English: The study introduces ParamMute, a framework that suppresses specific feed-forward networks in large language models to reduce reliance on internal knowledge and enhance faithfulness to retrieved evidence, demonstrating significant improvements on new and existing benchmarks.

Authors:Ragnar Groot Koerkamp
Title: PtrHash: Minimal Perfect Hashing at RAM Throughput
Abstract:
Given a set $K$ of $n$ keys, a minimal perfect hash function (MPHF) is a collision-free bijective map $\mathsf{H_{mphf}}$ from $K$ to $\{0, \dots, n-1\}$. This work presents a (minimal) perfect hash function that first prioritizes query throughput, while also allowing efficient construction for $10^9$ or more elements using 2.4 bits of memory per key. Both PTHash and PHOBIC first map all $n$ keys to $n/λ< n$ buckets. Then, each bucket stores a pilot that controls the final hash value of the keys mapping to it. PtrHash builds on this by using 1) fixed-width (uncompressed) 8-bit pilots, 2) a construction algorithm similar to cuckoo-hashing to find suitable pilot values. Further, it 3) uses the same number of buckets and slots for each part, with 4) a single remap table to map intermediate positions $\geq n$ to $ 中文:PtrHash 是一种高效的最小完美哈希函数,在仅使用每个键 2.0 比特内存的情况下,既能实现快速查询,又支持大规模数据集的快速构建,其流式查询性能接近内存访问极限。
English: PtrHash is a highly efficient minimal perfect hash function that achieves rapid query speeds and supports fast construction for large datasets using only 2.0 bits per key, with streaming queries reaching near-memory-access performance.

Authors:Ya Wang, Zhijian Zhuo, Yutao Zeng, Xun Zhou, Jian Yang, Xiaoqing Li
Title: Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models
Abstract:
Training stability is a persistent challenge in the pre-training of large language models (LLMs), particularly for architectures such as Post-Norm Transformers, which are prone to gradient explosion and dissipation. In this paper, we propose Scale-Distribution Decoupling (SDD), a novel approach that stabilizes training by explicitly decoupling the scale and distribution of the weight matrix in fully-connected layers. SDD applies a normalization mechanism to regulate activations and a learnable scaling vector to maintain well-conditioned gradients, effectively preventing $\textbf{gradient explosion and dissipation}$. This separation improves optimization efficiency, particularly in deep networks, by ensuring stable gradient propagation. Experimental results demonstrate that our method stabilizes training across various LLM architectures and outperforms existing techniques in different normalization configurations. Furthermore, the proposed method is lightweight and compatible with existing frameworks, making it a practical solution for stabilizing LLM training. Code is available at https://github.com/kaihemo/SDD.
中文: 本文提出的尺度分布解耦方法通过分离全连接层权重矩阵的尺度与分布来稳定大语言模型训练,有效防止梯度爆炸与消散,在不同架构中均表现出优越性能。
English: This paper introduces Scale-Distribution Decoupling (SDD), a lightweight method that stabilizes large language model training by separating weight matrix scale and distribution to prevent gradient issues, showing superior performance across architectures.

Authors:Kai Liu, Dehui Wang, Zhiteng Li, Zheng Chen, Yong Guo, Wenbo Li, Linghe Kong, Yulun Zhang
Title: CondiQuant: Condition Number Based Low-Bit Quantization for Image Super-Resolution
Abstract:
Low-bit model quantization for image super-resolution (SR) is a longstanding task that is renowned for its surprising compression and acceleration ability. However, accuracy degradation is inevitable when compressing the full-precision (FP) model to ultra-low bit widths (2~4 bits). Experimentally, we observe that the degradation of quantization is mainly attributed to the quantization of activation instead of model weights. In numerical analysis, the condition number of weights could measure how much the output value can change for a small change in the input argument, inherently reflecting the quantization error. Therefore, we propose CondiQuant, a condition number based low-bit post-training quantization for image super-resolution. Specifically, we formulate the quantization error as the condition number of weight metrics. By decoupling the representation ability and the quantization sensitivity, we design an efficient proximal gradient descent algorithm to iteratively minimize the condition number and maintain the output still. With comprehensive experiments, we demonstrate that CondiQuant outperforms existing state-of-the-art post-training quantization methods in accuracy without computation overhead and gains the theoretically optimal compression ratio in model parameters. Our code and model are released at https://github.com/Kai-Liu001/CondiQuant.
Chinese: CondiQuant提出了一种基于条件数的图像超分辨率后训练量化方法,通过最小化量化误差,在无计算开销的情况下有效缓解了超低比特模型的精度下降问题。
English: CondiQuant introduces a condition number-based post-training quantization method for image super-resolution, effectively reducing accuracy degradation in ultra-low bit models by minimizing quantization errors without computational overhead.

Authors:Jinda Liu, Yi Chang, Yuan Wu
Title: R-LoRA: Randomized Multi-Head LoRA for Efficient Multi-Task Learning
Abstract:
Fine-tuning large language models (LLMs) is computationally expensive, and Low-Rank Adaptation (LoRA) provides a cost-effective solution by approximating weight updates through low-rank matrices. In real-world scenarios, LLMs are fine-tuned on data from multiple domains to perform tasks across various fields, embodying multi-task learning (MTL). LoRA often underperforms in such complex scenarios. To enhance LoRA's capability in multi-task learning, we propose R-LoRA, which incorporates Multi-Head Randomization. Multi-Head Randomization diversifies the head matrices through Multi-Head Dropout and Multi-Head Random Initialization, enabling more efficient learning of task-specific features while maintaining shared knowledge representation. Our approach not only improves performance in MTL but also reduces GPU memory usage and training time. Experiments show that R-LoRA's gains stem from increased diversity in the head matrices, demonstrating its effectiveness for multi-task learning. The code is available at https://github.com/jinda-liu/R-LoRA
中文: R-LoRA通过引入多头随机化技术增强多头矩阵的多样性,在提升多任务学习性能的同时有效降低了计算资源消耗。
English: R-LoRA enhances multi-task learning by introducing Multi-Head Randomization to diversify head matrices, improving performance while reducing computational costs.

Authors:Yuan Sun
Title: Binary-Integer-Programming Based Algorithm for Expert Load Balancing in Mixture-of-Experts Models
Abstract:
For pre-training of MoE (Mixture-of-Experts) models, one of the main issues is unbalanced expert loads, which may cause routing collapse or increased computational overhead. Existing methods contain the Loss-Controlled method and the Loss-Free method, where both the unbalanced degrees at first several training steps are still high and decrease slowly. In this work, we propose BIP-Based Balancing, an expert load balancing algorithm based on binary integer programming (BIP). The algorithm maintains an additional vector q on each MoE layer that can help change the top-K order of s by solving a binary integer programming with very small time costs. We implement the algorithm on two MoE language models: 16-expert (0.3B) and 64-expert (1.1B). The experimental results show that on both models comparing with the Loss-Controlled method and the Loss-Free method, our algorithm trains models with the lowest perplexities, while saves at least 13% of pre-training time compared with the Loss-Controlled method. Within our current knowledge, this is the first routing algorithm that achieves maintaining load balance status on every expert in every MoE layer from the first step to the last step during the whole pre-training process, while the trained MoE models also perform well. The code material of this work is available at https://github.com/sunyuanLLM/bip_routing_algorithm.
中文: 本文提出了一种基于二进制整数规划的专家负载均衡算法,该算法从训练第一步起即可解决MoE模型中专家负载不均衡的问题,在获得最低困惑度的同时,相比现有方法至少节省13%的预训练时间。
English: This paper introduces a BIP-based expert load balancing algorithm that effectively resolves unbalanced expert loads in MoE models from the first training step, achieving the lowest perplexities and reducing pre-training time by at least 13% compared to existing methods.

Authors:Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Lav R. Varshney, Praneeth Vepakomma
Title: Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning
Abstract:
Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning foundation models. However, federated fine-tuning using LoRA is challenging due to suboptimal updates arising from traditional federated averaging of individual adapters. Existing solutions either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from performance degradation due to limited expressivity. We introduce Federated Silver Bullet (Fed-SB), a novel approach for federated fine-tuning of LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB optimally aligns the optimization trajectory with the ideal low-rank full fine-tuning projection by learning a small square matrix (R) between adapters B and A, keeping other components fixed. Direct averaging of R guarantees exact updates, substantially reducing communication cost, which remains independent of the number of clients, and enables scalability. Fed-SB achieves state-of-the-art performance across commonsense reasoning, arithmetic reasoning, and language inference tasks while reducing communication costs by up to 230x. In private settings, Fed-SB further improves performance by (1) reducing trainable parameters, thereby lowering the noise required for differential privacy and (2) avoiding noise amplification introduced by other methods. Overall, Fed-SB establishes a new Pareto frontier in the tradeoff between communication and performance, offering an efficient and scalable solution for both private and non-private federated fine-tuning. Our code is publicly available at https://github.com/CERT-Lab/fed-sb.
中文: Fed-SB 提出了一种基于 LoRA-SB 的高效联邦微调方法,通过学习小型矩阵实现精确更新,通信成本降低高达 230 倍,并在多项推理任务中达到最优性能。
English: Fed-SB introduces an efficient federated fine-tuning method using LoRA-SB, which reduces communication costs by up to 230x and achieves top performance across reasoning tasks by learning a small matrix for exact updates.

Authors:Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Lav R. Varshney, Praneeth Vepakomma
Title: Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning
Abstract:
Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning foundation models. However, federated fine-tuning using LoRA is challenging due to suboptimal updates arising from traditional federated averaging of individual adapters. Existing solutions either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from performance degradation due to limited expressivity. We introduce Federated Silver Bullet (Fed-SB), a novel approach for federated fine-tuning of LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB optimally aligns the optimization trajectory with the ideal low-rank full fine-tuning projection by learning a small square matrix (R) between adapters B and A, keeping other components fixed. Direct averaging of R guarantees exact updates, substantially reducing communication cost, which remains independent of the number of clients, and enables scalability. Fed-SB achieves state-of-the-art performance across commonsense reasoning, arithmetic reasoning, and language inference tasks while reducing communication costs by up to 230x. In private settings, Fed-SB further improves performance by (1) reducing trainable parameters, thereby lowering the noise required for differential privacy and (2) avoiding noise amplification introduced by other methods. Overall, Fed-SB offers a state-of-the-art, efficient, and scalable solution for both private and non-private federated fine-tuning. Our code is publicly available at: https://github.com/CERT-Lab/fed-sb.
中文: Fed-SB 提出了一种基于 LoRA-SB 的高效联邦微调方法,通过学习小型矩阵实现精确更新,通信成本降低高达 230 倍,并在多项推理任务中达到最优性能。
English: Fed-SB introduces an efficient federated fine-tuning method using LoRA-SB, which reduces communication costs by up to 230x and achieves top performance across reasoning tasks by learning a small matrix for exact updates.

Authors:Giulio Zizzo, Giandomenico Cornacchia, Kieran Fraser, Muhammad Zaid Hameed, Ambrish Rawat, Beat Buesser, Mark Purcell, Pin-Yu Chen, Prasanna Sattigeri, Kush Varshney
Title: Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs
Abstract:
As large language models (LLMs) become integrated into everyday applications, ensuring their robustness and security is increasingly critical. In particular, LLMs can be manipulated into unsafe behaviour by prompts known as jailbreaks. The variety of jailbreak styles is growing, necessitating the use of external defences known as guardrails. While many jailbreak defences have been proposed, not all defences are able to handle new out-of-distribution attacks due to the narrow segment of jailbreaks used to align them. Moreover, the lack of systematisation around defences has created significant gaps in their practical application. In this work, we perform systematic benchmarking across 15 different defences, considering a broad swathe of malicious and benign datasets. We find that there is significant performance variation depending on the style of jailbreak a defence is subject to. Additionally, we show that based on current datasets available for evaluation, simple baselines can display competitive out-of-distribution performance compared to many state-of-the-art defences. Code is available at https://github.com/IBM/Adversarial-Prompt-Evaluation.
Chinese: 随着大语言模型面临日益多样化的越狱攻击威胁,本研究系统评估了15种防御方法,发现其性能存在显著差异,并证明简单基线方法在应对分布外攻击时能与前沿防御技术相媲美。
English: As large language models face growing threats from diverse jailbreak attacks, this study systematically benchmarks 15 defense methods and reveals significant performance variations, demonstrating that simple baselines can match state-of-the-art defenses against out-of-distribution attacks.

Authors:Sanghee Park, Geewook Kim
Title: Evaluating Multimodal Generative AI with Korean Educational Standards
Abstract:
This paper presents the Korean National Educational Test Benchmark (KoNET), a new benchmark designed to evaluate Multimodal Generative AI Systems using Korean national educational tests. KoNET comprises four exams: the Korean Elementary General Educational Development Test (KoEGED), Middle (KoMGED), High (KoHGED), and College Scholastic Ability Test (KoCSAT). These exams are renowned for their rigorous standards and diverse questions, facilitating a comprehensive analysis of AI performance across different educational levels. By focusing on Korean, KoNET provides insights into model performance in less-explored languages. We assess a range of models - open-source, open-access, and closed APIs - by examining difficulties, subject diversity, and human error rates. The code and dataset builder will be made fully open-sourced at https://github.com/naver-ai/KoNET.
中文: KoNET基准通过韩国四个级别的国家级教育考试,全面评估多模态生成式AI系统在较少研究语言中的表现,涵盖多样化的学科和难度。
English: The KoNET benchmark evaluates multimodal generative AI systems using rigorous Korean national educational tests across four levels to analyze performance in less-explored languages and diverse subjects.

Authors:Xuetao Ma, Wenbin Jiang, Hua Huang
Title: Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning
Abstract:
In-context learning (ICL) can significantly enhance the complex reasoning capabilities of large language models (LLMs), with the key lying in the selection and ordering of demonstration examples. Previous methods typically relied on simple features to measure the relevance between examples. We argue that these features are not sufficient to reflect the intrinsic connections between examples. In this study, we propose a curriculum ICL strategy guided by problem-solving logic. We select demonstration examples by analyzing the problem-solving logic and order them based on curriculum learning. Specifically, we constructed a problem-solving logic instruction set based on the BREAK dataset and fine-tuned a language model to analyze the problem-solving logic of examples. Subsequently, we selected appropriate demonstration examples based on problem-solving logic and assessed their difficulty according to the number of problem-solving steps. In accordance with the principles of curriculum learning, we ordered the examples from easy to hard to serve as contextual prompts. Experimental results on multiple benchmarks indicate that our method outperforms previous ICL approaches in terms of performance and efficiency, effectively enhancing the complex reasoning capabilities of LLMs. Our project will be released at https://github.com/maxuetao/CurriculumICL
中文: 本研究提出了一种基于解题逻辑的课程上下文学习策略,通过分析解题步骤选择示例并按难度排序,有效提升了大型语言模型在复杂推理任务中的表现与效率。
English: This study introduces a curriculum in-context learning strategy that selects and orders demonstration examples based on problem-solving logic and difficulty, significantly improving the performance and efficiency of large language models in complex reasoning tasks.

Authors:Remko Proesmans, Ward Goossens, Lowiek Van den Stockt, Lowie Christiaen, Francis wyffels
Title: Self-Mixing Laser Interferometry for Robotic Tactile Sensing
Abstract:
Self-mixing interferometry (SMI) has been lauded for its sensitivity in detecting microvibrations, while requiring no physical contact with its target. In robotics, microvibrations have traditionally been interpreted as a marker for object slip, and recently as a salient indicator of extrinsic contact. We present the first-ever robotic fingertip making use of SMI for slip and extrinsic contact sensing. The design is validated through measurement of controlled vibration sources, both before and after encasing the readout circuit in its fingertip package. Then, the SMI fingertip is compared to acoustic sensing through four experiments. The results are distilled into a technology decision map. SMI was found to be more sensitive to subtle slip events and significantly more resilient against ambient noise. We conclude that the integration of SMI in robotic fingertips offers a new, promising branch of tactile sensing in robotics. Design and data files are available at https://github.com/RemkoPr/icra2025-SMI-tactile-sensing.
Chinese: 本研究首次利用自混合干涉技术开发了用于检测滑动和外部接触的机器人指尖,相比声学传感,其对细微滑动事件更敏感且抗环境噪声能力显著更强。
English: This study introduces the first robotic fingertip utilizing self-mixing interferometry (SMI) for detecting slip and extrinsic contact, demonstrating superior sensitivity to subtle slip events and greater resilience against ambient noise compared to acoustic sensing.

Authors:Longde Huang, Oleksandr Balabanov, Hampus Linander, Mats Granath, Daniel Persson, Jan E. Gerken
Title: Learning Chern Numbers of Topological Insulators with Gauge Equivariant Neural Networks
Abstract:
Equivariant network architectures are a well-established tool for predicting invariant or equivariant quantities. However, almost all learning problems considered in this context feature a global symmetry, i.e. each point of the underlying space is transformed with the same group element, as opposed to a local ``gauge'' symmetry, where each point is transformed with a different group element, exponentially enlarging the size of the symmetry group. Gauge equivariant networks have so far mainly been applied to problems in quantum chromodynamics. Here, we introduce a novel application domain for gauge-equivariant networks in the theory of topological condensed matter physics. We use gauge equivariant networks to predict topological invariants (Chern numbers) of multiband topological insulators. The gauge symmetry of the network guarantees that the predicted quantity is a topological invariant. We introduce a novel gauge equivariant normalization layer to stabilize the training and prove a universal approximation theorem for our setup. We train on samples with trivial Chern number only but show that our models generalize to samples with non-trivial Chern number. We provide various ablations of our setup. Our code is available at https://github.com/sitronsea/GENet/tree/main.
中文摘要:本研究将规范等变神经网络创新应用于拓扑凝聚态物理领域,通过规范对称性保证预测的拓扑不变量(陈数)准确性,并在仅使用平凡陈数样本训练的情况下实现了对非平凡陈数样本的泛化预测。
English Summary: The study introduces a novel application of gauge-equivariant neural networks in topological condensed matter physics, enabling accurate prediction of Chern numbers for topological insulators with guaranteed topological invariance through gauge symmetry.

Authors:Xuyang Wu, Jinming Nian, Ting-Ruen Wei, Zhiqiang Tao, Hsin-Tai Wu, Yi Fang
Title: Does Reasoning Introduce Bias? A Study of Social Bias Evaluation and Mitigation in LLM Reasoning
Abstract:
Recent advances in large language models (LLMs) have enabled automatic generation of chain-of-thought (CoT) reasoning, leading to strong performance on tasks such as math and code. However, when reasoning steps reflect social stereotypes (e.g., those related to gender, race or age), they can reinforce harmful associations and lead to misleading conclusions. We present the first systematic evaluation of social bias within LLM-generated reasoning, focusing on reasoning language models (e.g., DeepSeek-R1, OpenAI o1) that natively produce reasoning chains as part of their answers. Using the BBQ dataset, we analyze both prediction accuracy and reasoning bias across a broad spectrum of models, including instruction-tuned and CoT-augmented variants of DeepSeek-R1 (8B/32B), ChatGPT, and other open-source LLMs. We quantify how biased reasoning steps correlate with incorrect predictions and often lead to stereotype expression. To mitigate reasoning-induced bias, we propose Answer Distribution as Bias Proxy (ADBP), a lightweight mitigation method that detects bias by tracking how model predictions change across incremental reasoning steps. ADBP outperforms Stereotype-free Reasoning Pattern (SfRP) baseline in most cases, mitigating bias and improving the accuracy of LLM outputs. Evaluation and mitigation code is available at https://github.com/elviswxy/LLM_reasoning_bias.
中文: 大型语言模型的最新进展实现了自动思维链推理,但此类推理可能强化有害的社会刻板印象,导致偏见结论;本研究首次系统评估了推理链中的社会偏见,并提出ADBP这一轻量级缓解方法,通过追踪推理步骤中的预测变化来检测偏见,在多数情况下优于基线方法。
English: Recent advances in large language models enable automatic chain-of-thought reasoning, but such reasoning can reinforce harmful social stereotypes, leading to biased conclusions; this study presents the first systematic evaluation of social bias in reasoning chains and proposes ADBP, a lightweight mitigation method that detects bias through prediction changes across reasoning steps, outperforming baseline approaches.

Authors:Kefan Wang, Hao Wang, Kenan Song, Wei Guo, Kai Cheng, Zhi Li, Yong Liu, Defu Lian, Enhong Chen
Title: A Universal Framework for Compressing Embeddings in CTR Prediction
Abstract:
Accurate click-through rate (CTR) prediction is vital for online advertising and recommendation systems. Recent deep learning advancements have improved the ability to capture feature interactions and understand user interests. However, optimizing the embedding layer often remains overlooked. Embedding tables, which represent categorical and sequential features, can become excessively large, surpassing GPU memory limits and necessitating storage in CPU memory. This results in high memory consumption and increased latency due to frequent GPU-CPU data transfers. To tackle these challenges, we introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings, without sacrificing recommendation quality. Our approach consists of two stages: first, we apply popularity-weighted regularization to balance code distribution between high- and low-frequency features. Then, we integrate a contrastive learning mechanism to ensure a uniform distribution of quantized codes, enhancing the distinctiveness of embeddings. Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance compared to existing models. The implementation code is accessible in our project repository https://github.com/USTC-StarTeam/MEC.
中文:MEC框架通过量化和对比学习机制压缩嵌入表,在保持推荐质量的同时将内存使用降低超过50倍。
English: The MEC framework effectively compresses embedding tables through quantization and contrastive learning, reducing memory usage by over 50 times while preserving recommendation quality.

Authors:Feiyang Chen, Yu Cheng, Lei Wang, Yuqing Xia, Ziming Miao, Lingxiao Ma, Fan Yang, Jilong Xue, Zhi Yang, Mao Yang, Haibo Chen
Title: AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms
Abstract:
Transformers and large language models (LLMs) have revolutionized machine learning, with attention mechanisms at the core of their success. As the landscape of attention variants expands, so too do the challenges of optimizing their performance, particularly across different hardware platforms. Current optimization strategies are often narrowly focused, requiring extensive manual intervention to accommodate changes in model configurations or hardware environments. In this paper, we introduce AttentionEngine, a comprehensive framework designed to streamline the optimization of attention mechanisms across heterogeneous hardware backends. By decomposing attention computation into modular operations with customizable components, AttentionEngine enables flexible adaptation to diverse algorithmic requirements. The framework further automates kernel optimization through a combination of programmable templates and a robust cross-platform scheduling strategy. Empirical results reveal performance gains of up to 10x on configurations beyond the reach of existing methods. AttentionEngine offers a scalable, efficient foundation for developing and deploying attention mechanisms with minimal manual tuning. Our code has been open-sourced and is available at https://github.com/microsoft/AttentionEngine.
Chinese: AttentionEngine 是一个综合性框架,可在多样化硬件平台上自动优化注意力机制,实现高达10倍的性能提升,同时极大减少了人工调优需求。
English: AttentionEngine is a comprehensive framework that automates and optimizes attention mechanisms across diverse hardware platforms, achieving up to 10x performance improvements with minimal manual intervention.

Authors:Jinyu Zhang, Chao Li, Zhongying Zhao
Title: Lightweight yet Efficient: An External Attentive Graph Convolutional Network with Positional Prompts for Sequential Recommendation
Abstract:
Graph-based Sequential Recommender systems (GSRs) have gained significant research attention due to their ability to simultaneously handle user-item interactions and sequential relationships between items. Current GSRs often utilize composite or in-depth structures for graph encoding (e.g., the Graph Transformer). Nevertheless, they have high computational complexity, hindering the deployment on resource-constrained edge devices. Moreover, the relative position encoding in Graph Transformer has difficulty in considering the complicated positional dependencies within sequence. To this end, we propose an External Attentive Graph convolutional network with Positional prompts for Sequential recommendation, namely EA-GPS. Specifically, we first introduce an external attentive graph convolutional network that linearly measures the global associations among nodes via two external memory units. Then, we present a positional prompt-based decoder that explicitly treats the absolute item positions as external prompts. By introducing length-adaptive sequential masking and a soft attention network, such a decoder facilitates the model to capture the long-term positional dependencies and contextual relationships within sequences. Extensive experimental results on five real-world datasets demonstrate that the proposed EA-GPS outperforms the state-of-the-art methods. Remarkably, it achieves the superior performance while maintaining a smaller parameter size and lower training overhead. The implementation of this work is publicly available at https://github.com/ZZY-GraphMiningLab/EA-GPS.
中文摘要:提出的EA-GPS模型通过外部注意力图卷积网络和位置提示解码器,解决了图序列推荐系统计算复杂和位置依赖建模的难题,在减少参数量的同时实现了更优性能。
English Summary: The proposed EA-GPS model introduces an external attentive graph convolutional network and positional prompt decoder to address computational complexity and positional dependency limitations in graph-based sequential recommenders, achieving superior performance with reduced parameters.

Authors:Luzhou Ge, Xiangyu Zhu, Zhuo Yang, Xuesong Li
Title: DynamicGSG: Dynamic 3D Gaussian Scene Graphs for Environment Adaptation
Abstract:
In real-world scenarios, environment changes caused by human or agent activities make it extremely challenging for robots to perform various long-term tasks. Recent works typically struggle to effectively understand and adapt to dynamic environments due to the inability to update their environment representations in memory according to environment changes and lack of fine-grained reconstruction of the environments. To address these challenges, we propose DynamicGSG, a dynamic, high-fidelity, open-vocabulary scene graph construction system leveraging Gaussian splatting. DynamicGSG builds hierarchical scene graphs using advanced vision language models to represent the spatial and semantic relationships between objects in the environments, utilizes a joint feature loss we designed to supervise Gaussian instance grouping while optimizing the Gaussian maps, and locally updates the Gaussian scene graphs according to real environment changes for long-term environment adaptation. Experiments and ablation studies demonstrate the performance and efficacy of our proposed method in terms of semantic segmentation, language-guided object retrieval, and reconstruction quality. Furthermore, we validate the dynamic updating capabilities of our system in real laboratory environments. The source code and supplementary experimental materials will be released at:~\href{https://github.com/GeLuzhou/Dynamic-GSG}{https://github.com/GeLuzhou/Dynamic-GSG}.
中文: DynamicGSG是一种利用高斯点云和视觉语言模型构建动态高保真场景图的新系统,通过分层表示和局部更新使机器人能够适应环境变化,从而有效执行长期任务。
English: DynamicGSG is a novel system that constructs dynamic, high-fidelity scene graphs using Gaussian splatting and vision language models, enabling robots to adapt to environmental changes through hierarchical representations and local updates for long-term task performance.

Authors:Jiebin Yan, Ziwen Tan, Yuming Fang, Junjie Chen, Wenhui Jiang, Zhou Wang
Title: Omnidirectional Image Quality Captioning: A Large-scale Database and A New Model
Abstract:
The fast growing application of omnidirectional images calls for effective approaches for omnidirectional image quality assessment (OIQA). Existing OIQA methods have been developed and tested on homogeneously distorted omnidirectional images, but it is hard to transfer their success directly to the heterogeneously distorted omnidirectional images. In this paper, we conduct the largest study so far on OIQA, where we establish a large-scale database called OIQ-10K containing 10,000 omnidirectional images with both homogeneous and heterogeneous distortions. A comprehensive psychophysical study is elaborated to collect human opinions for each omnidirectional image, together with the spatial distributions (within local regions or globally) of distortions, and the head and eye movements of the subjects. Furthermore, we propose a novel multitask-derived adaptive feature-tailoring OIQA model named IQCaption360, which is capable of generating a quality caption for an omnidirectional image in a manner of textual template. Extensive experiments demonstrate the effectiveness of IQCaption360, which outperforms state-of-the-art methods by a significant margin on the proposed OIQ-10K database. The OIQ-10K database and the related source codes are available at https://github.com/WenJuing/IQCaption360.
中文摘要:本研究建立了最大的全景图像质量评估数据库OIQ-10K,并提出IQCaption360模型,能生成质量描述文本,其性能显著优于现有最优方法。
English Summary: This study introduces the largest omnidirectional image quality assessment (OIQA) database, OIQ-10K, and proposes IQCaption360, a novel model that generates quality captions and significantly outperforms existing methods.

Authors:Nie Lin, Takehiko Ohkawa, Yifei Huang, Mingfang Zhang, Minjie Cai, Ming Li, Ryosuke Furuta, Yoichi Sato
Title: SiMHand: Mining Similar Hands for Large-Scale 3D Hand Pose Pre-training
Abstract:
We present a framework for pre-training of 3D hand pose estimation from in-the-wild hand images sharing with similar hand characteristics, dubbed SimHand. Pre-training with large-scale images achieves promising results in various tasks, but prior methods for 3D hand pose pre-training have not fully utilized the potential of diverse hand images accessible from in-the-wild videos. To facilitate scalable pre-training, we first prepare an extensive pool of hand images from in-the-wild videos and design our pre-training method with contrastive learning. Specifically, we collect over 2.0M hand images from recent human-centric videos, such as 100DOH and Ego4D. To extract discriminative information from these images, we focus on the similarity of hands: pairs of non-identical samples with similar hand poses. We then propose a novel contrastive learning method that embeds similar hand pairs closer in the feature space. Our method not only learns from similar samples but also adaptively weights the contrastive learning loss based on inter-sample distance, leading to additional performance gains. Our experiments demonstrate that our method outperforms conventional contrastive learning approaches that produce positive pairs sorely from a single image with data augmentation. We achieve significant improvements over the state-of-the-art method (PeCLR) in various datasets, with gains of 15% on FreiHand, 10% on DexYCB, and 4% on AssemblyHands. Our code is available at https://github.com/ut-vision/SiMHand.
Chinese: 我们提出了SimHand框架,通过利用野外多样手部图像进行预训练,并采用新颖的对比学习方法,在多个数据集上显著超越了现有最先进方法的性能。
English: We introduce SimHand, a framework that enhances 3D hand pose estimation through pre-training on diverse in-the-wild hand images using a novel contrastive learning method, which significantly outperforms existing approaches across multiple datasets.

Authors:Nicholas DiSalvo
Title: Steganographic Embeddings as an Effective Data Augmentation
Abstract:
Image Steganography is a cryptographic technique that embeds secret information into an image, ensuring the hidden data remains undetectable to the human eye while preserving the image's original visual integrity. Least Significant Bit (LSB) Steganography achieves this by replacing the k least significant bits of an image with the k most significant bits of a secret image, maintaining the appearance of the original image while simultaneously encoding the essential elements of the hidden data. In this work, we shift away from conventional applications of steganography in deep learning and explore its potential from a new angle. We present experimental results on CIFAR-10 showing that LSB Steganography, when used as a data augmentation strategy for downstream computer vision tasks such as image classification, can significantly improve the training efficiency of deep neural networks. It can also act as an implicit, uniformly discretized piecewise linear approximation of color augmentations such as (brightness, contrast, hue, and saturation), without introducing additional training overhead through a new joint image training regime that disregards the need for tuning sensitive augmentation hyperparameters.
中文: 本研究证明,最低有效位隐写术可作为计算机视觉任务的有效数据增强方法,在无需额外调整超参数的情况下,既提升了深度神经网络的训练效率,又实现了对色彩增强的隐式分段线性逼近。
English: This study demonstrates that Least Significant Bit Steganography serves as an effective data augmentation method for computer vision tasks, enhancing deep neural network training efficiency while approximating color augmentations without extra hyperparameter tuning.

Authors:Shilong Hou, Ruilin Shang, Zi Long, Xianghua Fu, Yin Chen
Title: A General Pseudonymization Framework for Cloud-Based LLMs: Replacing Privacy Information in Controlled Text Generation
Abstract:
An increasing number of companies have begun providing services that leverage cloud-based large language models (LLMs), such as ChatGPT. However, this development raises substantial privacy concerns, as users' prompts are transmitted to and processed by the model providers. Among the various privacy protection methods for LLMs, those implemented during the pre-training and fine-tuning phrases fail to mitigate the privacy risks associated with the remote use of cloud-based LLMs by users. On the other hand, methods applied during the inference phrase are primarily effective in scenarios where the LLM's inference does not rely on privacy-sensitive information. In this paper, we outline the process of remote user interaction with LLMs and, for the first time, propose a detailed definition of a general pseudonymization framework applicable to cloud-based LLMs. The experimental results demonstrate that the proposed framework strikes an optimal balance between privacy protection and utility. The code for our method is available to the public at https://github.com/Mebymeby/Pseudonymization-Framework.
中文: 本文针对云端大语言模型提出了一种假名化框架,旨在解决用户交互过程中的隐私风险,并在隐私保护与实用性之间实现了最佳平衡。
English: This paper introduces a pseudonymization framework for cloud-based large language models to address privacy risks during user interactions, achieving an optimal balance between privacy protection and utility.

Authors:Mengqiao Liu, Tevin Wang, Cassandra A. Cohen, Sarah Li, Chenyan Xiong
Title: Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews
Abstract:
Which large language model (LLM) is better? Every evaluation tells a story, but what do users really think about current LLMs? This paper presents CLUE, an LLM-powered interviewer that conducts in-the-moment user experience interviews, right after users interact with LLMs, and automatically gathers insights about user opinions from massive interview logs. We conduct a study with thousands of users to understand user opinions on mainstream LLMs, recruiting users to first chat with a target LLM and then be interviewed by CLUE. Our experiments demonstrate that CLUE captures interesting user opinions, e.g., the bipolar views on the displayed reasoning process of DeepSeek-R1 and demands for information freshness and multi-modality. Our code and data are at https://github.com/cxcscmu/LLM-Interviewer.
中文: 本文提出CLUE,一个由大语言模型驱动的访谈系统,能在用户与模型交互后即时进行用户体验访谈,并通过海量访谈数据自动分析用户对主流模型(如DeepSeek-R1)的真实看法。
English: This paper introduces CLUE, an LLM-powered interviewer that conducts real-time user experience interviews after interactions with LLMs, automatically extracting insights from large-scale logs to reveal user opinions on mainstream models like DeepSeek-R1.

Authors:Jinchuan Tian, Jiatong Shi, William Chen, Siddhant Arora, Yoshiki Masuyama, Takashi Maekaku, Yihan Wu, Junyi Peng, Shikhar Bharadwaj, Yiwen Zhao, Samuele Cornell, Yifan Peng, Xiang Yue, Chao-Han Huck Yang, Graham Neubig, Shinji Watanabe
Title: ESPnet-SpeechLM: An Open Speech Language Model Toolkit
Abstract:
We present ESPnet-SpeechLM, an open toolkit designed to democratize the development of speech language models (SpeechLMs) and voice-driven agentic applications. The toolkit standardizes speech processing tasks by framing them as universal sequential modeling problems, encompassing a cohesive workflow of data preprocessing, pre-training, inference, and task evaluation. With ESPnet-SpeechLM, users can easily define task templates and configure key settings, enabling seamless and streamlined SpeechLM development. The toolkit ensures flexibility, efficiency, and scalability by offering highly configurable modules for every stage of the workflow. To illustrate its capabilities, we provide multiple use cases demonstrating how competitive SpeechLMs can be constructed with ESPnet-SpeechLM, including a 1.7B-parameter model pre-trained on both text and speech tasks, across diverse benchmarks. The toolkit and its recipes are fully transparent and reproducible at: https://github.com/espnet/espnet/tree/speechlm.
中文: ESPnet-SpeechLM 是一个开源工具包,通过标准化工作流程、可配置模块和包括 17 亿参数模型在内的用例,简化了语音语言模型和语音驱动应用的开发。
English: ESPnet-SpeechLM is an open toolkit that simplifies the development of speech language models and voice-driven applications through standardized workflows, configurable modules, and demonstrated use cases, including a 1.7B-parameter model.

Authors:Madhurima Chakraborty, Peter Pirkelbauer, Qing Yi
Title: FormalSpecCpp: A Dataset of C++ Formal Specifications created using LLMs
Abstract:
FormalSpecCpp is a dataset designed to fill the gap in standardized benchmarks for verifying formal specifications in C++ programs. To the best of our knowledge, this is the first comprehensive collection of C++ programs with well-defined preconditions and postconditions. It provides a structured benchmark for evaluating specification inference tools and testing theaccuracy of generated specifications. Researchers and developers can use this dataset to benchmark specification inference tools,fine-tune Large Language Models (LLMs) for automated specification generation, and analyze the role of formal specifications in improving program verification and automated testing. By making this dataset publicly available, we aim to advance research in program verification, specification inference, and AI-assisted software development. The dataset and the code are available at https://github.com/MadhuNimmo/FormalSpecCpp.
中文: FormalSpecCpp是首个包含形式化规范的C++程序综合数据集,旨在为规范推断工具提供基准测试,并推动程序验证和AI辅助开发的研究。
English: FormalSpecCpp is the first comprehensive dataset of C++ programs with formal specifications, designed to benchmark specification inference tools and advance research in program verification and AI-assisted development.

Authors:Yifan Jiang, Yannick Lemaréchal, Sophie Plante, Josée Bafaro, Jessica Abi-Rjeile, Philippe Joubert, Philippe Després, Venkata Manem
Title: Lung-DDPM: Semantic Layout-guided Diffusion Models for Thoracic CT Image Synthesis
Abstract:
With the rapid development of artificial intelligence (AI), AI-assisted medical imaging analysis demonstrates remarkable performance in early lung cancer screening. However, the costly annotation process and privacy concerns limit the construction of large-scale medical datasets, hampering the further application of AI in healthcare. To address the data scarcity in lung cancer screening, we propose Lung-DDPM, a thoracic CT image synthesis approach that effectively generates high-fidelity 3D synthetic CT images, which prove helpful in downstream lung nodule segmentation tasks. Our method is based on semantic layout-guided denoising diffusion probabilistic models (DDPM), enabling anatomically reasonable, seamless, and consistent sample generation even from incomplete semantic layouts. Our results suggest that the proposed method outperforms other state-of-the-art (SOTA) generative models in image quality evaluation and downstream lung nodule segmentation tasks. Specifically, Lung-DDPM achieved superior performance on our large validation cohort, with a Fréchet inception distance (FID) of 0.0047, maximum mean discrepancy (MMD) of 0.0070, and mean squared error (MSE) of 0.0024. These results were 7.4$\times$, 3.1$\times$, and 29.5$\times$ better than the second-best competitors, respectively. Furthermore, the lung nodule segmentation model, trained on a dataset combining real and Lung-DDPM-generated synthetic samples, attained a Dice Coefficient (Dice) of 0.3914 and sensitivity of 0.4393. This represents 8.8% and 18.6% improvements in Dice and sensitivity compared to the model trained solely on real samples. The experimental results highlight Lung-DDPM's potential for a broader range of medical imaging applications, such as general tumor segmentation, cancer survival estimation, and risk prediction. The code and pretrained models are available at https://github.com/Manem-Lab/Lung-DDPM/.
中文: 本研究提出Lung-DDPM,一种基于语义布局引导的扩散模型,能生成高保真三维合成CT图像以解决肺癌筛查数据稀缺问题,显著提升了后续肺结节分割任务的性能表现。
English: The study introduces Lung-DDPM, a semantic layout-guided diffusion model that generates high-fidelity 3D synthetic CT images to overcome data scarcity in lung cancer screening, significantly enhancing downstream nodule segmentation performance.

Authors:Jianglin Lu, Yixuan Liu, Yitian Zhang, Yun Fu
Title: Scale-Free Graph-Language Models
Abstract:
Graph-language models (GLMs) have demonstrated great potential in graph-based semi-supervised learning. A typical GLM consists of two key stages: graph generation and text embedding, which are usually implemented by inferring a latent graph and finetuning a language model (LM), respectively. However, the former often relies on artificial assumptions about the underlying edge distribution, while the latter requires extensive data annotations. To tackle these challenges, this paper introduces a novel GLM that integrates graph generation and text embedding within a unified framework. Specifically, for graph generation, we leverage an inherent characteristic of real edge distribution--the scale-free property--as a structural prior. We unexpectedly find that this natural property can be effectively approximated by a simple k-nearest neighbor (KNN) graph. For text embedding, we develop a graph-based pseudo-labeler that utilizes scale-free graphs to provide complementary supervision for improved LM finetuning. Extensive experiments on representative datasets validate our findings on the scale-free structural approximation of KNN graphs and demonstrate the effectiveness of integrating graph generation and text embedding with a real structural prior. Our code is available at https://github.com/Jianglin954/SFGL.
中文: 本文提出了一种统一的图语言模型,利用KNN图近似真实图的尺度无关特性,无需大量标注即可同时改进图生成和文本嵌入。
English: This paper introduces a unified graph-language model that leverages the scale-free property of real graphs, approximated by KNN graphs, to improve both graph generation and text embedding without extensive annotations.

Authors:Luoying Hao, Yan Hu, Yang Yue, Li Wu, Huazhu Fu, Jinming Duan, Jiang Liu
Title: Hierarchical Context Transformer for Multi-level Semantic Scene Understanding
Abstract:
A comprehensive and explicit understanding of surgical scenes plays a vital role in developing context-aware computer-assisted systems in the operating theatre. However, few works provide systematical analysis to enable hierarchical surgical scene understanding. In this work, we propose to represent the tasks set [phase recognition --> step recognition --> action and instrument detection] as multi-level semantic scene understanding (MSSU). For this target, we propose a novel hierarchical context transformer (HCT) network and thoroughly explore the relations across the different level tasks. Specifically, a hierarchical relation aggregation module (HRAM) is designed to concurrently relate entries inside multi-level interaction information and then augment task-specific features. To further boost the representation learning of the different tasks, inter-task contrastive learning (ICL) is presented to guide the model to learn task-wise features via absorbing complementary information from other tasks. Furthermore, considering the computational costs of the transformer, we propose HCT+ to integrate the spatial and temporal adapter to access competitive performance on substantially fewer tunable parameters. Extensive experiments on our cataract dataset and a publicly available endoscopic PSI-AVA dataset demonstrate the outstanding performance of our method, consistently exceeding the state-of-the-art methods by a large margin. The code is available at https://github.com/Aurora-hao/HCT.
中文摘要:本研究提出了一种分层上下文变换器(HCT)网络,通过多级关系聚合和任务间对比学习实现层次化手术场景理解,在显著降低计算成本的同时大幅超越现有最优方法。
English Summary: This study introduces a hierarchical context transformer (HCT) network for multi-level surgical scene understanding, integrating relation aggregation and contrastive learning to achieve state-of-the-art performance with reduced computational costs.

Authors:Junliang Chen, Huaiyuan Xu, Yi Wang, Lap-Pui Chau
Title: OccProphet: Pushing Efficiency Frontier of Camera-Only 4D Occupancy Forecasting with Observer-Forecaster-Refiner Framework
Abstract:
Predicting variations in complex traffic environments is crucial for the safety of autonomous driving. Recent advancements in occupancy forecasting have enabled forecasting future 3D occupied status in driving environments by observing historical 2D images. However, high computational demands make occupancy forecasting less efficient during training and inference stages, hindering its feasibility for deployment on edge agents. In this paper, we propose a novel framework, i.e., OccProphet, to efficiently and effectively learn occupancy forecasting with significantly lower computational requirements while improving forecasting accuracy. OccProphet comprises three lightweight components: Observer, Forecaster, and Refiner. The Observer extracts spatio-temporal features from 3D multi-frame voxels using the proposed Efficient 4D Aggregation with Tripling-Attention Fusion, while the Forecaster and Refiner conditionally predict and refine future occupancy inferences. Experimental results on nuScenes, Lyft-Level5, and nuScenes-Occupancy datasets demonstrate that OccProphet is both training- and inference-friendly. OccProphet reduces 58\%$\sim$78\% of the computational cost with a 2.6$\times$ speedup compared with the state-of-the-art Cam4DOcc. Moreover, it achieves 4\%$\sim$18\% relatively higher forecasting accuracy. Code and models are publicly available at https://github.com/JLChen-C/OccProphet.
Chinese: 本文提出OccProphet轻量级框架,在显著降低计算成本的同时,提高了自动驾驶中3D占据预测的准确性。
English: This paper introduces OccProphet, a lightweight framework that significantly reduces computational costs while improving forecasting accuracy for 3D occupancy prediction in autonomous driving.

Authors:Weiqiao Shan, Yuang Li, Yuhao Zhang, Yingfeng Luo, Chen Xu, Xiaofeng Zhao, Long Meng, Yunfei Lu, Min Zhang, Hao Yang, Tong Xiao, Jingbo Zhu
Title: Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders
Abstract:
Connecting audio encoders with large language models (LLMs) allows the LLM to perform various audio understanding tasks, such as automatic speech recognition (ASR) and audio captioning (AC). Most research focuses on training an adapter layer to generate a unified audio feature for the LLM. However, different tasks may require distinct features that emphasize either semantic or acoustic aspects, making task-specific audio features more desirable. In this paper, we propose Prompt-aware Mixture (PaM) to enhance the Speech LLM that uses multiple audio encoders. Our approach involves using different experts to extract different features based on the prompt that indicates different tasks. Experiments demonstrate that with PaM, only one Speech LLM surpasses the best performances achieved by all single-encoder Speech LLMs on ASR, Speaker Number Verification, and AC tasks. PaM also outperforms other feature fusion baselines, such as concatenation and averaging. Our code would be available at: https://github.com/shanweiqiao/PaM
中文: 本文提出Prompt-aware Mixture (PaM)方法,通过任务提示从多个音频编码器中提取差异化特征来增强语音大语言模型,在多项任务中超越了单编码器模型及其他特征融合基准。
English: This paper introduces Prompt-aware Mixture (PaM), a method that enhances Speech LLMs by using task-specific prompts to extract distinct audio features from multiple encoders, outperforming single-encoder models and other fusion techniques across various tasks.

Authors:Xiaoyu Chen, Changde Du, Che Liu, Yizhe Wang, Huiguang He
Title: BP-GPT: Auditory Neural Decoding Using fMRI-prompted LLM
Abstract:
Decoding language information from brain signals represents a vital research area within brain-computer interfaces, particularly in the context of deciphering the semantic information from the fMRI signal. Although existing work uses LLM to achieve this goal, their method does not use an end-to-end approach and avoids the LLM in the mapping of fMRI-to-text, leaving space for the exploration of the LLM in auditory decoding. In this paper, we introduce a novel method, the Brain Prompt GPT (BP-GPT). By using the brain representation that is extracted from the fMRI as a prompt, our method can utilize GPT-2 to decode fMRI signals into stimulus text. Further, we introduce the text prompt and align the fMRI prompt to it. By introducing the text prompt, our BP-GPT can extract a more robust brain prompt and promote the decoding of pre-trained LLM. We evaluate our BP-GPT on the open-source auditory semantic decoding dataset and achieve a significant improvement up to 4.61 on METEOR and 2.43 on BERTScore across all the subjects compared to the state-of-the-art method. The experimental results demonstrate that using brain representation as a prompt to further drive LLM for auditory neural decoding is feasible and effective. The code is available at https://github.com/1994cxy/BP-GPT.
中文: 本文提出的BP-GPT方法通过将fMRI信号转换为脑提示来驱动GPT-2进行听觉语义解码,相比现有最佳方法实现了显著性能提升。
English: This paper introduces BP-GPT, an end-to-end method that uses fMRI-derived brain prompts to drive GPT-2 for auditory semantic decoding, achieving significant improvements over state-of-the-art methods.

Authors:Chuan Cui, Kejiang Chen, Zhihua Wei, Wen Shen, Weiming Zhang, Nenghai Yu
Title: M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment
Abstract:
The rapid advancement of AI-generated image (AIGI) models presents new challenges for evaluating image quality, particularly across three aspects: perceptual quality, prompt correspondence, and authenticity. To address these challenges, we introduce M3-AGIQA, a comprehensive framework that leverages Multimodal Large Language Models (MLLMs) to enable more human-aligned, holistic evaluation of AI-generated images across both visual and textual domains. Besides, our framework features a structured multi-round evaluation process, generating and analyzing intermediate image descriptions to provide deeper insight into these three aspects. By aligning model outputs more closely with human judgment, M3-AGIQA delivers robust and interpretable quality scores. Extensive experiments on multiple benchmarks demonstrate that our method achieves state-of-the-art performance on tested datasets and aspects, and exhibits strong generalizability in most cross-dataset settings. Code is available at https://github.com/strawhatboy/M3-AGIQA.
中文: 针对AI生成图像评估的挑战,我们提出M3-AGIQA框架,利用多模态大语言模型进行涵盖感知质量、提示对应性和真实性的全面、人类对齐评估,实现了最先进的性能和强大的泛化能力。
English: To address the challenges in evaluating AI-generated images, we propose M3-AGIQA, a framework using Multimodal Large Language Models for holistic, human-aligned assessment across perceptual quality, prompt correspondence, and authenticity, achieving state-of-the-art performance and strong generalizability.

Authors:Ruofei Bai, Shenghai Yuan, Kun Li, Hongliang Guo, Wei-Yun Yau, Lihua Xie
Title: Realm: Real-Time Line-of-Sight Maintenance in Multi-Robot Navigation with Unknown Obstacles
Abstract:
Multi-robot navigation in complex environments relies on inter-robot communication and mutual observations for coordination and situational awareness. This paper studies the multi-robot navigation problem in unknown environments with line-of-sight (LoS) connectivity constraints. While previous works are limited to known environment models to derive the LoS constraints, this paper eliminates such requirements by directly formulating the LoS constraints between robots from their real-time point cloud measurements, leveraging point cloud visibility analysis techniques. We propose a novel LoS-distance metric to quantify both the urgency and sensitivity of losing LoS between robots considering potential robot movements. Moreover, to address the imbalanced urgency of losing LoS between two robots, we design a fusion function to capture the overall urgency while generating gradients that facilitate robots' collaborative movement to maintain LoS. The LoS constraints are encoded into a potential function that preserves the positivity of the Fiedler eigenvalue of the robots' network graph to ensure connectivity. Finally, we establish a LoS-constrained exploration framework that integrates the proposed connectivity controller. We showcase its applications in multi-robot exploration in complex unknown environments, where robots can always maintain the LoS connectivity through distributed sensing and communication, while collaboratively mapping the unknown environment. The implementations are open-sourced at https://github.com/bairuofei/LoS_constrained_navigation.
Chinese Summary: 本文提出了一种在未知环境中通过实时点云分析和新型视距距离度量来保持多机器人视距连通性的导航系统,实现了机器人在协作探索的同时维持持续通信。
English Summary: This paper introduces a multi-robot navigation system that maintains line-of-sight connectivity in unknown environments through real-time point cloud analysis and a novel LoS-distance metric, enabling collaborative exploration while ensuring continuous communication.

Authors:Tianjie Ju, Bowen Wang, Hao Fei, Mong-Li Lee, Wynne Hsu, Yun Li, Qianren Wang, Pengzhou Cheng, Zongru Wu, Zhuosheng Zhang, Gongshen Liu
Title: Investigating the Adaptive Robustness with Knowledge Conflicts in LLM-based Multi-Agent Systems
Abstract:
Recent advances in Large Language Models (LLMs) have upgraded them from sophisticated text generators to autonomous agents capable of corporation and tool use in multi-agent systems (MASs). However, the robustness of these LLM-based MASs, especially under knowledge conflicts, remains unclear. In this paper, we design four comprehensive metrics to investigate the robustness of MASs when facing mild or task-critical knowledge conflicts. We first analyze mild knowledge conflicts introduced by heterogeneous agents and find that they do not harm system robustness but instead improve collaborative decision-making. Next, we investigate task-critical knowledge conflicts by synthesizing knowledge conflicts and embedding them into one of the agents. Our results show that these conflicts have surprisingly little to no impact on MAS robustness. Furthermore, we observe that MASs demonstrate certain self-repairing capabilities by reducing their reliance on knowledge conflicts and adopting alternative solution paths to maintain stability. Finally, we conduct ablation studies on the knowledge conflict number, agent number, and interaction rounds, finding that the self-repairing capability of MASs has intrinsic limits, and all findings hold consistently across various factors. Our code is publicly available at https://github.com/wbw625/MultiAgentRobustness.
中文: 最新研究表明,大语言模型智能体间的普遍分歧通过避免过早共识和拓展解决方案探索来提升集体决策,而任务关键分歧虽严重阻碍推理任务,但因存在替代解决路径对编程影响有限。
English: Recent research demonstrates that general disagreements among large language model agents enhance collective decision-making by preventing premature consensus and expanding solution exploration, whereas task-critical disagreements significantly hinder reasoning tasks but have minimal impact on programming due to alternative solution paths.

Authors:Tianjie Ju, Bowen Wang, Hao Fei, Mong-Li Lee, Wynne Hsu, Yun Li, Qianren Wang, Pengzhou Cheng, Zongru Wu, Haodong Zhao, Zhuosheng Zhang, Gongshen Liu
Title: When Disagreements Elicit Robustness: Investigating Self-Repair Capabilities under LLM Multi-Agent Disagreements
Abstract:
Recent advances in Large Language Models (LLMs) have upgraded them from sophisticated text generators to autonomous agents capable of cooperation and tool use in multi-agent systems (MAS). However, it remains unclear how disagreements shape collective decision-making. In this paper, we revisit the role of disagreement and argue that general, partially overlapping disagreements prevent premature consensus and expand the explored solution space, while disagreements on task-critical steps can derail collaboration depending on the topology of solution paths. We investigate two collaborative settings with distinct path structures: collaborative reasoning (CounterFact, MQuAKE-cf), which typically follows a single evidential chain, whereas collaborative programming (HumanEval, GAIA) often adopts multiple valid implementations. Disagreements are instantiated as general heterogeneity among agents and as task-critical counterfactual knowledge edits injected into context or parameters. Experiments reveal that general disagreements consistently improve success by encouraging complementary exploration. By contrast, task-critical disagreements substantially reduce success on single-path reasoning, yet have a limited impact on programming, where agents can choose alternative solutions. Trace analyses show that MAS frequently bypasses the edited facts in programming but rarely does so in reasoning, revealing an emergent self-repair capability that depends on solution-path rather than scale alone. Our code is available at https://github.com/wbw625/MultiAgentRobustness.
中文: 最新研究表明,大语言模型智能体间的普遍分歧通过避免过早共识和拓展解决方案探索来提升集体决策,而任务关键分歧虽严重阻碍推理任务,但因存在替代解决路径对编程影响有限。
English: Recent research demonstrates that general disagreements among large language model agents enhance collective decision-making by preventing premature consensus and expanding solution exploration, whereas task-critical disagreements significantly hinder reasoning tasks but have minimal impact on programming due to alternative solution paths.

Authors:Ebenezer Tarubinga, Jenifer Kalafatovich, Seong-Whan Lee
Title: CW-BASS: Confidence-Weighted Boundary-Aware Learning for Semi-Supervised Semantic Segmentation
Abstract:
Semi-supervised semantic segmentation (SSSS) aims to improve segmentation performance by utilizing large amounts of unlabeled data with limited labeled samples. Existing methods often suffer from coupling, where over-reliance on initial labeled data leads to suboptimal learning; confirmation bias, where incorrect predictions reinforce themselves repeatedly; and boundary blur caused by limited boundary-awareness and ambiguous edge cues. To address these issues, we propose CW-BASS, a novel framework for SSSS. In order to mitigate the impact of incorrect predictions, we assign confidence weights to pseudo-labels. Additionally, we leverage boundary-delineation techniques, which, despite being extensively explored in weakly-supervised semantic segmentation (WSSS), remain underutilized in SSSS. Specifically, our method: (1) reduces coupling via a confidence-weighted loss that adjusts pseudo-label influence based on their predicted confidence scores, (2) mitigates confirmation bias with a dynamic thresholding mechanism that learns to filter out pseudo-labels based on model performance, (3) tackles boundary blur using a boundary-aware module to refine segmentation near object edges, and (4) reduces label noise through a confidence decay strategy that progressively refines pseudo-labels during training. Extensive experiments on Pascal VOC 2012 and Cityscapes demonstrate that CW-BASS achieves state-of-the-art performance. Notably, CW-BASS achieves a 65.9% mIoU on Cityscapes under a challenging and underexplored 1/30 (3.3%) split (100 images), highlighting its effectiveness in limited-label settings. Our code is available at https://github.com/psychofict/CW-BASS.
中文: CW-BASS框架通过引入置信度加权的伪标签和边界描绘技术,有效解决了半监督语义分割中的耦合、确认偏差和边界模糊问题,在有限标注数据下于多个基准数据集上取得了领先性能。
English: The CW-BASS framework enhances semi-supervised semantic segmentation by introducing confidence-weighted pseudo-labeling and boundary-delineation techniques to address coupling, confirmation bias, and boundary blur issues, achieving state-of-the-art results on benchmark datasets with limited labeled data.

Authors:Yen-Che Hsiao, Abhishek Dutta
Title: Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps
Abstract:
This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data, including GPT2, SmolLM2, OpenELM, TinyLlama, Stable LM, and Gemma 2. We identify a critical parameter threshold (~1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning. Specifically, models above this threshold achieve better success rates in chain-of-thought (CoT) prompting for deductive reasoning tasks, especially those requiring longer reasoning chains, such as proof by contradiction and disjunction elimination. To address limitations in sub-threshold models, we demonstrate that fine-tuning with task-specific exemplars substantially enhances reasoning performance, enabling accurate CoT generation even without additional exemplars in the prompt for tasks with shorter reasoning chains. Finally, our analysis of attention maps reveals that models capable of generating correct CoTs exhibit higher token-level attention scores on subsequent correct tokens and the correct parts of speech, providing interpretability insights into reasoning processes. These findings collectively advance understanding of reasoning capabilities in decoder-only transformer-based models. The code is available at: https://github.com/AnnonymousForPapers/CoT_Reasoning_Test.
中文: 本研究发现仅解码器Transformer语言模型需达到约16亿参数的关键阈值才能显著提升推理能力,注意力图分析为思维链生成过程提供了可解释性依据。
English: This study reveals that decoder-only transformer language models require a critical parameter threshold of approximately 1.6 billion to achieve significant reasoning improvements, with attention map analysis providing interpretability for chain-of-thought generation processes.

Authors:Yeonjun In, Wonjoong Kim, Kanghoon Yoon, Sungchul Kim, Mehrab Tanjim, Kibum Kim, Chanyoung Park
Title: Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models
Abstract:
As the use of large language model (LLM) agents continues to grow, their safety vulnerabilities have become increasingly evident. Extensive benchmarks evaluate various aspects of LLM safety by defining the safety relying heavily on general standards, overlooking user-specific standards. However, safety standards for LLM may vary based on a user-specific profiles rather than being universally consistent across all users. This raises a critical research question: Do LLM agents act safely when considering user-specific safety standards? Despite its importance for safe LLM use, no benchmark datasets currently exist to evaluate the user-specific safety of LLMs. To address this gap, we introduce U-SAFEBENCH, the first benchmark designed to assess user-specific aspect of LLM safety. Our evaluation of 18 widely used LLMs reveals current LLMs fail to act safely when considering user-specific safety standards, marking a new discovery in this field. To address this vulnerability, we propose a simple remedy based on chain-of-thought, demonstrating its effectiveness in improving user-specific safety. Our benchmark and code are available at https://github.com/yeonjun-in/U-SafeBench.
中文: 本研究提出了首个评估大语言模型用户特定安全性的基准U-SAFEBENCH,发现现有模型无法满足个性化安全标准,并通过思维链方法提出了有效的改进方案。
English: The study introduces U-SAFEBENCH, the first benchmark evaluating user-specific safety in LLMs, revealing current models' failure to meet personalized safety standards and proposing an effective chain-of-thought mitigation strategy.

Authors:Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal
Title: UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning
Abstract:
User specifications or legal frameworks often require information to be removed from pretrained models, including large language models (LLMs). This requires deleting or "forgetting" a set of data points from an already-trained model, which typically degrades its performance on other data points. Thus, a balance must be struck between removing information and keeping the model's other abilities intact, with a failure to balance this trade-off leading to poor deletion or an unusable model. To this end, we propose UPCORE (Utility-Preserving Coreset Selection), a method-agnostic data selection framework for mitigating collateral damage during unlearning. Finding that the model damage is correlated with the variance of the model's representations on the forget set, we selectively prune the forget set to remove outliers, thereby minimizing model degradation after unlearning. Across three standard unlearning methods, UPCORE consistently achieves a superior balance between the competing objectives of deletion efficacy and model preservation. To better evaluate this trade-off, we introduce a new metric, measuring the area-under-the-curve (AUC) across standard metrics. Our results show that UPCORE improves both standard metrics and AUC, benefiting from positive transfer between the coreset and pruned points while reducing negative transfer from the forget set to points outside of it.
中文:UPCORE框架通过选择性修剪待遗忘数据中的异常值,在不同遗忘方法中实现了删除效果与模型效用保持的最佳平衡。
English: The proposed UPCORE framework selectively prunes outliers from data to be forgotten, effectively balancing deletion efficacy and model utility preservation across various unlearning methods.

Authors:Mohsen Hariri, Alan Luo, Mohammadreza Nemati, Lam Nguyen, Shaochen Zhong, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary
Title: Quantize What Counts: Bit Allocation Insights Informed by Spectral Gaps in Keys and Values
Abstract:
Large Language Models (LLMs) have introduced significant advancements to the capabilities of Natural Language Processing (NLP) in recent years. However, as these models continue to scale in size, memory constraints pose substantial challenge. Key and Value cache (KV cache) quantization has been well-documented as a promising solution to this limitation. In this work, we provide two novel theorems aimed at enhancing KV quantization methods. Our first theorem, termed Key-Value Norm Disparity, states that the key weight matrices by nature carry richer information compared to the value weight matrices, as evidenced by higher spectral and Frobenius norms across most of the layers. Our second theorem, Key-Driven Quantization, posits that prioritizing the quantization precision of keys over values induces significant improvements to the overall quantization performance. In particular, assigning greater precision to the keys compared to the values achieves a higher degree of precision reduction with minimal impact on model accuracy. We validate these theorems through theory and extensive experiments on several state-of-the-art LLM architectures and benchmarks. These findings offer valuable guidelines for improving KV cache quantization strategies, facilitating more efficient memory utilization without compromising model performance across diverse NLP tasks. Source code is available at https://github.com/mohsenhariri/spectral-kv.
中文: 本文针对大语言模型中的KV缓存量化提出两项创新理论,证明在量化过程中优先处理键而非值能在保持模型精度的同时显著提升内存使用效率。
English: This paper introduces two novel theorems for enhancing KV cache quantization in large language models, demonstrating that prioritizing key quantization over value quantization improves memory efficiency while maintaining model accuracy.

Authors:Tong Zhao, Yozen Liu, Matthew Kolodner, Kyle Montemayor, Elham Ghazizadeh, Ankit Batra, Zihao Fan, Xiaobin Gao, Xuan Guo, Jiwen Ren, Serim Park, Peicheng Yu, Jun Yu, Shubham Vij, Neil Shah
Title: GiGL: Large-Scale Graph Neural Networks at Snapchat
Abstract:
Recent advances in graph machine learning (ML) with the introduction of Graph Neural Networks (GNNs) have led to a widespread interest in applying these approaches to business applications at scale. GNNs enable differentiable end-to-end (E2E) learning of model parameters given graph structure which enables optimization towards popular node, edge (link) and graph-level tasks. While the research innovation in new GNN layers and training strategies has been rapid, industrial adoption and utility of GNNs has lagged considerably due to the unique scale challenges that large-scale graph ML problems create. In this work, we share our approach to training, inference, and utilization of GNNs at Snapchat. To this end, we present GiGL (Gigantic Graph Learning), an open-source library to enable large-scale distributed graph ML to the benefit of researchers, ML engineers, and practitioners. We use GiGL internally at Snapchat to manage the heavy lifting of GNN workflows, including graph data preprocessing from relational DBs, subgraph sampling, distributed training, inference, and orchestration. GiGL is designed to interface cleanly with open-source GNN modeling libraries prominent in academia like PyTorch Geometric (PyG), while handling scaling and productionization challenges that make it easier for internal practitioners to focus on modeling. GiGL is used in multiple production settings, and has powered over 35 launches across multiple business domains in the last 2 years in the contexts of friend recommendation, content recommendation and advertising. This work details high-level design and tools the library provides, scaling properties, case studies in diverse business settings with industry-scale graphs, and several key lessons learned in employing graph ML at scale on large social data. GiGL is open-sourced at https://github.com/Snapchat/GiGL.
中文: 图神经网络(GNNs)的最新进展激发了商业应用兴趣,但工业应用因规模挑战而滞后,因此Snapchat开发了开源库GiGL,以支持大规模分布式图机器学习在生产中的运用。
English: Recent advances in Graph Neural Networks (GNNs) have spurred interest in business applications, but industrial adoption lags due to scaling challenges, leading to the development of GiGL, an open-source library by Snapchat that facilitates large-scale distributed graph machine learning for production use.

Authors:Anthony Fuller, Yousef Yassin, Daniel G. Kyrollos, Evan Shelhamer, James R. Green
Title: Simpler Fast Vision Transformers with a Jumbo CLS Token
Abstract:
We introduce a simple enhancement of vision transformers (ViTs) to improve accuracy while maintaining throughput. Our approach, Jumbo, creates a wider CLS token, which is split to match the patch token width before attention, processed with self-attention, and reassembled. After attention, Jumbo applies a dedicated, wider FFN to this token. Since there is only one Jumbo token, its cost is minimal, and because we share this FFN across layers, its parameter count is controlled. Jumbo significantly improves over ViT+Registers on ImageNet-1K and ImageNet-21K. These gains are largest at small sizes / high speeds, e.g., ViT-nano+Jumbo outperforms ViT-nano+Registers by 13%. In fact, our Jumbo models are so efficient that they outperform specialized compute-efficient models while preserving the architectural advantages of plain ViTs, such as support for token dropping and other modalities. Accordingly, we demonstrate that Jumbo excels in these two settings via masked autoencoding and on a suite of time series benchmarks. Code and weights available: https://github.com/antofuller/jumbo
中文: Jumbo通过拓宽CLS标记并应用专用前馈网络来增强视觉变换器,以极低成本显著提升精度,尤其在小规模高速模型中表现突出。
English: Jumbo enhances vision transformers by widening the CLS token and applying a dedicated feed-forward network, significantly improving accuracy with minimal cost, especially in small, high-speed models.

Authors:Manuel Knott, Ignacio Serna, Ethan Mann, Pietro Perona
Title: A Rapid Test for Accuracy and Bias of Face Recognition Technology
Abstract:
Measuring the accuracy of face recognition (FR) systems is essential for improving performance and ensuring responsible use. Accuracy is typically estimated using large annotated datasets, which are costly and difficult to obtain. We propose a novel method for 1:1 face verification that benchmarks FR systems quickly and without manual annotation, starting from approximate labels (e.g., from web search results). Unlike previous methods for training set label cleaning, ours leverages the embedding representation of the models being evaluated, achieving high accuracy in smaller-sized test datasets. Our approach reliably estimates FR accuracy and ranking, significantly reducing the time and cost of manual labeling. We also introduce the first public benchmark of five FR cloud services, revealing demographic biases, particularly lower accuracy for Asian women. Our rapid test method can democratize FR testing, promoting scrutiny and responsible use of the technology. Our method is provided as a publicly accessible tool at https://github.com/caltechvisionlab/frt-rapid-test
中文: 本文提出了一种无需人工标注的快速人脸识别系统评测方法,利用近似标签和模型嵌入向量,在显著降低测试成本的同时揭示了商业服务中存在的种族与性别偏见。
English: This paper introduces a rapid, annotation-free method for benchmarking face recognition systems using approximate labels and model embeddings, significantly reducing testing costs while revealing demographic biases in commercial services.

Authors:Masatoshi Uehara, Xingyu Su, Yulai Zhao, Xiner Li, Aviv Regev, Shuiwang Ji, Sergey Levine, Tommaso Biancalani
Title: Reward-Guided Iterative Refinement in Diffusion Models at Test-Time with Applications to Protein and DNA Design
Abstract:
To fully leverage the capabilities of diffusion models, we are often interested in optimizing downstream reward functions during inference. While numerous algorithms for reward-guided generation have been recently proposed due to their significance, current approaches predominantly focus on single-shot generation, transitioning from fully noised to denoised states. We propose a novel framework for inference-time reward optimization with diffusion models inspired by evolutionary algorithms. Our approach employs an iterative refinement process consisting of two steps in each iteration: noising and reward-guided denoising. This sequential refinement allows for the gradual correction of errors introduced during reward optimization. Besides, we provide a theoretical guarantee for our framework. Finally, we demonstrate its superior empirical performance in protein and cell-type-specific regulatory DNA design. The code is available at \href{https://github.com/masa-ue/ProDifEvo-Refinement}{https://github.com/masa-ue/ProDifEvo-Refinement}.
中文摘要:本文提出了一种受进化算法启发的创新框架,通过在扩散模型推理过程中采用迭代式的加噪和奖励引导去噪步骤,实现误差的渐进修正,并在生物序列设计中展现出卓越性能。
English Summary: This paper introduces a novel evolutionary algorithm-inspired framework for optimizing reward functions during diffusion model inference, featuring iterative noising and reward-guided denoising steps that enable gradual error correction and demonstrate superior performance in biological sequence design.

Authors:Thomas Froech, Olaf Wysocki, Yan Xia, Junyu Xie, Benedikt Schwab, Daniel Cremers, Thomas H. Kolbe
Title: FacaDiffy: Inpainting Unseen Facade Parts Using Diffusion Models
Abstract:
High-detail semantic 3D building models are frequently utilized in robotics, geoinformatics, and computer vision. One key aspect of creating such models is employing 2D conflict maps that detect openings' locations in building facades. Yet, in reality, these maps are often incomplete due to obstacles encountered during laser scanning. To address this challenge, we introduce FacaDiffy, a novel method for inpainting unseen facade parts by completing conflict maps with a personalized Stable Diffusion model. Specifically, we first propose a deterministic ray analysis approach to derive 2D conflict maps from existing 3D building models and corresponding laser scanning point clouds. Furthermore, we facilitate the inpainting of unseen facade objects into these 2D conflict maps by leveraging the potential of personalizing a Stable Diffusion model. To complement the scarcity of real-world training data, we also develop a scalable pipeline to produce synthetic conflict maps using random city model generators and annotated facade images. Extensive experiments demonstrate that FacaDiffy achieves state-of-the-art performance in conflict map completion compared to various inpainting baselines and increases the detection rate by $22\%$ when applying the completed conflict maps for high-definition 3D semantic building reconstruction. The code is be publicly available in the corresponding GitHub repository: https://github.com/ThomasFroech/InpaintingofUnseenFacadeObjects
中文摘要:FacaDiffy提出了一种利用个性化稳定扩散模型补全二维冲突地图的创新方法,用于高细节三维建筑重建,实现了最先进的性能,并将检测率提高了22%。
English Summary: FacaDiffy introduces a novel method using personalized Stable Diffusion models to complete incomplete 2D conflict maps for high-detail 3D building reconstruction, achieving state-of-the-art performance with a 22% detection rate improvement.

Authors:Zizhuo Zhang, Lijun Wu, Kaiyuan Gao, Jiangchao Yao, Tao Qin, Bo Han
Title: Fast and Accurate Blind Flexible Docking
Abstract:
Molecular docking that predicts the bound structures of small molecules (ligands) to their protein targets, plays a vital role in drug discovery. However, existing docking methods often face limitations: they either overlook crucial structural changes by assuming protein rigidity or suffer from low computational efficiency due to their reliance on generative models for structure sampling. To address these challenges, we propose FABFlex, a fast and accurate regression-based multi-task learning model designed for realistic blind flexible docking scenarios, where proteins exhibit flexibility and binding pocket sites are unknown (blind). Specifically, FABFlex's architecture comprises three specialized modules working in concert: (1) A pocket prediction module that identifies potential binding sites, addressing the challenges inherent in blind docking scenarios. (2) A ligand docking module that predicts the bound (holo) structures of ligands from their unbound (apo) states. (3) A pocket docking module that forecasts the holo structures of protein pockets from their apo conformations. Notably, FABFlex incorporates an iterative update mechanism that serves as a conduit between the ligand and pocket docking modules, enabling continuous structural refinements. This approach effectively integrates the three subtasks of blind flexible docking-pocket identification, ligand conformation prediction, and protein flexibility modeling-into a unified, coherent framework. Extensive experiments on public benchmark datasets demonstrate that FABFlex not only achieves superior effectiveness in predicting accurate binding modes but also exhibits a significant speed advantage (208 $\times$) compared to existing state-of-the-art methods. Our code is released at https://github.com/tmlr-group/FABFlex.
中文: FABFlex是一种快速准确的回归式多任务学习模型,通过整合口袋预测、配体对接和口袋对接模块,并采用迭代更新机制解决盲柔性对接难题,在预测精度上表现卓越且计算速度比现有方法快208倍。
English: FABFlex is a fast and accurate regression-based multi-task learning model that integrates pocket prediction, ligand docking, and pocket docking with an iterative update mechanism to address challenges in blind flexible docking, achieving superior accuracy and a 208× speed advantage over existing methods.

Authors:Zihao Zeng, Xuyao Huang, Boxiu Li, Zhijie Deng
Title: SIFT: Grounding LLM Reasoning in Contexts via Stickers
Abstract:
This paper identifies the misinterpretation of the context can be a significant issue during the reasoning process of large language models, spanning from smaller models like Llama3.2-3B-Instruct to cutting-edge ones like DeepSeek-R1. For example, in the phrase "10 dollars per kilo," LLMs might not recognize that "per" means "for each," leading to calculation errors. We introduce a novel, post-training approach called **Stick to the Facts (SIFT)** to tackle this. SIFT leverages increasing inference-time compute to ground LLM reasoning in contexts. At the core of SIFT lies the *Sticker*, which is generated by the model itself to explicitly emphasize the key information within the context. Given the curated Sticker, SIFT generates two predictions -- one from the original query and one from the query augmented with the Sticker. If they differ, the Sticker is sequentially refined via *forward* optimization (to better align the extracted facts with the query) and *inverse* generation (to conform with the model's inherent tendencies) for more faithful reasoning outcomes. Studies across diverse models (from 3B to 100B+) and benchmarks (e.g., GSM8K, MATH-500) reveal consistent performance improvements. Notably, SIFT improves the pass@1 accuracy of DeepSeek-R1 on AIME2024 from 78.33% to **85.67**%, establishing a new state-of-the-art in the open-source community. The code is available at https://github.com/zhijie-group/SIFT.
中文: 本文提出SIFT方法,通过模型自生成的"贴标"将推理过程锚定于上下文,有效解决了大型语言模型在推理中的语境误解问题,并在多个基准测试中显著提升了性能。
English: This paper introduces SIFT, a post-training method that uses self-generated "Stickers" to enhance LLM reasoning by grounding it in context, significantly improving performance across various models and benchmarks.

Authors:Boyu Chen, Zirui Guo, Zidan Yang, Yuluo Chen, Junze Chen, Zhenghao Liu, Chuan Shi, Cheng Yang
Title: PathRAG: Pruning Graph-based Retrieval Augmented Generation with Relational Paths
Abstract:
Retrieval-augmented generation (RAG) improves the response quality of large language models (LLMs) by retrieving knowledge from external databases. Typical RAG approaches split the text database into chunks, organizing them in a flat structure for efficient searches. To better capture the inherent dependencies and structured relationships across the text database, researchers propose to organize textual information into an indexing graph, known asgraph-based RAG. However, we argue that the limitation of current graph-based RAG methods lies in the redundancy of the retrieved information, rather than its insufficiency. Moreover, previous methods use a flat structure to organize retrieved information within the prompts, leading to suboptimal performance. To overcome these limitations, we propose PathRAG, which retrieves key relational paths from the indexing graph, and converts these paths into textual form for prompting LLMs. Specifically, PathRAG effectively reduces redundant information with flow-based pruning, while guiding LLMs to generate more logical and coherent responses with path-based prompting. Experimental results show that PathRAG consistently outperforms state-of-the-art baselines across six datasets and five evaluation dimensions. The code is available at the following link: https://github.com/BUPT-GAMMA/PathRAG
Chinese: PathRAG通过从图索引中提取关键关系路径,利用基于流的剪枝减少冗余信息,并采用基于路径的提示方法提升生成响应的逻辑性,在多个数据集和评估维度上均优于现有先进方法。
English: PathRAG enhances retrieval-augmented generation by extracting key relational paths from a graph index, reducing redundancy through flow-based pruning and improving response coherence with path-based prompting, outperforming existing methods across multiple datasets and evaluation dimensions.

Authors:Zhe Huang, Shuo Wang, Yongcai Wang, Lei Wang
Title: CoDiff: Conditional Diffusion Model for Collaborative 3D Object Detection
Abstract:
Collaborative 3D object detection holds significant importance in the field of autonomous driving, as it greatly enhances the perception capabilities of each individual agent by facilitating information exchange among multiple agents. However, in practice, due to pose estimation errors and time delays, the fusion of information across agents often results in feature representations with spatial and temporal noise, leading to detection errors. Diffusion models naturally have the ability to denoise noisy samples to the ideal data, which motivates us to explore the use of diffusion models to address the noise problem between multi-agent systems. In this work, we propose CoDiff, a novel robust collaborative perception framework that leverages the potential of diffusion models to generate more comprehensive and clearer feature representations. To the best of our knowledge, this is the first work to apply diffusion models to multi-agent collaborative perception. Specifically, we project high-dimensional feature map into the latent space of a powerful pre-trained autoencoder. Within this space, individual agent information serves as a condition to guide the diffusion model's sampling. This process denoises coarse feature maps and progressively refines the fused features. Experimental study on both simulated and real-world datasets demonstrates that the proposed framework CoDiff consistently outperforms existing relevant methods in terms of the collaborative object detection performance, and exhibits highly desired robustness when the pose and delay information of agents is with high-level noise. The code is released at https://github.com/HuangZhe885/CoDiff
中文摘要:CoDiff是一种创新的协作感知框架,利用扩散模型对多智能体系统中的噪声特征进行去噪和优化,显著提升了自动驾驶中3D物体检测的准确性和对姿态误差及延迟的鲁棒性。
English Summary: Collaborative 3D object detection in autonomous driving is enhanced by CoDiff, a novel framework that uses diffusion models to denoise and refine multi-agent features, improving detection accuracy and robustness against pose and delay errors.

Authors:Zhiyu Zhu, Zhibo Jin, Jiayu Zhang, Nan Yang, Jiahao Huang, Jianlong Zhou, Fang Chen
Title: Narrowing Information Bottleneck Theory for Multimodal Image-Text Representations Interpretability
Abstract:
The task of identifying multimodal image-text representations has garnered increasing attention, particularly with models such as CLIP (Contrastive Language-Image Pretraining), which demonstrate exceptional performance in learning complex associations between images and text. Despite these advancements, ensuring the interpretability of such models is paramount for their safe deployment in real-world applications, such as healthcare. While numerous interpretability methods have been developed for unimodal tasks, these approaches often fail to transfer effectively to multimodal contexts due to inherent differences in the representation structures. Bottleneck methods, well-established in information theory, have been applied to enhance CLIP's interpretability. However, they are often hindered by strong assumptions or intrinsic randomness. To overcome these challenges, we propose the Narrowing Information Bottleneck Theory, a novel framework that fundamentally redefines the traditional bottleneck approach. This theory is specifically designed to satisfy contemporary attribution axioms, providing a more robust and reliable solution for improving the interpretability of multimodal models. In our experiments, compared to state-of-the-art methods, our approach enhances image interpretability by an average of 9%, text interpretability by an average of 58.83%, and accelerates processing speed by 63.95%. Our code is publicly accessible at https://github.com/LMBTough/NIB.
Chinese: 提出的“窄化信息瓶颈理论”通过重新定义传统瓶颈方法,显著提升了CLIP模型的可解释性,在图像和文本理解方面取得重大改进,同时大幅提高了处理速度。
English: The proposed Narrowing Information Bottleneck Theory enhances CLIP's interpretability by redefining traditional bottleneck methods, achieving significant improvements in image and text interpretability along with faster processing speeds.

Authors:Insu Han, Zeliang Zhang, Zhiyuan Wang, Yifan Zhu, Susan Liang, Jiani Liu, Haiting Lin, Mingjie Zhao, Chenliang Xu, Kun Wan, Wentian Zhao
Title: CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance across diverse applications. However, their computational overhead during deployment remains a critical bottleneck. While Key-Value (KV) caching effectively trades memory for computation to enhance inference efficiency, the growing memory footprint from extensive KV caches significantly reduces throughput and restricts prolonged deployment on memory-constrained GPU devices. To address this challenge, we propose CalibQuant, a simple yet highly effective visual quantization strategy that drastically reduces both memory and computational overhead. Specifically, CalibQuant introduces an extreme 1-bit quantization scheme, complemented by novel post-scaling and calibration techniques tailored to the intrinsic patterns of KV caches, thereby ensuring high efficiency without compromising model performance. Leveraging Triton for runtime optimization, we achieve a 10x throughput increase on InternVL models. Our method is designed to be plug-and-play, seamlessly integrating with various existing MLLMs without requiring architectural changes. Extensive experiments confirm that our approach significantly reduces memory usage while maintaining computational efficiency and preserving multimodal capabilities. Codes are available at https://github.com/insuhan/calibquant.
中文:CalibQuant提出了一种极端的1位量化策略,通过定制化校准技术显著降低多模态大语言模型的内存和计算开销,在保持性能的同时实现了10倍吞吐量提升,且无需修改模型架构。
English: CalibQuant introduces an extreme 1-bit quantization strategy with specialized calibration techniques to drastically reduce memory and computational overhead in Multimodal Large Language Models while maintaining performance, achieving a 10x throughput increase without architectural modifications.

Authors:Mang Ye, Xuankun Rong, Wenke Huang, Bo Du, Nenghai Yu, Dacheng Tao
Title: A Survey of Safety on Large Vision-Language Models: Attacks, Defenses and Evaluations
Abstract:
With the rapid advancement of Large Vision-Language Models (LVLMs), ensuring their safety has emerged as a crucial area of research. This survey provides a comprehensive analysis of LVLM safety, covering key aspects such as attacks, defenses, and evaluation methods. We introduce a unified framework that integrates these interrelated components, offering a holistic perspective on the vulnerabilities of LVLMs and the corresponding mitigation strategies. Through an analysis of the LVLM lifecycle, we introduce a classification framework that distinguishes between inference and training phases, with further subcategories to provide deeper insights. Furthermore, we highlight limitations in existing research and outline future directions aimed at strengthening the robustness of LVLMs. As part of our research, we conduct a set of safety evaluations on the latest LVLM, Deepseek Janus-Pro, and provide a theoretical analysis of the results. Our findings provide strategic recommendations for advancing LVLM safety and ensuring their secure and reliable deployment in high-stakes, real-world applications. This survey aims to serve as a cornerstone for future research, facilitating the development of models that not only push the boundaries of multimodal intelligence but also adhere to the highest standards of security and ethical integrity. Furthermore, to aid the growing research in this field, we have created a public repository to continuously compile and update the latest work on LVLM safety: https://github.com/XuankunRong/Awesome-LVLM-Safety .
中文摘要:本综述通过统一框架全面分析大型视觉语言模型的安全性,涵盖攻击、防御和评估方法,同时提供对Deepseek Janus-Pro的安全评估,并指出加强模型鲁棒性的未来研究方向。
English Summary: This survey comprehensively analyzes the safety of Large Vision-Language Models by examining attacks, defenses, and evaluation methods through a unified framework, while also providing safety assessments of Deepseek Janus-Pro and outlining future research directions to enhance model robustness.

Authors:Dong Chen, Zhengqing Hu, Peiguang Fan, Yueting Zhuang, Yafei Li, Qidong Liu, Xiaoheng Jiang, Mingliang Xu
Title: KKA: Improving Vision Anomaly Detection through Anomaly-related Knowledge from Large Language Models
Abstract:
Vision anomaly detection, particularly in unsupervised settings, often struggles to distinguish between normal samples and anomalies due to the wide variability in anomalies. Recently, an increasing number of studies have focused on generating anomalies to help detectors learn more effective boundaries between normal samples and anomalies. However, as the generated anomalies are often derived from random factors, they frequently lack realism. Additionally, randomly generated anomalies typically offer limited support in constructing effective boundaries, as most differ substantially from normal samples and lie far from the boundary. To address these challenges, we propose Key Knowledge Augmentation (KKA), a method that extracts anomaly-related knowledge from large language models (LLMs). More specifically, KKA leverages the extensive prior knowledge of LLMs to generate meaningful anomalies based on normal samples. Then, KKA classifies the generated anomalies as easy anomalies and hard anomalies according to their similarity to normal samples. Easy anomalies exhibit significant differences from normal samples, whereas hard anomalies closely resemble normal samples. KKA iteratively updates the generated anomalies, and gradually increasing the proportion of hard anomalies to enable the detector to learn a more effective boundary. Experimental results show that the proposed method significantly improves the performance of various vision anomaly detectors while maintaining low generation costs. The code for CMG can be found at https://github.com/Anfeather/KKA.
中文: 提出的关键知识增强方法利用大型语言模型生成与正常样本高度相似的逼真困难异常,通过迭代优化帮助视觉异常检测器学习更有效的边界,从而显著提升检测性能。
English: The proposed Key Knowledge Augmentation (KKA) method leverages large language models to generate realistic hard anomalies that closely resemble normal samples, iteratively refining them to help vision anomaly detectors learn more effective boundaries and significantly improve performance.

Authors:Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han
Title: LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
Abstract:
Large language models (LLMs) have shown remarkable potential in processing long sequences and complex reasoning tasks, yet efficiently serving these models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context and reasoning capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is released at https://github.com/mit-han-lab/omniserve.
中文: LServe系统通过混合稀疏注意力机制高效加速长序列大语言模型服务,在预填充阶段提速最高达2.9倍、解码阶段提速1.3-2.1倍,同时通过跳过次要令牌计算和动态剪枝KV缓存页保持了长上下文处理精度。
English: LServe is an efficient system that accelerates long-sequence LLM serving through hybrid sparse attention, achieving up to 2.9x faster prefilling and 1.3-2.1x faster decoding while maintaining accuracy by skipping computations on less important tokens and dynamically pruning KV pages.

Authors:Sara Ghaboura, Ketan More, Ritesh Thawkar, Wafa Alghallabi, Omkar Thawakar, Fahad Shahbaz Khan, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Title: Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
Abstract:
Understanding historical and cultural artifacts demands human expertise and advanced computational techniques, yet the process remains complex and time-intensive. While large multimodal models offer promising support, their evaluation and improvement require a standardized benchmark. To address this, we introduce TimeTravel, a benchmark of 10,250 expert-verified samples spanning 266 distinct cultures across 10 major historical regions. Designed for AI-driven analysis of manuscripts, artworks, inscriptions, and archaeological discoveries, TimeTravel provides a structured dataset and robust evaluation framework to assess AI models' capabilities in classification, interpretation, and historical comprehension. By integrating AI with historical research, TimeTravel fosters AI-powered tools for historians, archaeologists, researchers, and cultural tourists to extract valuable insights while ensuring technology contributes meaningfully to historical discovery and cultural heritage preservation. We evaluate contemporary AI models on TimeTravel, highlighting their strengths and identifying areas for improvement. Our goal is to establish AI as a reliable partner in preserving cultural heritage, ensuring that technological advancements contribute meaningfully to historical discovery. Our code is available at: \url{https://github.com/mbzuai-oryx/TimeTravel}.
中文摘要:TimeTravel基准提供了一个涵盖多文化历史文物的标准化数据集和评估框架,旨在推动人工智能在文化遗产分析和历史研究中的应用,确保技术进步对历史发现与保护做出实质性贡献。
English Summary: The TimeTravel benchmark introduces a comprehensive dataset and evaluation framework to enhance AI models' capabilities in analyzing historical artifacts, aiming to integrate AI as a reliable tool for cultural heritage preservation and historical research.

Authors:Yuming Yang, Jiang Zhong, Li Jin, Jingwang Huang, Jingpeng Gao, Qing Liu, Yang Bai, Jingyuan Zhang, Rui Jiang, Kaiwen Wei
Title: Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework
Abstract:
Multimodal Retrieval-Augmented Generation (MRAG) enhances reasoning capabilities by integrating external knowledge. However, existing benchmarks primarily focus on simple image-text interactions, overlooking complex visual formats like charts that are prevalent in real-world applications. In this work, we introduce a novel task, Chart-based MRAG, to address this limitation. To semi-automatically generate high-quality evaluation samples, we propose CHARt-based document question-answering GEneration (CHARGE), a framework that produces evaluation data through structured keypoint extraction, crossmodal verification, and keypoint-based generation. By combining CHARGE with expert validation, we construct Chart-MRAG Bench, a comprehensive benchmark for chart-based MRAG evaluation, featuring 4,738 question-answering pairs across 8 domains from real-world documents. Our evaluation reveals three critical limitations in current approaches: (1) unified multimodal embedding retrieval methods struggles in chart-based scenarios, (2) even with ground-truth retrieval, state-of-the-art MLLMs achieve only 58.19% Correctness and 73.87% Coverage scores, and (3) MLLMs demonstrate consistent text-over-visual modality bias during Chart-based MRAG reasoning. The CHARGE and Chart-MRAG Bench are released at https://github.com/Nomothings/CHARGE.git.
中文摘要:本文提出了基于图表的多模态检索增强生成新任务,通过CHARGE框架生成评估数据并构建Chart-MRAG基准测试集,揭示了现有方法在图表检索性能差和文本模态偏好等关键缺陷。
English Summary: This paper introduces Chart-based MRAG, a novel task addressing the gap in multimodal benchmarks by proposing the CHARGE framework to generate evaluation data and constructing Chart-MRAG Bench, revealing critical limitations in current methods including poor chart retrieval performance and text-biased reasoning.

Authors:Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun
Title: FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
Abstract:
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12$\times$ speedup over the state-of-the-art speculative sampling method EAGLE-2. Code available at https://github.com/thunlp/FR-Spec.
Chinese: FR-Spec提出了一种基于频率排序的推测采样框架,通过压缩词汇空间来加速大语言模型生成,在保持输出质量的同时,将计算开销降低75%,并比现有最优方法提速1.12倍。
English: FR-Spec introduces a frequency-ranked speculative sampling framework that accelerates large language model generation by compressing vocabulary space, achieving a 75% reduction in computational overhead and a 1.12× speedup over existing methods while maintaining output quality.

Authors:Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica
Title: Prompt-to-Leaderboard
Abstract:
Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt. The core idea is to train an LLM taking natural language prompts as input to output a vector of Bradley-Terry coefficients which are then used to predict the human preference vote. The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. Furthermore, our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves. In January 2025, the router we trained based on this methodology achieved the #1 spot on the Chatbot Arena leaderboard. Our code is available on GitHub at https://github.com/lmarena/p2l.
Chinese: 作者提出了Prompt-to-Leaderboard(P2L)方法,通过训练大语言模型根据自然语言提示预测人类偏好,生成针对特定提示的排行榜,从而实现个性化模型评估与查询路由,该方法优于传统聚合指标,并于2025年1月在Chatbot Arena排行榜上获得首位。
English: The authors introduce Prompt-to-Leaderboard (P2L), a method that generates prompt-specific leaderboards by training an LLM to predict human preferences, enabling personalized model evaluation and routing, which outperforms traditional aggregated metrics and achieved top ranking on Chatbot Arena in January 2025.

Authors:Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu
Title: GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks
Abstract:
Large Language Models (LLMs) have shown great promise in tool-making, yet existing frameworks often struggle to efficiently construct reliable toolsets and are limited to single-task settings. To address these challenges, we propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that dynamically constructs and evolves a hierarchical graph of reusable tools across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, TabMWP). Our results show that GATE achieves up to 4.3x faster milestone completion in Minecraft compared to the previous SOTA, and provides an average improvement of 9.23% over existing tool-making methods in code generation tasks and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, balancing tool quantity, complexity, and functionality while maintaining high efficiency. Code and data are available at \url{https://github.com/ayanami2003/GATE}.
中文: GATE框架通过动态构建和演化分层工具图,在多场景任务中实现了比现有方法更优异的性能表现,在代码生成和智能体任务中分别获得平均9.23%和10.03%的性能提升。
English: The GATE framework dynamically constructs and evolves hierarchical tool graphs across multiple scenarios, achieving significant performance improvements in open-ended, agent-based, and code generation tasks compared to existing methods.

Authors:Alexia Jolicoeur-Martineau, Yan Zhang, Boris Knyazev, Aristide Baratin, Cheng-Hao Liu
Title: Generating $π$-Functional Molecules Using STGG+ with Active Learning
Abstract:
Generating novel molecules with out-of-distribution properties is a major challenge in molecular discovery. While supervised learning methods generate high-quality molecules similar to those in a dataset, they struggle to generalize to out-of-distribution properties. Reinforcement learning can explore new chemical spaces but often conducts 'reward-hacking' and generates non-synthesizable molecules. In this work, we address this problem by integrating a state-of-the-art supervised learning method, STGG+, in an active learning loop. Our approach iteratively generates, evaluates, and fine-tunes STGG+ to continuously expand its knowledge. We denote this approach STGG+AL. We apply STGG+AL to the design of organic $π$-functional materials, specifically two challenging tasks: 1) generating highly absorptive molecules characterized by high oscillator strength and 2) designing absorptive molecules with reasonable oscillator strength in the near-infrared (NIR) range. The generated molecules are validated and rationalized in-silico with time-dependent density functional theory. Our results demonstrate that our method is highly effective in generating novel molecules with high oscillator strength, contrary to existing methods such as reinforcement learning (RL) methods. We open-source our active-learning code along with our Conjugated-xTB dataset containing 2.9 million $π$-conjugated molecules and the function for approximating the oscillator strength and absorption wavelength (based on sTDA-xTB).
中文: 本研究提出STGG+AL方法,通过将监督学习与主动学习循环结合,能有效生成具有高振荡器强度的新型分子,其性能优于传统强化学习方法。
English: The study introduces STGG+AL, an active learning approach that integrates supervised learning with iterative fine-tuning to effectively generate novel molecules with high oscillator strength, outperforming traditional reinforcement learning methods.

Authors:Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li
Title: LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
Abstract:
Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, a SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high-fidelity to the input images, we employ Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V
中文:现有大型视觉语言模型因缺乏长输出示例而难以生成连贯长文本,我们通过LongWriter-V-22k数据集和IterDPO方法提升了生成长度与保真度,使7B参数模型在性能上超越GPT-4o等大型专有模型。
English: Existing LVLMs struggle with long coherent outputs due to lacking long output examples in SFT, so we introduce LongWriter-V-22k dataset and IterDPO method to enhance generation fidelity and length, achieving superior performance with a 7B model over larger proprietary models.

Authors:Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, Aliaksandr Siarohin
Title: Improving the Diffusability of Autoencoders
Abstract:
Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational burden of the diffusion process. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in the autoencoders with a large bottleneck channel size. We hypothesize that this high-frequency component interferes with the coarse-to-fine nature of the diffusion synthesis process and hinders the generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to 20K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K $256^2$ and FVD by at least 44% for video generation on Kinetics-700 $17 \times 256^2$. The source code is available at https://github.com/snap-research/diffusability.
中文: 潜在扩散模型因自编码器潜在空间中的过度高频分量而影响生成质量,通过提出的尺度等变性正则化方法,仅需少量调整即可显著提升图像和视频生成的性能指标。
English: Latent diffusion models face quality issues due to high-frequency components in autoencoder latent spaces, which are mitigated by a proposed scale equivariance regularization that significantly improves image and video generation metrics with minimal adjustments.

Authors:Danni Liu, Jan Niehues
Title: Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs
Abstract:
While large language models demonstrate remarkable capabilities at task-specific applications through fine-tuning, extending these benefits across diverse languages is essential for broad accessibility. However, effective cross-lingual transfer is hindered by LLM performance gaps across languages and the scarcity of fine-tuning data in many languages. Through analysis of LLM internal representations from over 1,000+ language pairs, we discover that middle layers exhibit the strongest potential for cross-lingual alignment. Building on this finding, we propose a middle-layer alignment objective integrated into task-specific training. Our experiments on slot filling, machine translation, and structured text generation show consistent improvements in cross-lingual transfer, especially to lower-resource languages. The method is robust to the choice of alignment languages and generalizes to languages unseen during alignment. Furthermore, we show that separately trained alignment modules can be merged with existing task-specific modules, improving cross-lingual capabilities without full re-training. Our code is publicly available (https://github.com/dannigt/mid-align).
中文摘要:本研究提出一种中间层对齐方法,通过利用来自1000多种语言对的内部表征来增强语言模型的跨语言迁移能力,在多项任务中尤其是低资源语言上展现出持续改进效果。
English Summary: This study introduces a middle-layer alignment method that enhances cross-lingual transfer in language models by leveraging internal representations from over 1,000 language pairs, demonstrating consistent improvements across multiple tasks especially for lower-resource languages.

Authors:Maor Mizrachi, Barak Raveh, Elad Steinberg
Title: MadVoro: Parallel Construction of Voronoi Diagrams in Distributed Memory Systems
Abstract:
Voronoi diagrams are essential geometrical structures with numerous applications, particularly astrophysics-driven finite volume methods. While serial algorithms for constructing these entities are well-established, parallel construction remains challenging. This is especially true in distributed memory systems, where each host manages only a subset of the input points. This process requires redistributing points across hosts and accurately computing the corresponding Voronoi cells. In this paper, we introduce a new distributed construction algorithm, which is implemented in our open-source C++ 3-dimensional Voronoi construction framework. Our approach leverages Delaunay triangulation as an intermediate step, which is then transformed into a Voronoi diagram. We introduce the algorithms we implemented for the precise construction and our load-balancing approach and compare the running time with other state-of-the-art frameworks. MadVoro is a versatile tool that can be applied in various scientific domains, such as mesh decomposition, computational physics, chemistry, and machine learning.
Chinese: 本文提出了一种新的分布式算法,通过Delaunay三角剖分作为中间步骤来构建三维Voronoi图,并在开源框架MadVoro中实现了负载均衡技术,同时与现有方法进行了性能对比。
English: This paper presents a new distributed algorithm for constructing 3D Voronoi diagrams using Delaunay triangulation as an intermediate step, implemented in the open-source framework MadVoro with load-balancing techniques and performance comparisons to existing methods.

Authors:Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Yang Liu, Jing Lin, Yiwu Yao, Rongrong Ji
Title: Dynamic Low-Rank Sparse Adaptation for Large Language Models
Abstract:
Despite the efficacy of network sparsity in alleviating the deployment strain of Large Language Models (LLMs), it endures significant performance degradation. Applying Low-Rank Adaptation (LoRA) to fine-tune the sparse LLMs offers an intuitive approach to counter this predicament, while it holds shortcomings include: 1) The inability to integrate LoRA weights into sparse LLMs post-training, and 2) Insufficient performance recovery at high sparsity ratios. In this paper, we introduce dynamic Low-rank Sparse Adaptation (LoSA), a novel method that seamlessly integrates low-rank adaptation into LLM sparsity within a unified framework, thereby enhancing the performance of sparse LLMs without increasing the inference latency. In particular, LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning, thus guaranteeing that the LoRA module can be integrated into the sparse LLMs post-training. Besides, LoSA leverages Representation Mutual Information (RMI) as an indicator to determine the importance of layers, thereby efficiently determining the layer-wise sparsity rates during fine-tuning. Predicated on this, LoSA adjusts the rank of the LoRA module based on the variability in layer-wise reconstruction errors, allocating an appropriate fine-tuning for each layer to reduce the output discrepancies between dense and sparse LLMs. Extensive experiments tell that LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inferential burden. For example, LoSA reduced the perplexity of sparse LLaMA-2-7B by 68.73 and increased zero-shot accuracy by 16.32$\%$, achieving a 2.60$\times$ speedup on CPU and 2.23$\times$ speedup on GPU, requiring only 45 minutes of fine-tuning on a single NVIDIA A100 80GB GPU. Code is available at https://github.com/wzhuang-xmu/LoSA.
中文: LoSA是一种创新方法,通过将低秩适应与大型语言模型稀疏化在统一框架中结合,在微调时动态调整稀疏度和秩,从而提升性能且不增加推理延迟。
English: LoSA is a novel method that integrates low-rank adaptation with LLM sparsity in a unified framework, enhancing performance without increasing inference latency by dynamically adjusting sparsity and rank during fine-tuning.

Authors:Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su
Title: From RAG to Memory: Non-Parametric Continual Learning for Large Language Models
Abstract:
Our ability to continuously acquire, organize, and leverage knowledge is a key feature of human intelligence that AI systems must approximate to unlock their full potential. Given the challenges in continual learning with large language models (LLMs), retrieval-augmented generation (RAG) has become the dominant way to introduce new information. However, its reliance on vector retrieval hinders its ability to mimic the dynamic and interconnected nature of human long-term memory. Recent RAG approaches augment vector embeddings with various structures like knowledge graphs to address some of these gaps, namely sense-making and associativity. However, their performance on more basic factual memory tasks drops considerably below standard RAG. We address this unintended deterioration and propose HippoRAG 2, a framework that outperforms standard RAG comprehensively on factual, sense-making, and associative memory tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in HippoRAG and enhances it with deeper passage integration and more effective online use of an LLM. This combination pushes this RAG system closer to the effectiveness of human long-term memory, achieving a 7% improvement in associative memory tasks over the state-of-the-art embedding model while also exhibiting superior factual knowledge and sense-making memory capabilities. This work paves the way for non-parametric continual learning for LLMs. Code and data are available at https://github.com/OSU-NLP-Group/HippoRAG.
Chinese: HippoRAG 2框架通过深度融合段落分析和优化大语言模型在线使用,在事实记忆、意义构建和联想记忆任务上全面超越标准检索增强生成方法,为实现大语言模型的非参数持续学习开辟了新途径。
English: The HippoRAG 2 framework significantly outperforms standard retrieval-augmented generation by integrating deeper passage analysis and enhanced LLM utilization, achieving superior performance in factual, sense-making, and associative memory tasks while advancing non-parametric continual learning for AI systems.

Authors:Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, Xiaohua Zhai
Title: SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Abstract:
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe -- this includes captioning-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input's native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To allow users to trade off inference cost with performance, we release model checkpoints at four sizes: ViT-B (86M), L (303M), So400m (400M), and g (1B).
中文: SigLIP 2 作为新一代多语言视觉语言编码器,融合了多种先进技术,在核心能力、定位任务和多语言公平性方面均超越前代,并提供多种可扩展的模型尺寸。
English: SigLIP 2 is an enhanced multilingual vision-language encoder that integrates multiple advanced techniques to surpass its predecessor in core capabilities, localization tasks, and multilingual fairness while offering scalable model sizes.

Authors:Jeonghun Baek, Akiko Aizawa, Kiyoharu Aizawa
Title: Harnessing PDF Data for Improving Japanese Large Multimodal Models
Abstract:
Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark. Our results demonstrate substantial improvements, with performance gains ranging from 2.1% to 13.8% on Heron-Bench. Further analysis highlights the impact of PDF-derived data on various factors, such as model size and language models, reinforcing its value as a multimodal resource for Japanese LMMs.
Chinese: 本研究开发了一种从日语PDF中自动提取图文对的流程,利用这一未充分开发的资源显著提升了日语大型多模态模型的性能,在基准测试中实现了最高13.8%的性能提升。
English: This study introduces an automated pipeline to extract image-text pairs from Japanese PDFs, significantly enhancing the performance of Japanese Large Multimodal Models by leveraging this underutilized resource and achieving up to 13.8% improvement on benchmarks.

Authors:Priyanka Kargupta, Ishika Agarwal, Tal August, Jiawei Han
Title: Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis
Abstract:
With the exponential growth of research facilitated by modern technology and improved accessibility, scientific discoveries have become increasingly fragmented within and across fields. This makes it challenging to assess the significance, novelty, incremental findings, and equivalent ideas between related works, particularly those from different research communities. Large language models (LLMs) have recently demonstrated strong quantitative and qualitative reasoning abilities, and multi-agent LLM debates have shown promise in handling complex reasoning tasks by exploring diverse perspectives and reasoning paths. Inspired by this, we introduce Tree-of-Debate (ToD), a framework which converts scientific papers into LLM personas that debate their respective novelties. To emphasize structured, critical reasoning rather than focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling fine-grained analysis of independent novelty arguments within scholarly articles. Through experiments on scientific literature across various domains, evaluated by expert researchers, we demonstrate that ToD generates informative arguments, effectively contrasts papers, and supports researchers in their literature review.
中文摘要:Tree-of-Debate框架将科学论文转化为大语言模型角色进行结构化辩论,通过动态构建辩论树来分析论文的创新性并对比研究成果,有效辅助研究者进行跨领域的文献综述。
English Summary: The Tree-of-Debate framework transforms scientific papers into LLM personas that engage in structured debates to analyze novelty and contrast findings, aiding researchers in literature reviews across various domains.

Authors:Maya Varma, Ashwin Kumar, Rogier van der Sluijs, Sophie Ostmeier, Louis Blankemeier, Pierre Chambon, Christian Bluethgen, Jip Prince, Curtis Langlotz, Akshay Chaudhari
Title: MedVAE: Efficient Automated Interpretation of Medical Images with Large-Scale Generalizable Autoencoders
Abstract:
Medical images are acquired at high resolutions with large fields of view in order to capture fine-grained features necessary for clinical decision-making. Consequently, training deep learning models on medical images can incur large computational costs. In this work, we address the challenge of downsizing medical images in order to improve downstream computational efficiency while preserving clinically-relevant features. We introduce MedVAE, a family of six large-scale 2D and 3D autoencoders capable of encoding medical images as downsized latent representations and decoding latent representations back to high-resolution images. We train MedVAE autoencoders using a novel two-stage training approach with 1,052,730 medical images. Across diverse tasks obtained from 20 medical image datasets, we demonstrate that (1) utilizing MedVAE latent representations in place of high-resolution images when training downstream models can lead to efficiency benefits (up to 70x improvement in throughput) while simultaneously preserving clinically-relevant features and (2) MedVAE can decode latent representations back to high-resolution images with high fidelity. Our work demonstrates that large-scale, generalizable autoencoders can help address critical efficiency challenges in the medical domain. Our code is available at https://github.com/StanfordMIMI/MedVAE.
中文: MedVAE提出了一系列自编码器,可将医学图像压缩为高效潜在表征,在保持临床特征和高保真重建的同时,实现高达70倍的计算吞吐量提升。
English: MedVAE introduces a family of autoencoders that compress medical images into efficient latent representations, enabling up to 70x computational throughput improvement while preserving clinical features and high-fidelity reconstruction.

Authors:Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun
Title: TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
Abstract:
Triton, a high-level Python-like language designed for building efficient GPU kernels, is widely adopted in deep learning frameworks due to its portability, flexibility, and accessibility. However, programming and parallel optimization still require considerable trial and error from Triton developers. Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate accurate, performance-optimized Triton code, as they lack awareness of its specifications and the complexities of GPU programming. More critically, there is an urgent need for systematic evaluations tailored to Triton. In this work, we introduce TritonBench, the first comprehensive benchmark for Triton operator generation. TritonBench features two evaluation channels: a curated set of 184 real-world operators from GitHub and a collection of operators aligned with PyTorch interfaces. Unlike conventional code benchmarks prioritizing functional correctness, TritonBench also profiles efficiency performance on widely deployed GPUs aligned with industry applications. Our study reveals that current state-of-the-art code LLMs struggle to generate efficient Triton operators, highlighting a significant gap in high-performance code generation. TritonBench will be available at https://github.com/thunlp/TritonBench.
中文: TritonBench作为首个全面的Triton代码生成基准,不仅评估功能正确性还关注性能效率,揭示了当前大型语言模型在生成优化GPU算子方面存在显著不足。
English: TritonBench is introduced as the first comprehensive benchmark for evaluating Triton code generation, focusing on both functional correctness and efficiency performance, revealing that current LLMs struggle to produce optimized GPU operators.

Authors:Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, Xiangyu Yue
Title: HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States
Abstract:
The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focuses on post-hoc alignment techniques, the underlying safety mechanisms within LVLMs remain largely unexplored. In this work , we investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts, which can be leveraged to detect and mitigate adversarial inputs without requiring extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety. Experimental results show that {HiddenDetect} surpasses state-of-the-art methods in detecting jailbreak attacks against LVLMs. By utilizing intrinsic safety-aware patterns, our method provides an efficient and scalable solution for strengthening LVLM robustness against multimodal threats. Our code will be released publicly at https://github.com/leigest519/HiddenDetect.
中文: 研究发现大型视觉语言模型在处理不安全内容时会产生独特的内部激活模式,据此开发的HiddenDetect框架无需调优即可利用这些模式有效检测和防御多模态越狱攻击。
English: This study reveals that large vision-language models exhibit distinct internal activation patterns when processing unsafe content, leading to the development of HiddenDetect—a tuning-free framework that leverages these patterns to effectively detect and mitigate multimodal jailbreak attacks.

Authors:Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, Chenyan Xiong
Title: Group-Level Data Selection for Efficient Pretraining
Abstract:
In this paper, we introduce Group-MATES, an efficient group-level data selection approach to optimize the speed-quality frontier of language model pretraining. Specifically, Group-MATES parameterizes costly group-level selection with a relational data influence model. To train this model, we sample training trajectories of the language model and collect oracle data influences alongside. The relational data influence model approximates the oracle data influence by weighting individual influence with relationships among training data. To enable efficient selection with our relational data influence model, we partition the dataset into small clusters using relationship weights and select data within each cluster independently. Experiments on DCLM 400M-4x, 1B-1x, and 3B-1x show that Group-MATES achieves 3.5%-9.4% relative performance gains over random selection across 22 downstream tasks, nearly doubling the improvements achieved by state-of-the-art individual data selection baselines. Furthermore, Group-MATES reduces the number of tokens required to reach a certain downstream performance by up to 1.75x, substantially elevating the speed-quality frontier. Further analyses highlight the critical role of relationship weights in the relational data influence model and the effectiveness of our cluster-based inference. Our code is open-sourced at https://github.com/facebookresearch/Group-MATES.
Chinese: Group-MATES 提出了一种基于关系数据影响模型的高效群组级数据选择方法,用于优化语言模型预训练的速度与质量边界,实验表明其性能显著提升且所需标记数量最多减少1.75倍。
English: Group-MATES introduces an efficient group-level data selection method using a relational data influence model to optimize language model pretraining, achieving significant performance gains and reducing token requirements by up to 1.75x in experiments.

Authors:Daphne Cornelisse, Aarav Pandya, Kevin Joseph, Joseph Suárez, Eugene Vinitsky
Title: Building reliable sim driving agents by scaling self-play
Abstract:
Simulation agents are essential for designing and testing systems that interact with humans, such as autonomous vehicles (AVs). These agents serve various purposes, from benchmarking AV performance to stress-testing system limits, but all applications share one key requirement: reliability. To enable sound experimentation, a simulation agent must behave as intended. It should minimize actions that may lead to undesired outcomes, such as collisions, which can distort the signal-to-noise ratio in analyses. As a foundation for reliable sim agents, we propose scaling self-play to thousands of scenarios on the Waymo Open Motion Dataset under semi-realistic limits on human perception and control. Training from scratch on a single GPU, our agents solve almost the full training set within a day. They generalize to unseen test scenes, achieving a 99.8% goal completion rate with less than 0.8% combined collision and off-road incidents across 10,000 held-out scenarios. Beyond in-distribution generalization, our agents show partial robustness to out-of-distribution scenes and can be fine-tuned in minutes to reach near-perfect performance in such cases. We open-source the pre-trained agents and integrate them with a batched multi-agent simulator. Demonstrations of agent behaviors can be viewed at https://sites.google.com/view/reliable-sim-agents, and we open-source our agents at https://github.com/Emerge-Lab/gpudrive.
中文: 仿真智能体对于可靠的人机交互系统测试至关重要,我们通过自博弈训练的智能体实现了高目标达成率与极低事故率,展现出强大的泛化能力和快速适应性能。
English: Simulation agents are crucial for reliable human-interactive system testing, and our self-play trained agents achieve high goal completion with minimal incidents, demonstrating robust generalization and fast adaptability.

Authors:Zujie Liang, Feng Wei, Wujiang Xu, Lin Chen, Yuxi Qian, Xinhui Wu
Title: I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search
Abstract:
Recent advancements in large language models (LLMs) have shown remarkable potential in automating machine learning tasks. However, existing LLM-based agents often struggle with low-diversity and suboptimal code generation. While recent work has introduced Monte Carlo Tree Search (MCTS) to address these issues, limitations persist in the quality and diversity of thoughts generated, as well as in the scalar value feedback mechanisms used for node selection. In this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a novel approach that iteratively expands tree nodes through an introspective process that meticulously analyzes solutions and results from parent and sibling nodes. This facilitates a continuous refinement of the node in the search tree, thereby enhancing the overall decision-making process. Furthermore, we integrate a Large Language Model (LLM)-based value model to facilitate direct evaluation of each node's solution prior to conducting comprehensive computational rollouts. A hybrid rewarding mechanism is implemented to seamlessly transition the Q-value from LLM-estimated scores to actual performance scores. This allows higher-quality nodes to be traversed earlier. Applied to the various ML tasks, our approach demonstrates a 6% absolute improvement in performance compared to the strong open-source AutoML agents, showcasing its effectiveness in enhancing agentic AutoML systems. Resource available at https://github.com/jokieleung/I-MCTS
Chinese: 本研究提出内省蒙特卡洛树搜索(I-MCTS),通过节点内省优化和混合奖励机制提升自动化机器学习决策能力,相比现有AutoML代理实现性能绝对提升6%。
English: This study introduces Introspective Monte Carlo Tree Search (I-MCTS), which enhances decision-making in automated machine learning by refining nodes through introspection and employing a hybrid rewarding mechanism, achieving a 6% performance improvement over existing AutoML agents.

Authors:Gengxu Li, Tingyu Xia, Yi Chang, Yuan Wu
Title: Length-Controlled Margin-Based Preference Optimization without Reference Model
Abstract:
Direct Preference Optimization (DPO) is a widely adopted offline algorithm for preference-based reinforcement learning from human feedback (RLHF), designed to improve training simplicity and stability by redefining reward functions. However, DPO is hindered by several limitations, including length bias, memory inefficiency, and probability degradation. To address these challenges, we propose Length-Controlled Margin-Based Preference Optimization (LMPO), a more efficient and robust alternative. LMPO introduces a uniform reference model as an upper bound for the DPO loss, enabling a more accurate approximation of the original optimization objective. Additionally, an average log-probability optimization strategy is employed to minimize discrepancies between training and inference phases. A key innovation of LMPO lies in its Length-Controlled Margin-Based loss function, integrated within the Bradley-Terry framework. This loss function regulates response length while simultaneously widening the margin between preferred and rejected outputs. By doing so, it mitigates probability degradation for both accepted and discarded responses, addressing a significant limitation of existing methods. We evaluate LMPO against state-of-the-art preference optimization techniques on two open-ended large language models, Mistral and LLaMA3, across six conditional benchmarks. Our experimental results demonstrate that LMPO effectively controls response length, reduces probability degradation, and outperforms existing approaches. The code is available at https://github.com/gengxuli/LMPO.
中文: 作者提出了长度控制边际偏好优化(LMPO)方法,通过引入统一参考模型和创新的损失函数来克服直接偏好优化的缺陷,有效控制响应长度并减少概率衰减,在Mistral和LLaMA3模型上的多项测试中表现出更优性能。
English: The authors introduce Length-Controlled Margin-Based Preference Optimization (LMPO) to overcome Direct Preference Optimization's limitations by using a uniform reference model and a novel loss function that controls response length and reduces probability degradation, showing superior performance on Mistral and LLaMA3 models across multiple benchmarks.

Authors:Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber
Title: NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization
Abstract:
Image geo-localization is the task of predicting the specific location of an image and requires complex reasoning across visual, geographical, and cultural contexts. While prior Vision Language Models (VLMs) have the best accuracy at this task, there is a dearth of high-quality datasets and models for analytical reasoning. We first create NaviClues, a high-quality dataset derived from GeoGuessr, a popular geography game, to supply examples of expert reasoning from language. Using this dataset, we present Navig, a comprehensive image geo-localization framework integrating global and fine-grained image information. By reasoning with language, Navig reduces the average distance error by 14% compared to previous state-of-the-art models while requiring fewer than 1000 training samples. Our dataset and code are available at https://github.com/SparrowZheyuan18/Navig/.
Chinese: 本研究提出了Navig图像地理定位框架,利用高质量数据集NaviClues通过语言推理增强分析能力,仅需少量训练样本即可将平均距离误差降低14%,优于现有最优模型。
English: The study introduces Navig, a novel image geo-localization framework that leverages a high-quality dataset called NaviClues to enhance reasoning with language, reducing average distance error by 14% over previous models with minimal training data.

Authors:Angxiao Yue, Zichong Wang, Hongteng Xu
Title: ReQFlow: Rectified Quaternion Flow for Efficient and High-Quality Protein Backbone Generation
Abstract:
Protein backbone generation plays a central role in de novo protein design and is significant for many biological and medical applications. Although diffusion and flow-based generative models provide potential solutions to this challenging task, they often generate proteins with undesired designability and suffer computational inefficiency. In this study, we propose a novel rectified quaternion flow (ReQFlow) matching method for fast and high-quality protein backbone generation. In particular, our method generates a local translation and a 3D rotation from random noise for each residue in a protein chain, which represents each 3D rotation as a unit quaternion and constructs its flow by spherical linear interpolation (SLERP) in an exponential format. We train the model by quaternion flow (QFlow) matching with guaranteed numerical stability and rectify the QFlow model to accelerate its inference and improve the designability of generated protein backbones, leading to the proposed ReQFlow model. Experiments show that ReQFlow achieves on-par performance in protein backbone generation while requiring much fewer sampling steps and significantly less inference time (e.g., being 37x faster than RFDiffusion and 63x faster than Genie2 when generating a backbone of length 300), demonstrating its effectiveness and efficiency. The code is available at https://github.com/AngxiaoYue/ReQFlow.
中文摘要:本研究提出ReQFlow方法,通过修正四元数流匹配技术,利用单位四元数高效建模三维旋转,实现了快速高质量的蛋白质骨架生成,在保持同等性能的同时大幅提升了生成效率。
English Summary: The study introduces ReQFlow, a rectified quaternion flow matching method that enables fast and high-quality protein backbone generation by efficiently modeling 3D rotations with unit quaternions, achieving comparable performance while being significantly faster than existing methods.

Authors:Yuguo Yin, Yuxin Xie, Wenyuan Yang, Dongchao Yang, Jinghan Ru, Xianwei Zhuang, Liming Liang, Yuexian Zou
Title: ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors
Abstract:
Multilingual audio-text retrieval (ML-ATR) is a challenging task that aims to retrieve audio clips or multilingual texts from databases. However, existing ML-ATR schemes suffer from inconsistencies for instance similarity matching across languages. We theoretically analyze the inconsistency in terms of both multilingual modal alignment direction error and weight error, and propose the theoretical weight error upper bound for quantifying the inconsistency. Based on the analysis of the weight error upper bound, we find that the inconsistency problem stems from the data distribution error caused by random sampling of languages. We propose a consistent ML-ATR scheme using 1-to-k contrastive learning and audio-English co-anchor contrastive learning, aiming to mitigate the negative impact of data distribution error on recall and consistency in ML-ATR. Experimental results on the translated AudioCaps and Clotho datasets show that our scheme achieves state-of-the-art performance on recall and consistency metrics for eight mainstream languages, including English. Our code will be available at https://github.com/ATRI-ACL/ATRI-ACL.
中文: 本文提出了一种采用1对多对比学习和音频-英语共锚对比学习的一致性多语言音频文本检索方案,解决了跨语言相似性匹配不一致问题,在八种语言的召回率和一致性指标上达到了最优性能。
English: This paper proposes a consistent multilingual audio-text retrieval scheme using 1-to-k contrastive learning and audio-English co-anchor learning to address cross-language similarity matching inconsistencies, achieving state-of-the-art performance on recall and consistency metrics across eight languages.

Authors:Jiangyuan Liu, Hongxuan Ma, Yuxin Guo, Yuhao Zhao, Chi Zhang, Wei Sui, Wei Zou
Title: Monocular Depth Estimation and Segmentation for Transparent Object with Iterative Semantic and Geometric Fusion
Abstract:
Transparent object perception is indispensable for numerous robotic tasks. However, accurately segmenting and estimating the depth of transparent objects remain challenging due to complex optical properties. Existing methods primarily delve into only one task using extra inputs or specialized sensors, neglecting the valuable interactions among tasks and the subsequent refinement process, leading to suboptimal and blurry predictions. To address these issues, we propose a monocular framework, which is the first to excel in both segmentation and depth estimation of transparent objects, with only a single-image input. Specifically, we devise a novel semantic and geometric fusion module, effectively integrating the multi-scale information between tasks. In addition, drawing inspiration from human perception of objects, we further incorporate an iterative strategy, which progressively refines initial features for clearer results. Experiments on two challenging synthetic and real-world datasets demonstrate that our model surpasses state-of-the-art monocular, stereo, and multi-view methods by a large margin of about 38.8%-46.2% with only a single RGB input. Codes and models are publicly available at https://github.com/L-J-Yuan/MODEST.
Chinese Summary: 本文提出一种单目视觉框架,首次仅用单张RGB图像就实现了透明物体的精确分割与深度估计,通过多尺度融合模块和迭代优化策略,在合成与真实数据集上以38.8%-46.2%的显著优势超越现有最佳方法。
English Summary: This paper introduces a monocular framework that simultaneously excels in transparent object segmentation and depth estimation using only a single RGB image, achieving a 38.8%-46.2% performance improvement over existing methods through novel multi-scale fusion and iterative refinement strategies.

Authors:Chentao Cao, Zhun Zhong, Zhanke Zhou, Tongliang Liu, Yang Liu, Kun Zhang, Bo Han
Title: Noisy Test-Time Adaptation in Vision-Language Models
Abstract:
Test-time adaptation (TTA) aims to address distribution shifts between source and target data by relying solely on target data during testing. In open-world scenarios, models often encounter noisy samples, i.e., samples outside the in-distribution (ID) label space. Leveraging the zero-shot capability of pre-trained vision-language models (VLMs), this paper introduces Zero-Shot Noisy TTA (ZS-NTTA), focusing on adapting the model to target data with noisy samples during test-time in a zero-shot manner. We find existing TTA methods underperform under ZS-NTTA, often lagging behind even the frozen model. We conduct comprehensive experiments to analyze this phenomenon, revealing that the negative impact of unfiltered noisy data outweighs the benefits of clean data during model updating. Also, adapting a classifier for ID classification and noise detection hampers both sub-tasks. Built on this, we propose a framework that decouples the classifier and detector, focusing on developing an individual detector while keeping the classifier frozen. Technically, we introduce the Adaptive Noise Detector (AdaND), which utilizes the frozen model's outputs as pseudo-labels to train a noise detector. To handle clean data streams, we further inject Gaussian noise during adaptation, preventing the detector from misclassifying clean samples as noisy. Beyond the ZS-NTTA, AdaND can also improve the zero-shot out-of-distribution (ZS-OOD) detection ability of VLMs. Experiments show that AdaND outperforms in both ZS-NTTA and ZS-OOD detection. On ImageNet, AdaND achieves a notable improvement of $8.32\%$ in harmonic mean accuracy ($\text{Acc}_\text{H}$) for ZS-NTTA and $9.40\%$ in FPR95 for ZS-OOD detection, compared to SOTA methods. Importantly, AdaND is computationally efficient and comparable to the model-frozen method. The code is publicly available at: https://github.com/tmlr-group/ZS-NTTA.
中文: 本文提出零样本噪声测试时适应方法(ZS-NTTA),通过设计自适应噪声检测器(AdaND)框架,将分类器与检测器解耦,在保持计算效率的同时有效处理测试过程中的噪声样本。
English: This paper introduces Zero-Shot Noisy Test-Time Adaptation (ZS-NTTA), proposing the Adaptive Noise Detector (AdaND) framework that decouples classifier and detector functions to effectively handle noisy samples during testing while maintaining computational efficiency.

Authors:Rongzhen Wang, Yan Zhang, Chenyu Zheng, Chongxuan Li, Guoqiang Wu
Title: A Theory for Conditional Generative Modeling on Multiple Data Sources
Abstract:
The success of large generative models has driven a paradigm shift, leveraging massive multi-source data to enhance model capabilities. However, the interaction among these sources remains theoretically underexplored. This paper takes the first step toward a rigorous analysis of multi-source training in conditional generative modeling, where each condition represents a distinct data source. Specifically, we establish a general distribution estimation error bound in average total variation distance for conditional maximum likelihood estimation based on the bracketing number. Our result shows that when source distributions share certain similarities and the model is expressive enough, multi-source training guarantees a sharper bound than single-source training. We further instantiate the general theory on conditional Gaussian estimation and deep generative models including autoregressive and flexible energy-based models, by characterizing their bracketing numbers. The results highlight that the number of sources and similarity among source distributions improve the advantage of multi-source training. Simulations and real-world experiments are conducted to validate the theory, with code available at: https://github.com/ML-GSAI/Multi-Source-GM.
中文总结:本文首次对条件生成模型中的多源训练进行了严格理论分析,证明当源分布具有相似性且模型表达能力足够时,多源训练能获得比单源训练更优的误差界限。
English Summary: This paper provides the first rigorous theoretical analysis of multi-source training in conditional generative models, demonstrating that shared similarities among source distributions and model expressiveness yield sharper error bounds than single-source training.

Authors:Shiqi Zhang, Xinbei Ma, Zouying Cao, Zhuosheng Zhang, Hai Zhao
Title: Plan-over-Graph: Towards Parallelable LLM Agent Schedule
Abstract:
Large Language Models (LLMs) have demonstrated exceptional abilities in reasoning for task planning. However, challenges remain under-explored for parallel schedules. This paper introduces a novel paradigm, plan-over-graph, in which the model first decomposes a real-life textual task into executable subtasks and constructs an abstract task graph. The model then understands this task graph as input and generates a plan for parallel execution. To enhance the planning capability of complex, scalable graphs, we design an automated and controllable pipeline to generate synthetic graphs and propose a two-stage training scheme. Experimental results show that our plan-over-graph method significantly improves task performance on both API-based LLMs and trainable open-sourced LLMs. By normalizing complex tasks as graphs, our method naturally supports parallel execution, demonstrating global efficiency. The code and data are available at https://github.com/zsq259/Plan-over-Graph.
中文摘要:本文提出“图规划”新范式,通过将任务分解为可执行的子任务图结构,使大语言模型能够实现并行执行,显著提升了各类模型的任务处理性能。
English Summary: This paper introduces a "plan-over-graph" paradigm where LLMs decompose tasks into executable subtasks through graph structures, enabling parallel execution and significantly improving task performance across various models.

Authors:Eric Egli, Matteo Manica, Jannis Born
Title: Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling
Abstract:
Bytes form the basis of the digital world and thus are a promising building block for multimodal foundation models. Recently, Byte Language Models (BLMs) have emerged to overcome tokenization, yet the excessive length of bytestreams requires new architectural paradigms. Therefore, we present the Multiscale Byte Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows training with context windows of $5$M bytes on single GPU in full model precision. We thoroughly examine MBLM's performance with Transformer and Mamba blocks on both unimodal and multimodal tasks. Our experiments demonstrate that hybrid architectures are efficient in handling extremely long byte sequences during training while achieving near-linear generational efficiency. To the best of our knowledge, we present the first evaluation of BLMs on visual Q\&A tasks and find that, despite serializing images and the absence of an encoder, a MBLM with pure next token prediction can match custom CNN-LSTM architectures with designated classification heads. We show that MBLMs exhibit strong adaptability in integrating diverse data representations, including pixel and image filestream bytes, underlining their potential toward omnimodal foundation models. Source code is publicly available at: https://github.com/ai4sd/multiscale-byte-lm
中文摘要:多尺度字节语言模型(MBLM)提出分层解码器架构,可在单GPU上高效处理百万字节序列训练,通过纯下一词元预测在多模态任务中实现与定制模型相媲美的性能,无需专用编码器。
English Summary: The Multiscale Byte Language Model (MBLM) introduces a hierarchical decoder architecture enabling efficient training on million-byte sequences with standard GPUs, demonstrating competitive performance in multimodal tasks through pure next-token prediction without specialized encoders.

Authors:Yupeng Chang, Chenlu Guo, Yi Chang, Yuan Wu
Title: LoRA-MGPO: Mitigating Double Descent in Low-Rank Adaptation via Momentum-Guided Perturbation Optimization
Abstract:
Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), enable efficient adaptation of large language models (LLMs) via low-rank matrix optimization with frozen weights. However, LoRA typically exhibits "double descent" in training loss as rank increases, characterized by a three-phase dynamics: initial convergence, transient divergence, and eventual stabilization. This non-monotonic behavior delays convergence and impairs generalization through unstable gradients and attraction to sharp minima. To address these challenges, we propose LoRA-MGPO, a novel LoRA-based framework incorporating Momentum-Guided Perturbation Optimization (MGPO). First, MGPO eliminates Sharpness-Aware Minimization (SAM)'s dual gradient computations by reusing momentum vectors from optimizer states to guide perturbation directions. This retains SAM's training stability and flat minima preference with maintained efficiency. Second, MGPO incorporates adaptive perturbation normalization, scaling perturbation intensity via exponential moving average (EMA)-smoothed gradient magnitudes. Experiments on natural language understanding and generation benchmarks demonstrate that LoRA-MGPO outperforms LoRA and state-of-the-art PEFT methods. Further analysis confirms its ability to stabilize training and reduce sharp minima attraction, with smoother loss curves and improved convergence behavior. The code is available at https://github.com/llm172/LoRA-MGPO
中文: 提出的LoRA-MGPO框架通过动量引导扰动优化技术增强LoRA,在保持效率的同时有效稳定训练过程并提升语言任务性能。
English: The proposed LoRA-MGPO framework enhances LoRA by integrating Momentum-Guided Perturbation Optimization, which stabilizes training and improves performance on language tasks without sacrificing efficiency.

Authors:Jannik Irmai, Maximilian Moeller, Bjoern Andres
Title: Algorithms for the preordering problem and their application to the task of jointly clustering and ordering the accounts of a social network
Abstract:
The NP-hard maximum value preordering problem is both a joint relaxation and a hybrid of the clique partition problem (a clustering problem) and the partial ordering problem. Toward approximate solutions and lower bounds, we introduce a linear-time 4-approximation algorithm that constructs a maximum dicut of a subgraph and define local search heuristics. Toward upper bounds, we tighten a linear program relaxation by the class of odd closed walk inequalities that define facets, as we show, of the preorder polytope. We contribute implementations of the algorithms, apply these to the task of jointly clustering and partially ordering the accounts of published social networks, and compare the output and efficiency qualitatively and quantitatively.
中文: 本研究针对最大价值预排序问题提出了4-近似算法和局部搜索启发式方法,并通过定义多面体的不等式加强线性规划松弛,最后将这些方法应用于社交网络分析的聚类和偏序任务。
English: This study presents a 4-approximation algorithm and local search heuristics for the maximum value preordering problem, along with tightened linear programming relaxations using facet-defining inequalities, and applies these methods to social network analysis.

Authors:Zhenhong Zhou, Zherui Li, Jie Zhang, Yuanhe Zhang, Kun Wang, Yang Liu, Qing Guo
Title: CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models
Abstract:
Large Language Model-based Multi-Agent Systems (LLM-MASs) have demonstrated remarkable real-world capabilities, effectively collaborating to complete complex tasks. While these systems are designed with safety mechanisms, such as rejecting harmful instructions through alignment, their security remains largely unexplored. This gap leaves LLM-MASs vulnerable to targeted disruptions. In this paper, we introduce Contagious Recursive Blocking Attacks (Corba), a novel and simple yet highly effective attack that disrupts interactions between agents within an LLM-MAS. Corba leverages two key properties: its contagious nature allows it to propagate across arbitrary network topologies, while its recursive property enables sustained depletion of computational resources. Notably, these blocking attacks often involve seemingly benign instructions, making them particularly challenging to mitigate using conventional alignment methods. We evaluate Corba on two widely-used LLM-MASs, namely, AutoGen and Camel across various topologies and commercial models. Additionally, we conduct more extensive experiments in open-ended interactive LLM-MASs, demonstrating the effectiveness of Corba in complex topology structures and open-source models. Our code is available at: https://github.com/zhrli324/Corba.
中文: 本文提出Corba攻击,这种具有传染性和递归性的方法能通过看似无害的指令在网络中传播并持续消耗资源,有效破坏基于大语言模型的多智能体系统,对传统安全防护机制构成挑战。
English: This paper introduces Corba, a contagious and recursive attack that effectively disrupts LLM-based multi-agent systems by propagating across networks and depleting resources through seemingly harmless instructions, challenging conventional safety measures.

Authors:Jiahao Qi, Chuanhong Zhou, Xingyue Liu, Chen Chen, Dehui Zhu, Kangcheng Bin, Ping Zhong
Title: Nearshore Underwater Target Detection Meets UAV-borne Hyperspectral Remote Sensing: A Novel Hybrid-level Contrastive Learning Framework and Benchmark Dataset
Abstract:
UAV-borne hyperspectral remote sensing has emerged as a promising approach for underwater target detection (UTD). However, its effectiveness is hindered by spectral distortions in nearshore environments, which compromise the accuracy of traditional hyperspectral UTD (HUTD) methods that rely on bathymetric model. These distortions lead to significant uncertainty in target and background spectra, challenging the detection process. To address this, we propose the Hyperspectral Underwater Contrastive Learning Network (HUCLNet), a novel framework that integrates contrastive learning with a self-paced learning paradigm for robust HUTD in nearshore regions. HUCLNet extracts discriminative features from distorted hyperspectral data through contrastive learning, while the self-paced learning strategy selectively prioritizes the most informative samples. Additionally, a reliability-guided clustering strategy enhances the robustness of learned representations.To evaluate the method effectiveness, we conduct a novel nearshore HUTD benchmark dataset, ATR2-HUTD, covering three diverse scenarios with varying water types and turbidity, and target types. Extensive experiments demonstrate that HUCLNet significantly outperforms state-of-the-art methods. The dataset and code will be publicly available at: https://github.com/qjh1996/HUTD
Chinese: 提出的高光谱水下对比学习网络(HUCLNet)通过结合对比学习和自步学习,有效解决了近岸水下目标检测中的光谱失真问题,在新型ATR2-HUTD基准数据集上显著优于现有方法。
English: The proposed Hyperspectral Underwater Contrastive Learning Network (HUCLNet) effectively addresses spectral distortions in nearshore underwater target detection by integrating contrastive learning and self-paced learning, significantly outperforming existing methods on the new ATR2-HUTD benchmark dataset.

Authors:Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, Yuan Wu
Title: StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following
Abstract:
Multi-turn instruction following capability constitutes a core competency of large language models (LLMs) in real-world applications. Existing evaluation benchmarks predominantly focus on fine-grained constraint satisfaction and domain-specific capability assessment, yet overlook the crucial structural dependencies between dialogue turns that distinguish multi-turn from single-turn interactions. These structural dependencies not only reflect user intent but also establish an essential second dimension for the instruction following evaluation beyond constraint satisfaction. To address this gap, we propose StructFlowBench, a multi-turn instruction following benchmark with structural flow modeling. The benchmark defines an innovative structural flow framework with six fundamental inter-turn relationships. These relationships introduce novel structural constraints for model evaluation and also serve as generation parameters for creating customized dialogue flows tailored to specific scenarios. Adopting established LLM-based automatic evaluation methodologies, we conduct systematic evaluations of 13 leading open-source and closed-source LLMs. Experimental results reveal significant deficiencies in current models' comprehension of multi-turn dialogue structures. The code is available at https://github.com/MLGroupJLU/StructFlowBench.
中文摘要:StructFlowBench是一个新的多轮对话评估基准,专门用于测试大语言模型对对话结构依赖关系的理解能力,实验结果表明当前模型在这方面存在明显不足。
English Summary: StructFlowBench is a new benchmark designed to evaluate large language models' ability to handle structural dependencies in multi-turn conversations, revealing significant shortcomings in current models' understanding of dialogue flow.

Authors:Lorraine A. K. Ayad, Gabriele Fici, Ragnar Groot Koerkamp, Grigorios Loukides, Rob Patro, Giulio Ermanno Pibiri, Solon P. Pissis
Title: U-index: A Universal Indexing Framework for Matching Long Patterns
Abstract:
Text indexing is a fundamental and well-studied problem. Classic solutions either replace the original text with a compressed representation, e.g., the FM-index and its variants, or keep it uncompressed but attach some redundancy - an index - to accelerate matching. The former solutions thus retain excellent compressed space, but are slow in practice. The latter approaches, like the suffix array, instead sacrifice space for speed. We show that efficient text indexing can be achieved using just a small extra space on top of the original text, provided that the query patterns are sufficiently long. More specifically, we develop a new indexing paradigm in which a sketch of a query pattern is first matched against a sketch of the text. Once candidate matches are retrieved, they are verified using the original text. This paradigm is thus universal in the sense that it allows us to use any solution to index the sketched text, like a suffix array, FM-index, or r-index. We explore both the theory and the practice of this universal framework. With an extensive experimental analysis, we show that, surprisingly, universal indexes can be constructed much faster than their unsketched counterparts and take a fraction of the space, as a direct consequence of (i) having a lower bound on the length of patterns and (ii) working in sketch space. Furthermore, these data structures have the potential of retaining or even improving query time, because matching against the sketched text is faster and verifying candidates can be theoretically done in constant time per occurrence (or, in practice, by short and cache-friendly scans of the text). Finally, we discuss some important applications of this novel indexing paradigm to computational biology. We hypothesize that such indexes will be particularly effective when the queries are sufficiently long, and so demonstrate applications in long-read mapping.
中文摘要:作者提出了一种通用文本索引框架,通过使用模式与文本的草图匹配,在保持查询效率的同时显著减少索引构建时间与存储空间,尤其适用于长模式查询场景。
English Summary: The authors propose a universal text indexing framework that uses sketches of patterns and text to achieve efficient indexing with minimal extra space, significantly improving construction speed and reducing space usage while maintaining or enhancing query performance for sufficiently long patterns.

Authors:Chengyu Fang, Chunming He, Longxiang Tang, Yuelin Zhang, Chenyang Zhu, Yuqi Shen, Chubin Chen, Guoxia Xu, Xiu Li
Title: Integrating Extra Modality Helps Segmentor Find Camouflaged Objects Well
Abstract:
Camouflaged Object Segmentation (COS) remains challenging because camouflaged objects exhibit only subtle visual differences from their backgrounds and single-modality RGB methods provide limited cues, leading researchers to explore multimodal data to improve segmentation accuracy. In this work, we presenet MultiCOS, a novel framework that effectively leverages diverse data modalities to improve segmentation performance. MultiCOS comprises two modules: Bi-space Fusion Segmentor (BFSer), which employs a state space and a latent space fusion mechanism to integrate cross-modal features within a shared representation and employs a fusion-feedback mechanism to refine context-specific features, and Cross-modal Knowledge Learner (CKLer), which leverages external multimodal datasets to generate pseudo-modal inputs and establish cross-modal semantic associations, transferring knowledge to COS models when real multimodal pairs are missing. When real multimodal COS data are unavailable, CKLer yields additional segmentation gains using only non-COS multimodal sources. Experiments on standard COS benchmarks show that BFSer outperforms existing multimodal baselines with both real and pseudo-modal data. Code will be released at \href{https://github.com/cnyvfang/MultiCOS}{GitHub}.
Chinese: MultiCOS是一种新颖框架,通过双空间融合分割器整合跨模态特征,并利用跨模态知识学习器借助外部数据集提升伪装物体分割性能,即使在缺乏真实多模态数据时也能实现精度提升。
English: MultiCOS is a novel framework that enhances Camouflaged Object Segmentation by integrating cross-modal features through its Bi-space Fusion Segmentor and leveraging external datasets via the Cross-modal Knowledge Learner to improve accuracy even without real multimodal data.

Authors:Cristian A. Galvis-Florez, Ahmad Farooq, Simo Särkkä
Title: Provable Quantum Algorithm Advantage for Gaussian Process Quadrature
Abstract:
The aim of this paper is to develop novel quantum algorithms for Gaussian process quadrature methods. Gaussian process quadratures are numerical integration methods where Gaussian processes are used as functional priors for the integrands to capture the uncertainty arising from the sparse function evaluations. Quantum computers have emerged as potential replacements for classical computers, offering exponential reductions in the computational complexity of machine learning tasks. In this paper, we combine Gaussian process quadratures and quantum computing by proposing a quantum low-rank Gaussian process quadrature method based on a Hilbert space approximation of the Gaussian process kernel and enhancing the quadrature using a quantum circuit. The method combines the quantum phase estimation algorithm with the quantum principal component analysis technique to extract information up to a desired rank. Then, Hadamard and SWAP tests are implemented to find the expected value and variance that determines the quadrature. We use numerical simulations of a quantum computer to demonstrate the effectiveness of the method. Furthermore, we provide a theoretical complexity analysis that shows a polynomial advantage over classical Gaussian process quadrature methods. The code is available at https://github.com/cagalvisf/Quantum_HSGPQ.
本文提出了一种新颖的高斯过程求积量子算法,通过结合量子相位估计与主成分分析技术,在模拟实验中验证了其计算效率,并证明相比经典方法具有多项式级加速优势。
This paper introduces a novel quantum algorithm for Gaussian process quadrature that combines quantum phase estimation with principal component analysis, demonstrating both computational efficiency through simulations and a polynomial speed advantage over classical methods.

Authors:Lorenzo Pacchiardi, Konstantinos Voudouris, Ben Slater, Fernando Martínez-Plumed, José Hernández-Orallo, Lexin Zhou, Wout Schellaert
Title: PredictaBoard: Benchmarking LLM Score Predictability
Abstract:
Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success in even basic common sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative benchmarking framework designed to evaluate the ability of score predictors (referred to as assessors) to anticipate LLM errors on specific task instances (i.e., prompts) from existing datasets. PredictaBoard evaluates pairs of LLMs and assessors by considering the rejection rate at different tolerance errors. As such, PredictaBoard stimulates research into developing better assessors and making LLMs more predictable, not only with a higher average performance. We conduct illustrative experiments using baseline assessors and state-of-the-art LLMs. PredictaBoard highlights the critical need to evaluate predictability alongside performance, paving the way for safer AI systems where errors are not only minimised but also anticipated and effectively mitigated. Code for our benchmark can be found at https://github.com/Kinds-of-Intelligence-CFI/PredictaBoard
Chinese: PredictaBoard是一个协作式基准测试框架,用于评估评分预测器对特定任务中大型语言模型错误的预判能力,通过前瞻性风险防控推动构建更安全、更可预测的人工智能系统。
English: PredictaBoard is a collaborative benchmarking framework that evaluates how well assessors can predict LLM errors on specific prompts, aiming to enhance LLM predictability and safety by anticipating and mitigating risks beyond just improving average performance.

Authors:Marco ComunitÃ, Christian J. Steinmetz, Joshua D. Reiss
Title: Differentiable Black-box and Gray-box Modeling of Nonlinear Audio Effects
Abstract:
Audio effects are extensively used at every stage of audio and music content creation. The majority of differentiable audio effects modeling approaches fall into the black-box or gray-box paradigms; and most models have been proposed and applied to nonlinear effects like guitar amplifiers, overdrive, distortion, fuzz and compressor. Although a plethora of architectures have been introduced for the task at hand there is still lack of understanding on the state of the art, since most publications experiment with one type of nonlinear audio effect and a very small number of devices. In this work we aim to shed light on the audio effects modeling landscape by comparing black-box and gray-box architectures on a large number of nonlinear audio effects, identifying the most suitable for a wide range of devices. In the process, we also: introduce time-varying gray-box models and propose models for compressor, distortion and fuzz, publish a large dataset for audio effects research - ToneTwist AFx https://github.com/mcomunita/tonetwist-afx-dataset - that is also the first open to community contributions, evaluate models on a variety of metrics and conduct extensive subjective evaluation. Code https://github.com/mcomunita/nablafx and supplementary material https://github.com/mcomunita/nnlinafx-supp-material are also available.
Chinese: 本研究通过比较多种非线性音频效果的黑盒与灰盒架构,确定了最适用的模型,同时引入了时变灰盒模型,提出了压缩器、失真和法兹效果的新设计,并发布了可供社区贡献的数据集和代码以推动研究。
English: This study compares black-box and gray-box architectures across numerous nonlinear audio effects to identify the most effective models, while also introducing time-varying models, proposing new designs for compressors, distortion, and fuzz, and releasing a community-contributable dataset and code for further research.

Authors:Paul Friedrich, Florentin Bieder, Julian McGinnis, Julia Wolleb, Daniel Rueckert, Philippe C. Cattin
Title: MedFuncta: A Unified Framework for Learning Efficient Medical Neural Fields
Abstract:
Research in medical imaging primarily focuses on discrete data representations that poorly scale with grid resolution and fail to capture the often continuous nature of the underlying signal. Neural Fields (NFs) offer a powerful alternative by modeling data as continuous functions. While single-instance NFs have successfully been applied in medical contexts, extending them to large-scale medical datasets remains an open challenge. We therefore introduce MedFuncta, a unified framework for large-scale NF training on diverse medical signals. Building on Functa, our approach encodes data into a unified representation, namely a 1D latent vector, that modulates a shared, meta-learned NF, enabling generalization across a dataset. We revisit common design choices, introducing a non-constant frequency parameter $ω$ in widely used SIREN activations, and establish a connection between this $ω$-schedule and layer-wise learning rates, relating our findings to recent work in theoretical learning dynamics. We additionally introduce a scalable meta-learning strategy for shared network learning that employs sparse supervision during training, thereby reducing memory consumption and computational overhead while maintaining competitive performance. Finally, we evaluate MedFuncta across a diverse range of medical datasets and show how to solve relevant downstream tasks on our neural data representation. To promote further research in this direction, we release our code, model weights and the first large-scale dataset - MedNF - containing > 500 k latent vectors for multi-instance medical NFs.
Chinese: 医学影像研究常受限于离散数据表示,因此MedFuncta提出了一个统一框架,利用神经场建模连续函数,并通过元学习和优化激活函数实现在大规模医学数据集上的泛化能力。
English: Medical imaging research often struggles with discrete data representations, so MedFuncta introduces a unified framework using neural fields to model continuous functions and enable generalization across large-scale medical datasets through meta-learning and optimized activation functions.

Authors:Dacheng Li, Shiyi Cao, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica
Title: S*: Test Time Scaling for Code Generation
Abstract:
Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the existing parallel scaling paradigm with sequential scaling to push performance boundaries. It further leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison, combined with execution-grounded information to robustly identify correct solutions. We evaluate across 12 Large Language Models and Large Reasoning Model and show: (1) S* consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) S* enables non-reasoning models to surpass reasoning models - GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) S* further boosts state-of-the-art reasoning models - DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%. Code will be available under https://github.com/NovaSky-AI/SkyThought.
中文:S*是一个创新的混合测试时扩展框架,通过结合并行与顺序扩展策略,并采用自适应输入选择和基于执行的验证机制,显著提升了代码生成性能,使小模型可超越大模型、非推理模型能超过推理模型。
English: S* is a novel hybrid test-time scaling framework that enhances code generation by combining parallel and sequential scaling with adaptive input selection and execution-based verification, significantly boosting performance across various models including enabling smaller models to surpass larger ones and non-reasoning models to exceed reasoning models.

Authors:Louis Carpentier, Nick Seeuws, Wannes Meert, Mathias Verbeke
Title: dtaianomaly: A Python library for time series anomaly detection
Abstract:
dtaianomaly is an open-source Python library for time series anomaly detection, designed to bridge the gap between academic research and real-world applications. Our goal is to (1) accelerate the development of novel state-of-the-art anomaly detection techniques through simple extensibility; (2) offer functionality for large-scale experimental validation; and thereby (3) bring cutting-edge research to business and industry through a standardized API, similar to scikit-learn to lower the entry barrier for both new and experienced users. Besides these key features, dtaianomaly offers (1) a broad range of built-in anomaly detectors, (2) support for time series preprocessing, (3) tools for visual analysis, (4) confidence prediction of anomaly scores, (5) runtime and memory profiling, (6) comprehensive documentation, and (7) cross-platform unit testing. The source code of dtaianomaly, documentation, code examples and installation guides are publicly available at https://github.com/ML-KULeuven/dtaianomaly.
中文: dtaianomaly 是一个开源的时间序列异常检测Python库,旨在通过类scikit-learn的标准化API降低使用门槛,既支持前沿算法的扩展开发与实验验证,又为工业应用提供丰富的内置功能。
English: dtaianomaly is an open-source Python library for time series anomaly detection that facilitates the development and validation of advanced techniques while providing industry-ready tools through a scikit-learn-like API.

Authors:Ke Cao, Jing Wang, Ao Ma, Jiasong Feng, Zhanjie Zhang, Xuanhua He, Shanyuan Liu, Bo Cheng, Dawei Leng, Yuhui Yin, Jie Zhang
Title: RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers
Abstract:
The Diffusion Transformer plays a pivotal role in advancing text-to-image and text-to-video generation, owing primarily to its inherent scalability. However, existing controlled diffusion transformer methods incur significant parameter and computational overheads and suffer from inefficient resource allocation due to their failure to account for the varying relevance of control information across different transformer layers. To address this, we propose the Relevance-Guided Efficient Controllable Generation framework, RelaCtrl, enabling efficient and resource-optimized integration of control signals into the Diffusion Transformer. First, we evaluate the relevance of each layer in the Diffusion Transformer to the control information by assessing the "ControlNet Relevance Score"-i.e., the impact of skipping each control layer on both the quality of generation and the control effectiveness during inference. Based on the strength of the relevance, we then tailor the positioning, parameter scale, and modeling capacity of the control layers to reduce unnecessary parameters and redundant computations. Additionally, to further improve efficiency, we replace the self-attention and FFN in the commonly used copy block with the carefully designed Two-Dimensional Shuffle Mixer (TDSM), enabling efficient implementation of both the token mixer and channel mixer. Both qualitative and quantitative experimental results demonstrate that our approach achieves superior performance with only 15% of the parameters and computational complexity compared to PixArt-delta.
扩散变换器在文本到图像和视频生成中发挥关键作用,但现有方法因参数和计算开销大且资源分配低效而受限;我们提出的RelaCtrl框架通过评估控制信息在各层的相关性并优化控制层结构与计算模块,仅需15%的参数和计算量即可实现卓越性能。
Diffusion Transformer significantly advances text-to-image and video generation but faces efficiency challenges due to excessive parameters and poor resource allocation, which our RelaCtrl framework addresses by optimizing control layer relevance and computational structure for high performance with minimal resources.

Authors:Moxin Li, Yuantao Zhang, Wenjie Wang, Wentao Shi, Zhuo Liu, Fuli Feng, Tat-Seng Chua
Title: Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment
Abstract:
Multi-Objective Alignment (MOA) aims to align LLMs' responses with multiple human preference objectives, with Direct Preference Optimization (DPO) emerging as a prominent approach. However, we find that DPO-based MOA approaches suffer from widespread preference conflicts in the data, where different objectives favor different responses. This results in conflicting optimization directions, hindering the optimization on the Pareto Front. To address this, we propose to construct Pareto-optimal responses to resolve preference conflicts. To efficiently obtain and utilize such responses, we propose a self-improving DPO framework that enables LLMs to self-generate and select Pareto-optimal responses for self-supervised preference alignment. Extensive experiments on two datasets demonstrate the superior Pareto Front achieved by our framework compared to various baselines. Code is available at https://github.com/zyttt-coder/SIPO.
Chinese: 该研究提出了一种自改进的直接偏好优化框架,使大语言模型能够自我生成并选择帕累托最优响应,有效解决偏好冲突,在多目标对齐方面相比现有方法实现了更优的性能。
English: The study introduces a self-improving Direct Preference Optimization framework that enables large language models to self-generate and select Pareto-optimal responses, effectively resolving preference conflicts and achieving superior multi-objective alignment compared to existing methods.

Authors:Yuchen Shi, Siqi Cai, Zihan Xu, Yuei Qin, Gang Li, Hang Shao, Jiawei Chen, Deqing Yang, Ke Li, Xing Sun
Title: FlowAgent: Achieving Compliance and Flexibility for Workflow Agents
Abstract:
The integration of workflows with large language models (LLMs) enables LLM-based agents to execute predefined procedures, enhancing automation in real-world applications. Traditional rule-based methods tend to limit the inherent flexibility of LLMs, as their predefined execution paths restrict the models' action space, particularly when the unexpected, out-of-workflow (OOW) queries are encountered. Conversely, prompt-based methods allow LLMs to fully control the flow, which can lead to diminished enforcement of procedural compliance. To address these challenges, we introduce FlowAgent, a novel agent framework designed to maintain both compliance and flexibility. We propose the Procedure Description Language (PDL), which combines the adaptability of natural language with the precision of code to formulate workflows. Building on PDL, we develop a comprehensive framework that empowers LLMs to manage OOW queries effectively, while keeping the execution path under the supervision of a set of controllers. Additionally, we present a new evaluation methodology to rigorously assess an LLM agent's ability to handle OOW scenarios, going beyond routine flow compliance tested in existing benchmarks. Experiments on three datasets demonstrate that FlowAgent not only adheres to workflows but also effectively manages OOW queries, highlighting its dual strengths in compliance and flexibility. The code is available at https://github.com/Lightblues/FlowAgent.
Chinese: FlowAgent提出了一种新颖的框架,通过使用过程描述语言,在基于大语言模型的智能体中兼顾合规性与灵活性,有效管理预定义工作流和意外的工作流外查询。
English: FlowAgent introduces a novel framework that combines compliance and flexibility in LLM-based agents by using a Procedure Description Language to manage both predefined workflows and unexpected out-of-workflow queries effectively.

Authors:Ruichen Shao, Bei Li, Gangao Liu, Yang Chen, Xiang Zhou, Jingang Wang, Xunliang Cai, Peng Li
Title: Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective
Abstract:
Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers from a length bias, generating responses longer than those from the reference model. Existing solutions like SimPO and SamPO address this issue but uniformly treat the contribution of rewards across sequences, overlooking temporal dynamics. To this end, we propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter. This dynamic weighting mechanism adjusts the influence of each reward based on its position in the sequence, prioritizing earlier tokens that are more critical for alignment. By adaptively focusing on more relevant feedback, our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences. Experimental results on several benchmarks show that our approach consistently outperforms vanilla DPO by 5.9-8.8 points on AlpacaEval 2 and 3.3-9.7 points on Arena-Hard across different model architectures and sizes. Furthermore, additional experiments on mathematical and reasoning benchmarks (MMLU, GSM8K, and MATH) confirm that our method enhances performance without compromising general capabilities. Our codebase would be available at \url{https://github.com/LotuSrc/D2PO}.
Chinese: 提出的增强偏好优化方法引入时间衰减因子,根据标记位置动态调整奖励权重,有效缓解DPO的长度偏差,在基准测试中性能提升5.9-8.8分,同时保持模型的通用能力。
English: The proposed enhanced preference optimization method introduces a temporal decay factor to dynamically weight rewards based on token position, effectively mitigating DPO's length bias and improving alignment performance by 5.9-8.8 points on benchmarks while preserving general capabilities.

Authors:Avinash Patil, Siru Tao, Aryan Jadon
Title: English Please: Evaluating Machine Translation with Large Language Models for Multilingual Bug Reports
Abstract:
Accurate translation of bug reports is critical for efficient collaboration in global software development. In this study, we conduct the first comprehensive evaluation of machine translation (MT) performance on bug reports, analyzing the capabilities of DeepL, AWS Translate, and large language models such as ChatGPT, Claude, Gemini, LLaMA, and Mistral using data from the Visual Studio Code GitHub repository, specifically focusing on reports labeled with the english-please tag. To assess both translation quality and source language identification accuracy, we employ a range of MT evaluation metrics-including BLEU, BERTScore, COMET, METEOR, and ROUGE-alongside classification metrics such as accuracy, precision, recall, and F1-score. Our findings reveal that while ChatGPT (gpt-4o) excels in semantic and lexical translation quality, it does not lead in source language identification. Claude and Mistral achieve the highest F1-scores (0.7182 and 0.7142, respectively), and Gemini records the best precision (0.7414). AWS Translate shows the highest accuracy (0.4717) in identifying source languages. These results highlight that no single system dominates across all tasks, reinforcing the importance of task-specific evaluations. This study underscores the need for domain adaptation when translating technical content and provides actionable insights for integrating MT into bug-triaging workflows. The code and dataset for this paper are available at GitHub-https://github.com/av9ash/English-Please
中文摘要:本研究评估了多种机器翻译系统处理错误报告的表现,发现ChatGPT在翻译质量上最优,而Claude和Mistral在源语言识别方面领先,表明没有单一系统能在所有任务中全面胜出。
English Summary: This study evaluates machine translation systems for bug reports, finding that ChatGPT excels in translation quality while Claude and Mistral lead in source language identification, demonstrating no single system performs best across all tasks.

Authors:Zhucong Li, Jin Xiao, Bowei Zhang, Zhijian Zhou, Qianyu He, Fenglei Cao, Jiaqing Liang, Yuan Qi
Title: ChemHTS: Hierarchical Tool Stacking for Enhancing Chemical Agents
Abstract:
Large Language Models (LLMs) have demonstrated remarkable potential in scientific research, particularly in chemistry-related tasks such as molecular design, reaction prediction, and property estimation. While tool-augmented LLMs have been introduced to enhance reasoning and computation in these domains, existing approaches suffer from tool invocation errors and lack effective collaboration among diverse tools, limiting their overall performance. To address these challenges, we propose ChemHTS (Chemical Hierarchical Tool Stacking), a novel method that optimizes tool invocation pathways through a hierarchical stacking strategy. ChemHTS consists of two key stages: tool self-stacking warmup and multi-layer decision optimization, enabling LLMs to refine tool usage dynamically. We evaluate ChemHTS across four classical chemistry tasks and demonstrate its superiority over strong baselines, including GPT-4o, DeepSeek-R1, and chemistry-specific models, including ChemDFM. Furthermore, we define four distinct tool-stacking behaviors to enhance interpretability, providing insights into the effectiveness of tool collaboration. Our dataset and code are publicly available at \url{https://github.com/Chang-pw/ChemHTS}.
中文:提出的ChemHTS方法通过分层堆叠策略优化工具协作,在多项化学任务中超越现有模型,同时增强了工具使用的可解释性。
English: The proposed ChemHTS method enhances chemical research by optimizing tool collaboration through hierarchical stacking, outperforming existing models in multiple chemistry tasks while improving interpretability.

Authors:Jing Xiong, Jianghan Shen, Chuanyang Zheng, Zhongwei Wan, Chenyang Zhao, Chiwun Yang, Fanghua Ye, Hongxia Yang, Lingpeng Kong, Ngai Wong
Title: ParallelComp: Parallel Long-Context Compressor for Length Extrapolation
Abstract:
Extrapolating ultra-long contexts (text length >128K) remains a major challenge for large language models (LLMs), as most training-free extrapolation methods are not only severely limited by memory bottlenecks, but also suffer from the attention sink, which restricts their scalability and effectiveness in practice. In this work, we propose ParallelComp, a parallel long-context compression method that effectively overcomes the memory bottleneck, enabling 8B-parameter LLMs to extrapolate from 8K to 128K tokens on a single A100 80GB GPU in a training-free setting. ParallelComp splits the input into chunks, dynamically evicting redundant chunks and irrelevant tokens, supported by a parallel KV cache eviction mechanism. Importantly, we present a systematic theoretical and empirical analysis of attention biases in parallel attention-including the attention sink, recency bias, and middle bias-and reveal that these biases exhibit distinctive patterns under ultra-long context settings. We further design a KV cache eviction technique to mitigate this phenomenon. Experimental results show that ParallelComp enables an 8B model (trained on 8K context) to achieve 91.17% of GPT-4's performance under ultra-long contexts, outperforming closed-source models such as Claude-2 and Kimi-Chat. We achieve a 1.76x improvement in chunk throughput, thereby achieving a 23.50x acceleration in the prefill stage with negligible performance loss and pave the way for scalable and robust ultra-long contexts extrapolation in LLMs. We release the code at https://github.com/menik1126/ParallelComp.
中文: ParallelComp是一种无需训练的并行压缩方法,通过克服内存瓶颈和注意力偏差,使80亿参数大模型在单GPU上实现从8K到128K标记的上下文扩展,性能接近GPT-4且大幅提升处理速度。
English: ParallelComp is a training-free parallel compression method that enables 8B LLMs to extrapolate from 8K to 128K tokens on a single GPU by overcoming memory bottlenecks and attention biases, achieving near-GPT-4 performance with significant speed improvements.

Authors:Yurong Wu, Fangwen Mu, Qiuhong Zhang, Jinjing Zhao, Xinrun Xu, Lingrui Mei, Yang Wu, Lin Shi, Junjie Wang, Zhiming Ding, Yiwei Wang
Title: Vulnerability of Text-to-Image Models to Prompt Template Stealing: A Differential Evolution Approach
Abstract:
Prompt trading has emerged as a significant intellectual property concern in recent years, where vendors entice users by showcasing sample images before selling prompt templates that can generate similar images. This work investigates a critical security vulnerability: attackers can steal prompt templates using only a limited number of sample images. To investigate this threat, we introduce Prism, a prompt-stealing benchmark consisting of 50 templates and 450 images, organized into Easy and Hard difficulty levels. To identify the vulnerabity of VLMs to prompt stealing, we propose EvoStealer, a novel template stealing method that operates without model fine-tuning by leveraging differential evolution algorithms. The system first initializes population sets using multimodal large language models (MLLMs) based on predefined patterns, then iteratively generates enhanced offspring through MLLMs. During evolution, EvoStealer identifies common features across offspring to derive generalized templates. Our comprehensive evaluation conducted across open-source (INTERNVL2-26B) and closed-source models (GPT-4o and GPT-4o-mini) demonstrates that EvoStealer's stolen templates can reproduce images highly similar to originals and effectively generalize to other subjects, significantly outperforming baseline methods with an average improvement of over 10%. Moreover, our cost analysis reveals that EvoStealer achieves template stealing with negligible computational expenses. Our code and dataset are available at https://github.com/whitepagewu/evostealer.
中文: 本研究提出EvoStealer,一种无需模型微调即可通过差分进化算法从样本图像中窃取提示模板的新方法,其性能显著优于基线方法,平均提升超过10%,且计算成本极低。
English: This study introduces EvoStealer, a novel prompt-stealing method that uses differential evolution algorithms to extract prompt templates from sample images without model fine-tuning, demonstrating superior performance over baselines with over 10% improvement and minimal computational cost.

Authors:Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, Fei Huang
Title: PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
Abstract:
In the field of MLLM-based GUI agents, compared to smartphones, the PC scenario not only features a more complex interactive environment, but also involves more intricate intra- and inter-app workflows. To address these issues, we propose a hierarchical agent framework named PC-Agent. Specifically, from the perception perspective, we devise an Active Perception Module (APM) to overcome the inadequate abilities of current MLLMs in perceiving screenshot content. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture that decomposes decision-making processes into Instruction-Subtask-Action levels. Within this architecture, three agents (i.e., Manager, Progress and Decision) are set up for instruction decomposition, progress tracking and step-by-step decision-making respectively. Additionally, a Reflection agent is adopted to enable timely bottom-up error feedback and adjustment. We also introduce a new benchmark PC-Eval with 25 real-world complex instructions. Empirical results on PC-Eval show that our PC-Agent achieves a 32% absolute improvement of task success rate over previous state-of-the-art methods. The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/PC-Agent.
中文:PC-Agent框架通过引入具有主动感知和反思机制的分层多智能体系统,解决了PC交互的复杂性,在新型PC-Eval基准测试中实现了任务成功率32%的绝对提升。
English: The PC-Agent framework addresses complex PC interactions by introducing a hierarchical multi-agent system with active perception and reflection mechanisms, achieving a 32% improvement in task success rate on the new PC-Eval benchmark.

Authors:Hanlin Wang, Jian Wang, Chak Tou Leong, Wenjie Li
Title: STeCa: Step-level Trajectory Calibration for LLM Agent Learning
Abstract:
Large language model (LLM)-based agents have shown promise in tackling complex tasks by interacting dynamically with the environment. Existing work primarily focuses on behavior cloning from expert demonstrations or preference learning through exploratory trajectory sampling. However, these methods often struggle to address long-horizon tasks, where suboptimal actions accumulate step by step, causing agents to deviate from correct task trajectories. To address this, we highlight the importance of timely calibration and the need to automatically construct calibration trajectories for training agents. We propose Step-Level Trajectory Calibration (STeCa), a novel framework for LLM agent learning. Specifically, STeCa identifies suboptimal actions through a step-level reward comparison during exploration. It constructs calibrated trajectories using LLM-driven reflection, enabling agents to learn from improved decision-making processes. We finally leverage these calibrated trajectories with successful trajectories for reinforced training. Extensive experiments demonstrate that STeCa significantly outperforms existing methods. Further analysis highlights that timely calibration enables agents to complete tasks with greater robustness. Our code and data are available at https://github.com/WangHanLinHenry/STeCa.
Chinese: 提出的STeCa框架通过步骤级奖励比较识别次优行动,并利用基于大语言模型的反思构建校准轨迹,有效解决了现有方法在长周期任务中的不足,显著提升了智能体的任务完成能力和鲁棒性。
English: The proposed STeCa framework addresses the limitations of existing LLM agent training methods by identifying suboptimal actions through step-level reward comparisons and constructing calibrated trajectories via LLM-driven reflection, significantly enhancing performance and robustness in long-horizon tasks.

Authors:Wenhui Zhu, Xuanzhao Dong, Xin Li, Yujian Xiong, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Zhangsihao Yang, Yi Su, Oana Dumitrascu, Yalin Wang
Title: EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement
Abstract:
Over the past decade, generative models have achieved significant success in enhancement fundus images.However, the evaluation of these models still presents a considerable challenge. A comprehensive evaluation benchmark for fundus image enhancement is indispensable for three main reasons: 1) The existing denoising metrics (e.g., PSNR, SSIM) are hardly to extend to downstream real-world clinical research (e.g., Vessel morphology consistency). 2) There is a lack of comprehensive evaluation for both paired and unpaired enhancement methods, along with the need for expert protocols to accurately assess clinical value. 3) An ideal evaluation system should provide insights to inform future developments of fundus image enhancement. To this end, we propose a novel comprehensive benchmark, EyeBench, to provide insights that align enhancement models with clinical needs, offering a foundation for future work to improve the clinical relevance and applicability of generative models for fundus image enhancement. EyeBench has three appealing properties: 1) multi-dimensional clinical alignment downstream evaluation: In addition to evaluating the enhancement task, we provide several clinically significant downstream tasks for fundus images, including vessel segmentation, DR grading, denoising generalization, and lesion segmentation. 2) Medical expert-guided evaluation design: We introduce a novel dataset that promote comprehensive and fair comparisons between paired and unpaired methods and includes a manual evaluation protocol by medical experts. 3) Valuable insights: Our benchmark study provides a comprehensive and rigorous evaluation of existing methods across different downstream tasks, assisting medical experts in making informed choices. Additionally, we offer further analysis of the challenges faced by existing methods. The code is available at \url{https://github.com/Retinal-Research/EyeBench}
中文摘要:过去十年,生成模型在眼底图像增强方面取得显著进展,但其评估仍面临挑战,为此我们提出EyeBench综合基准,通过多维临床对齐评估和医学专家指导设计,使增强模型更贴合临床需求。
English Summary: Over the past decade, generative models have advanced fundus image enhancement, but their evaluation remains challenging, leading to the creation of EyeBench, a comprehensive benchmark that aligns models with clinical needs through multi-dimensional assessments and expert-guided protocols.

Authors:Jiayu Yang, Taizhang Shang, Weixuan Sun, Xibin Song, Ziang Cheng, Senbo Wang, Shenzhou Chen, Weizhe Liu, Hongdong Li, Pan Ji
Title: Pandora3D: A Comprehensive Framework for High-Quality 3D Shape and Texture Generation
Abstract:
This report presents a comprehensive framework for generating high-quality 3D shapes and textures from diverse input prompts, including single images, multi-view images, and text descriptions. The framework consists of 3D shape generation and texture generation. (1). The 3D shape generation pipeline employs a Variational Autoencoder (VAE) to encode implicit 3D geometries into a latent space and a diffusion network to generate latents conditioned on input prompts, with modifications to enhance model capacity. An alternative Artist-Created Mesh (AM) generation approach is also explored, yielding promising results for simpler geometries. (2). Texture generation involves a multi-stage process starting with frontal images generation followed by multi-view images generation, RGB-to-PBR texture conversion, and high-resolution multi-view texture refinement. A consistency scheduler is plugged into every stage, to enforce pixel-wise consistency among multi-view textures during inference, ensuring seamless integration. The pipeline demonstrates effective handling of diverse input formats, leveraging advanced neural architectures and novel methodologies to produce high-quality 3D content. This report details the system architecture, experimental results, and potential future directions to improve and expand the framework. The source code and pretrained weights are released at: https://github.com/Tencent/Tencent-XR-3DGen.
中文摘要:该框架通过基于变分自编码器的三维形状生成与多阶段纹理处理流程,结合一致性调度机制,能够从多种输入生成高质量的三维模型和纹理。
English Summary: This framework generates high-quality 3D shapes and textures from various inputs using a VAE-based shape generator with diffusion modeling and a multi-stage texture pipeline enhanced by consistency scheduling for seamless results.

Authors:Gengxu Li, Yuan Wu
Title: Asymmetric Co-Training for Source-Free Few-Shot Domain Adaptation
Abstract:
Source-free unsupervised domain adaptation (SFUDA) has gained significant attention as an alternative to traditional unsupervised domain adaptation (UDA), which relies on the constant availability of labeled source data. However, SFUDA approaches come with inherent limitations that are frequently overlooked. These challenges include performance degradation when the unlabeled target data fails to meet critical assumptions, such as having a closed-set label distribution identical to that of the source domain, or when sufficient unlabeled target data is unavailable-a common situation in real-world applications. To address these issues, we propose an asymmetric co-training (ACT) method specifically designed for the SFFSDA scenario. SFFSDA presents a more practical alternative to SFUDA, as gathering a few labeled target instances is more feasible than acquiring large volumes of unlabeled target data in many real-world contexts. Our ACT method begins by employing a weak-strong augmentation to enhance data diversity. Then we use a two-step optimization process to train the target model. In the first step, we optimize the label smoothing cross-entropy loss, the entropy of the class-conditional distribution, and the reverse-entropy loss to bolster the model's discriminative ability while mitigating overfitting. The second step focuses on reducing redundancy in the output space by minimizing classifier determinacy disparity. Extensive experiments across four benchmarks demonstrate the superiority of our ACT approach, which outperforms state-of-the-art SFUDA methods and transfer learning techniques. Our findings suggest that adapting a source pre-trained model using only a small amount of labeled target data offers a practical and dependable solution. The code is available at https://github.com/gengxuli/ACT.
中文: 提出的非对称协同训练方法通过强弱数据增强和两步优化过程解决了无源无监督域适应的局限性,仅需少量目标域标注数据即可超越现有方法的性能表现。
English: The proposed asymmetric co-training (ACT) method addresses the limitations of source-free unsupervised domain adaptation by using weak-strong data augmentation and a two-step optimization process, demonstrating superior performance over existing methods with only minimal labeled target data.

Authors:Yupeng Chang, Yi Chang, Yuan Wu
Title: Transfer-Prompting: Enhancing Cross-Task Adaptation in Large Language Models via Dual-Stage Prompts Optimization
Abstract:
Large language models (LLMs) face significant challenges when balancing multiple high-level objectives, such as generating coherent, relevant, and high-quality responses while maintaining efficient task adaptation across diverse tasks. To address these challenges, we introduce Transfer-Prompting, a novel two-stage framework designed to enhance cross-task adaptation in prompt generation. The framework comprises two key components: (1) source prompt construction, which refines the original prompts on source task datasets to generate source prompts with enhanced generalization ability, and (2) target prompt generation, which enhances cross-task adaptation of target prompts by fine-tuning a set of high-scored source prompts on task-specific datasets. In each optimization cycle, a reference LLM generates candidate prompts based on historical prompt-score pairs and task descriptions in our designed reference prompt. These candidate prompts are refined iteratively, while a scorer LLM evaluates their effectiveness using the multi-dimensional metrics designed in the objective prompts evaluator-a novel contribution in this work that provides a holistic evaluation of prompt quality and task performance. This feedback loop facilitates continuous refinement, optimizing both prompt quality and task-specific outcomes. We validate Transfer-Prompting through extensive experiments across 25 LLMs, including 7 foundational models and 18 specialized models, evaluated on 9 diverse datasets. The results demonstrate that Transfer-Prompting significantly improves task-specific performance, highlighting its potential for enhancing cross-task adaptation in LLMs. The code is available at https://github.com/llm172/Transfer-Prompting.
中文摘要:Transfer-Prompting框架通过源提示构建和目标提示生成的两阶段设计,显著提升大语言模型的跨任务适应能力,在多模型和多数据集的实验中验证了其有效性。
English Summary: The Transfer-Prompting framework enhances cross-task adaptation in LLMs through a two-stage process of source prompt construction and target prompt generation, validated by significant performance improvements across multiple models and datasets.

Authors:Michihiro Yasunaga, Luke Zettlemoyer, Marjan Ghazvininejad
Title: Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models
Abstract:
Reward models play an essential role in training vision-language models (VLMs) by assessing output quality to enable aligning with human preferences. Despite their importance, the research community lacks comprehensive open benchmarks for evaluating multimodal reward models in VLMs. To address this gap, we introduce Multimodal RewardBench, an expert-annotated benchmark covering six domains: general correctness, preference, knowledge, reasoning, safety, and visual question-answering. Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various VLMs. In evaluating a range of VLM judges, we find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy. Notably, most models struggle in the reasoning and safety domains. These findings suggest that Multimodal RewardBench offers a challenging testbed for advancing reward model development across multiple domains. We release the benchmark at https://github.com/facebookresearch/multimodal_rewardbench.
中文摘要:Multimodal RewardBench作为专家标注的多模态奖励模型评估基准涵盖六大领域,结果显示即使表现最佳的模型总体准确率仅为72%,且在推理和安全领域存在明显不足。
English Summary: Multimodal RewardBench is introduced as an expert-annotated benchmark to evaluate multimodal reward models across six domains, revealing that top models like Gemini 1.5 Pro and Claude 3.5 Sonnet achieve only 72% accuracy and struggle particularly in reasoning and safety.

Authors:Shokhrukh Ibragimov, Arnulf Jentzen, Benno Kuckuck
Title: On the logical skills of large language models: evaluations using arbitrarily complex first-order logic problems
Abstract:
We present a method of generating first-order logic statements whose complexity can be controlled along multiple dimensions. We use this method to automatically create several datasets consisting of questions asking for the truth or falsity of first-order logic statements in Zermelo-Fraenkel set theory. While the resolution of these questions does not require any knowledge beyond basic notation of first-order logic and set theory, it does require a degree of planning and logical reasoning, which can be controlled up to arbitrarily high difficulty by the complexity of the generated statements. Furthermore, we do extensive evaluations of the performance of various large language models, including recent models such as DeepSeek-R1 and OpenAI's o3-mini, on these datasets. All of the datasets along with the code used for generating them, as well as all data from the evaluations is publicly available at https://github.com/bkuckuck/logical-skills-of-llms.
Chinese: 本文提出了一种可控制复杂度的生成一阶逻辑语句的方法,并利用该方法创建数据集来评估包括DeepSeek-R1和OpenAI的o3-mini在内的多种大语言模型的逻辑推理能力。
English: This paper introduces a method for generating first-order logic statements with controllable complexity and uses it to create datasets for evaluating the logical reasoning abilities of large language models, including recent ones like DeepSeek-R1 and OpenAI's o3-mini.

Authors:Runlong He, Danyal Z. Khan, Evangelos B. Mazomenos, Hani J. Marcus, Danail Stoyanov, Matthew J. Clarkson, Mobarakol Islam
Title: PitVQA++: Vector Matrix-Low-Rank Adaptation for Open-Ended Visual Question Answering in Pituitary Surgery
Abstract:
Vision-Language Models (VLMs) in visual question answering (VQA) offer a unique opportunity to enhance intra-operative decision-making, promote intuitive interactions, and significantly advancing surgical education. However, the development of VLMs for surgical VQA is challenging due to limited datasets and the risk of overfitting and catastrophic forgetting during full fine-tuning of pretrained weights. While parameter-efficient techniques like Low-Rank Adaptation (LoRA) and Matrix of Rank Adaptation (MoRA) address adaptation challenges, their uniform parameter distribution overlooks the feature hierarchy in deep networks, where earlier layers, that learn general features, require more parameters than later ones. This work introduces PitVQA++ with an open-ended PitVQA dataset and vector matrix-low-rank adaptation (Vector-MoLoRA), an innovative VLM fine-tuning approach for adapting GPT-2 to pituitary surgery. Open-Ended PitVQA comprises around 101,803 frames from 25 procedural videos with 745,972 question-answer sentence pairs, covering key surgical elements such as phase and step recognition, context understanding, tool detection, localization, and interactions recognition. Vector-MoLoRA incorporates the principles of LoRA and MoRA to develop a matrix-low-rank adaptation strategy that employs vector ranking to allocate more parameters to earlier layers, gradually reducing them in the later layers. Our approach, validated on the Open-Ended PitVQA and EndoVis18-VQA datasets, effectively mitigates catastrophic forgetting while significantly enhancing performance over recent baselines. Furthermore, our risk-coverage analysis highlights its enhanced reliability and trustworthiness in handling uncertain predictions. Our source code and dataset is available at~\url{https://github.com/HRL-Mike/PitVQA-Plus}.
中文: 本研究提出了PitVQA++,包含开放式手术视觉问答数据集和Vector-MoLoRA创新微调方法,通过分层参数分配策略有效提升视觉语言模型在手术应用中的性能,同时防止灾难性遗忘问题。
English: This research introduces PitVQA++, featuring an open-ended surgical visual question answering dataset and Vector-MoLoRA, a novel fine-tuning method that strategically allocates parameters across network layers to enhance performance while preventing catastrophic forgetting in vision-language models for surgical applications.

Authors:Takahiko Furuya
Title: Token Adaptation via Side Graph Convolution for Efficient Fine-tuning of 3D Point Cloud Transformers
Abstract:
Parameter-efficient fine-tuning (PEFT) of pre-trained 3D point cloud Transformers has emerged as a promising technique for 3D point cloud analysis. While existing PEFT methods attempt to minimize the number of tunable parameters, they often suffer from high temporal and spatial computational costs during fine-tuning. This paper proposes a novel PEFT algorithm called Side Token Adaptation on a neighborhood Graph (STAG) to achieve superior temporal and spatial efficiency. STAG employs a graph convolutional side network operating in parallel with a frozen backbone Transformer to adapt tokens to downstream tasks. Through efficient graph convolution, parameter sharing, and reduced gradient computation, STAG significantly reduces both temporal and spatial costs for fine-tuning. We also present Point Cloud Classification 13 (PCC13), a new benchmark comprising diverse publicly available 3D point cloud datasets to facilitate comprehensive evaluation. Extensive experiments using multiple pre-trained models and PCC13 demonstrates the effectiveness of STAG. Specifically, STAG maintains classification accuracy comparable to existing methods while reducing tunable parameters to only 0.43M and achieving significant reductions in both computation time and memory consumption for fine-tuning. Code and benchmark will be available at: https://github.com/takahikof/STAG.
Chinese: 本文提出STAG参数高效微调方法,通过图卷积侧网络在保持分类精度的同时,显著降低了计算时间和内存消耗。
English: This paper introduces STAG, a parameter-efficient fine-tuning method that uses a graph convolutional side network to significantly reduce computational costs while maintaining classification accuracy comparable to existing approaches.

Authors:Yaochen Zhu, Chao Wan, Harald Steck, Dawen Liang, Yesu Feng, Nathan Kallus, Jundong Li
Title: Collaborative Retrieval for Large Language Model-based Conversational Recommender Systems
Abstract:
Conversational recommender systems (CRS) aim to provide personalized recommendations via interactive dialogues with users. While large language models (LLMs) enhance CRS with their superior understanding of context-aware user preferences, they typically struggle to leverage behavioral data, which have proven to be important for classical collaborative filtering (CF)-based approaches. For this reason, we propose CRAG, Collaborative Retrieval Augmented Generation for LLM-based CRS. To the best of our knowledge, CRAG is the first approach that combines state-of-the-art LLMs with CF for conversational recommendations. Our experiments on two publicly available movie conversational recommendation datasets, i.e., a refined Reddit dataset (which we name Reddit-v2) as well as the Redial dataset, demonstrate the superior item coverage and recommendation performance of CRAG, compared to several CRS baselines. Moreover, we observe that the improvements are mainly due to better recommendation accuracy on recently released movies. The code and data are available at https://github.com/yaochenzhu/CRAG.
中文摘要:CRAG将协同过滤与大语言模型相结合,提升了对话推荐系统的性能,在电影推荐数据集上的实验表明,该系统尤其在推荐新上映影片时表现优异,覆盖更广且准确度更高。
English Summary: CRAG integrates collaborative filtering with large language models to enhance conversational recommender systems, demonstrating superior performance and broader item coverage, especially for new releases, in experiments on movie recommendation datasets.

Authors:Xuansheng Wu, Wenhao Yu, Xiaoming Zhai, Ninghao Liu
Title: Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification
Abstract:
Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for classification model training. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant features, to guarantee regulatory compliance or improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and difficult to interpret. In this paper, we propose a novel framework to identify and regularize unintended features in the LLM latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from LLM latent spaces. To ensure the SAE can capture task-specific features, we further fine-tune it on task-specific datasets. In training the classification model, we propose a simple and effective regularizer, by minimizing the similarity between the classifier weights and the identified unintended feature, to remove the impact of these unintended features on classification. We evaluate the proposed framework on three real-world tasks, including toxic chat detection, reward modeling, and disease diagnosis. Results show that the proposed self-regularization framework can improve the classifier's generalizability by regularizing those features that are not semantically correlated to the task. This work pioneers controllable text classification on LLM latent spaces by leveraging interpreted features to address generalizability, fairness, and privacy challenges. The code and data are publicly available at https://github.com/JacksonWuxs/Controllable_LLM_Classifier.
中文: 本文提出了一种新颖框架,通过稀疏自编码器识别并正则化大语言模型嵌入中的非预期特征,从而提升文本分类器的泛化能力并解决公平性和隐私问题。
English: This paper introduces a novel framework that uses a sparse autoencoder to identify and regularize unintended features in LLM embeddings, enhancing classifier generalizability and addressing fairness and privacy concerns in text classification.

Authors:Yueqing Liang, Liangwei Yang, Chen Wang, Congying Xia, Rui Meng, Xiongxiao Xu, Haoran Wang, Ali Payani, Kai Shu
Title: Benchmarking LLMs for Political Science: A United Nations Perspective
Abstract:
Large Language Models (LLMs) have achieved significant advances in natural language processing, yet their potential for high-stake political decision-making remains largely unexplored. This paper addresses the gap by focusing on the application of LLMs to the United Nations (UN) decision-making process, where the stakes are particularly high and political decisions can have far-reaching consequences. We introduce a novel dataset comprising publicly available UN Security Council (UNSC) records from 1994 to 2024, including draft resolutions, voting records, and diplomatic speeches. Using this dataset, we propose the United Nations Benchmark (UNBench), the first comprehensive benchmark designed to evaluate LLMs across four interconnected political science tasks: co-penholder judgment, representative voting simulation, draft adoption prediction, and representative statement generation. These tasks span the three stages of the UN decision-making process--drafting, voting, and discussing--and aim to assess LLMs' ability to understand and simulate political dynamics. Our experimental analysis demonstrates the potential and challenges of applying LLMs in this domain, providing insights into their strengths and limitations in political science. This work contributes to the growing intersection of AI and political science, opening new avenues for research and practical applications in global governance. The UNBench Repository can be accessed at: https://github.com/yueqingliang1/UNBench.
中文摘要:本文提出首个基于联合国安理会数据的综合评估基准UNBench,通过四项政治学任务系统评估大语言模型在政治决策中的能力,揭示了其在模拟高风险外交进程中的潜力与局限。
English Summary: This paper introduces UNBench, the first comprehensive benchmark using UN Security Council data to evaluate Large Language Models' capabilities in political decision-making tasks, revealing both their potential and limitations in simulating high-stakes diplomatic processes.

Authors:Rui Zhao, Zeyu Zhang, Yi Xu, Yi Yao, Yan Huang, Wenxin Zhang, Zirui Song, Xiuying Chen, Yang Zhao
Title: PedDet: Adaptive Spectral Optimization for Multimodal Pedestrian Detection
Abstract:
Pedestrian detection in intelligent transportation systems has made significant progress but faces two critical challenges: (1) insufficient fusion of complementary information between visible and infrared spectra, particularly in complex scenarios, and (2) sensitivity to illumination changes, such as low-light or overexposed conditions, leading to degraded performance. To address these issues, we propose PedDet, an adaptive spectral optimization complementarity framework specifically enhanced and optimized for multispectral pedestrian detection. PedDet introduces the Multi-scale Spectral Feature Perception Module (MSFPM) to adaptively fuse visible and infrared features, enhancing robustness and flexibility in feature extraction. Additionally, the Illumination Robustness Feature Decoupling Module (IRFDM) improves detection stability under varying lighting by decoupling pedestrian and background features. We further design a contrastive alignment to enhance intermodal feature discrimination. Experiments on LLVIP and MSDS datasets demonstrate that PedDet achieves state-of-the-art performance, improving the mAP by 6.6% with superior detection accuracy even in low-light conditions, marking a significant step forward for road safety. Code will be available at https://github.com/AIGeeksGroup/PedDet.
中文: PedDet提出了一种自适应光谱优化框架,通过多尺度特征融合和光照鲁棒性模块,在复杂光照条件下将检测性能提升6.6% mAP,实现了最先进的行人检测效果。
English: PedDet introduces an adaptive spectral optimization framework with multi-scale feature fusion and illumination robustness modules, achieving state-of-the-art pedestrian detection performance with 6.6% mAP improvement under challenging lighting conditions.

Authors:Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov
Title: RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
Abstract:
Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy containing two consecutive stages. In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens. In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensionality reductions. We show that RocketKV provides a compression ratio of up to 400$\times$, end-to-end speedup of up to 3.7$\times$ as well as peak memory reduction of up to 32.6% in the decode phase on an NVIDIA A100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks. We also propose a variant of RocketKV for multi-turn scenarios, which consistently outperforms other existing methods and achieves accuracy nearly on par with an oracle top-k attention scheme. The source code is available here: https://github.com/NVlabs/RocketKV.
中文: RocketKV是一种无需训练的KV缓存压缩方法,通过粗粒度淘汰和细粒度稀疏注意力,在长上下文任务中实现高达400倍压缩和3.7倍加速,同时保持近乎无损的精度。
English: RocketKV is a training-free KV cache compression method that employs coarse-grain eviction and fine-grain sparse attention, achieving up to 400× compression and 3.7× speedup with minimal accuracy loss in long-context tasks.

Authors:Yucheng Shi, Quanzheng Li, Jin Sun, Xiang Li, Ninghao Liu
Title: Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data
Abstract:
Large Multimodal Models (LMMs), or Vision-Language Models (VLMs), have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific objectives and provide justifiable explanations for their predictions. To address the above challenge, we propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs using self-synthesized data. Specifically, visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are based on expert-defined concepts, and carefully selected based on their alignment with the image content. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality interpretable answers for the next round of tuning. This iterative process of synthetic data generation and fine-tuning progressively improves the model's ability to generate accurate and reasonable explanations. Experimental results demonstrate the effectiveness of our method in improving both the accuracy and explainability of specialized visual classification tasks.
中文: 该研究提出的视觉拒绝采样框架通过迭代生成基于专家定义概念的可验证视觉特征合成数据,并利用无奖励模型的筛选机制进行微调,有效提升了大规模多模态模型在专业视觉分类任务中的准确性和可解释性。
English: The proposed visual rejection sampling framework enhances Large Multimodal Models' fine-grained reasoning by iteratively generating and fine-tuning with synthetic data featuring human-verifiable visual features, significantly improving both accuracy and explainability in specialized visual tasks.

Authors:Peirong Zhang, Jiaxin Zhang, Jiahuan Cao, Hongliang Li, Lianwen Jin
Title: Smaller But Better: Unifying Layout Generation with Smaller Large Language Models
Abstract:
We propose LGGPT, an LLM-based model tailored for unified layout generation. First, we propose Arbitrary Layout Instruction (ALI) and Universal Layout Response (ULR) as the uniform I/O template. ALI accommodates arbitrary layout generation task inputs across multiple layout domains, enabling LGGPT to unify both task-generic and domain-generic layout generation hitherto unexplored. Collectively, ALI and ULR boast a succinct structure that forgoes superfluous tokens typically found in existing HTML-based formats, facilitating efficient instruction tuning and boosting unified generation performance. In addition, we propose an Interval Quantization Encoding (IQE) strategy that compresses ALI into a more condensed structure. IQE precisely preserves valid layout clues while eliminating the less informative placeholders, facilitating LGGPT to capture complex and variable layout generation conditions during the unified training process. Experimental results demonstrate that LGGPT achieves superior or on par performance compared to existing methods. Notably, LGGPT strikes a prominent balance between proficiency and efficiency with a compact 1.5B parameter LLM, which beats prior 7B or 175B models even in the most extensive and challenging unified scenario. Furthermore, we underscore the necessity of employing LLMs for unified layout generation and suggest that 1.5B could be an optimal parameter size by comparing LLMs of varying scales. Code is available at https://github.com/NiceRingNode/LGGPT.
Chinese: LGGPT是一种基于大语言模型的统一布局生成模型,通过任意布局指令和通用布局响应简化输入输出模板,并结合区间量化编码压缩布局数据,仅用1.5B参数就在性能与效率上超越更大模型,实现了布局生成任务的最优平衡。
English: LGGPT is a unified layout generation model that introduces Arbitrary Layout Instruction and Universal Layout Response to streamline input-output templates, along with Interval Quantization Encoding to compress layout data, achieving superior performance with a compact 1.5B parameter LLM that outperforms larger models in efficiency and effectiveness.

Authors:Xingyu Su, Haiyang Yu, Degui Zhi, Shuiwang Ji
Title: Learning to Discover Regulatory Elements for Gene Expression Prediction
Abstract:
We consider the problem of predicting gene expressions from DNA sequences. A key challenge of this task is to find the regulatory elements that control gene expressions. Here, we introduce Seq2Exp, a Sequence to Expression network explicitly designed to discover and extract regulatory elements that drive target gene expression, enhancing the accuracy of the gene expression prediction. Our approach captures the causal relationship between epigenomic signals, DNA sequences and their associated regulatory elements. Specifically, we propose to decompose the epigenomic signals and the DNA sequence conditioned on the causal active regulatory elements, and apply an information bottleneck with the Beta distribution to combine their effects while filtering out non-causal components. Our experiments demonstrate that Seq2Exp outperforms existing baselines in gene expression prediction tasks and discovers influential regions compared to commonly used statistical methods for peak detection such as MACS3. The source code is released as part of the AIRS library (https://github.com/divelab/AIRS/).
Chinese: Seq2Exp是一种新型网络,通过识别并利用DNA序列中的因果调控元件来预测基因表达,在准确性和关键区域检测方面优于现有方法。
English: Seq2Exp is a novel network that predicts gene expression by identifying and utilizing causal regulatory elements in DNA sequences, outperforming existing methods in accuracy and detection of influential regions.

Authors:Masane Fuchi, Tomohiro Takagi
Title: Erasing with Precision: Evaluating Specific Concept Erasure from Text-to-Image Generative Models
Abstract:
Studies have been conducted to prevent specific concepts from being generated from pretrained text-to-image generative models, achieving concept erasure in various ways. However, the performance evaluation of these studies is still largely reliant on visualization, with the superiority of studies often determined by human subjectivity. The metrics of quantitative evaluation also vary, making comprehensive comparisons difficult. We propose EraseEval, an evaluation method that differs from previous evaluation methods in that it involves three fundamental evaluation criteria: (1) How well does the prompt containing the target concept be reflected, (2) To what extent the concepts related to the erased concept can reduce the impact of the erased concept, and (3) Whether other concepts are preserved. These criteria are evaluated and integrated into a single metric, such that a lower score is given if any of the evaluations are low, leading to a more robust assessment. We experimentally evaluated baseline concept erasure methods, organized their characteristics, and identified challenges with them. Despite being fundamental evaluation criteria, some concept erasure methods failed to achieve high scores, which point toward future research directions for concept erasure methods. Our code is available at https://github.com/fmp453/erase-eval.
中文: 本文提出EraseEval评估框架,通过将三个核心标准整合为单一指标来系统评估文本到图像模型的概念消除效果,实验发现现有方法在此标准下存在不足,为未来研究指明了方向。
English: This paper introduces EraseEval, a novel evaluation framework for concept erasure in text-to-image models that integrates three key criteria into a single robust metric, addressing limitations in current visualization-dependent assessments and revealing challenges in existing methods through experimental analysis.

Authors:Taishi Ito, Yuki Endo, Yoshihiro Kanamori
Title: SelfAge: Personalized Facial Age Transformation Using Self-reference Images
Abstract:
Age transformation of facial images is a technique that edits age-related person's appearances while preserving the identity. Existing deep learning-based methods can reproduce natural age transformations; however, they only reproduce averaged transitions and fail to account for individual-specific appearances influenced by their life histories. In this paper, we propose the first diffusion model-based method for personalized age transformation. Our diffusion model takes a facial image and a target age as input and generates an age-edited face image as output. To reflect individual-specific features, we incorporate additional supervision using self-reference images, which are facial images of the same person at different ages. Specifically, we fine-tune a pretrained diffusion model for personalized adaptation using approximately 3 to 5 self-reference images. Additionally, we design an effective prompt to enhance the performance of age editing and identity preservation. Experiments demonstrate that our method achieves superior performance both quantitatively and qualitatively compared to existing methods. The code and the pretrained model are available at https://github.com/shiiiijp/SelfAge.
中文摘要:本文提出了一种基于扩散模型的个性化年龄转换方法,通过引入自参考图像来保留个体特征,在定量和定性评估中均优于现有技术。
English Summary: This paper introduces a personalized age transformation method using a diffusion model that incorporates self-reference images to preserve individual-specific features, outperforming existing techniques in both quantitative and qualitative evaluations.

Authors:Eduard Chelebian, Pratiti Dasgupta, Zainalabedin Samadi, Carolina Wählby, Amjad Askary
Title: Segmentation-free integration of nuclei morphology and spatial transcriptomics for retinal images
Abstract:
This study introduces SEFI (SEgmentation-Free Integration), a novel method for integrating morphological features of cell nuclei with spatial transcriptomics data. Cell segmentation poses a significant challenge in the analysis of spatial transcriptomics data, as tissue-specific structural complexities and densely packed cells in certain regions make it difficult to develop a universal approach. SEFI addresses this by utilizing self-supervised learning to extract morphological features from fluorescent nuclear staining images, enhancing the clustering of gene expression data without requiring segmentation. We demonstrate SEFI on spatially resolved gene expression profiles of the developing retina, acquired using multiplexed single molecule Fluorescence In Situ Hybridization (smFISH). SEFI is publicly available at https://github.com/eduardchelebian/sefi.
中文: 本研究提出SEFI方法,通过自监督学习从荧光核染色图像中提取形态特征,无需细胞分割即可整合空间转录组数据,并在发育视网膜数据中验证了其提升基因表达聚类的效果。
English: This study presents SEFI, a segmentation-free method that integrates nuclear morphological features with spatial transcriptomics data through self-supervised learning, improving gene expression clustering without cell segmentation, as demonstrated on developing retina data.

Authors:Yan Huang, Yongru Chen, Lei Cao, Yongnian Cao, Xuechun Yang, Yilin Dong, Tianyu Liu
Title: IncepFormerNet: A multi-scale multi-head attention network for SSVEP classification
Abstract:
In recent years, deep learning (DL) models have shown outstanding performance in EEG classification tasks, particularly in Steady-State Visually Evoked Potential(SSVEP)-based Brain-Computer-Interfaces(BCI)systems. DL methods have been successfully applied to SSVEP-BCI. This study proposes a new model called IncepFormerNet, which is a hybrid of the Inception and Transformer architectures. IncepFormerNet adeptly extracts multi-scale temporal information from time series data using parallel convolution kernels of varying sizes, accurately capturing the subtle variations and critical features within SSVEP signals.Furthermore, the model integrates the multi-head attention mechanism from the Transformer architecture, which not only provides insights into global dependencies but also significantly enhances the understanding and representation of complex patterns.Additionally, it takes advantage of filter bank techniques to extract features based on the spectral characteristics of SSVEP data. To validate the effectiveness of the proposed model, we conducted experiments on two public datasets, . The experimental results show that IncepFormerNet achieves an accuracy of 87.41 on Dataset 1 and 71.97 on Dataset 2 using a 1.0-second time window. To further verify the superiority of the proposed model, we compared it with other deep learning models, and the results indicate that our method achieves significantly higher accuracy than the others.The source codes in this work are available at: https://github.com/CECNL/SSVEP-DAN.
中文: 本研究提出了一种名为IncepFormerNet的混合模型,融合了Inception和Transformer架构,能有效提取SSVEP信号的多尺度时间特征和全局依赖关系,在公开数据集上相比其他深度学习方法取得了更高的准确率。
English: This study introduces IncepFormerNet, a hybrid model combining Inception and Transformer architectures, which effectively captures multi-scale temporal features and global dependencies in SSVEP signals, achieving superior accuracy on public datasets compared to other deep learning methods.

Authors:William Jurayj, Jeffrey Cheng, Benjamin Van Durme
Title: Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
Abstract:
Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
中文摘要:提升大型语言模型的测试时计算量不仅能提高答案准确率,还能增强对正确答案的置信度,这促使我们建立包含响应风险阈值的新型评估体系。
English Summary: Increasing test-time compute in large language models not only improves accuracy but also boosts confidence in correct answers, prompting a new evaluation approach that incorporates response risk thresholds.

Authors:Reza Averly, Frazier N. Baker, Ian A. Watson, Xia Ning
Title: LIDDIA: Language-based Intelligent Drug Discovery Agent
Abstract:
Drug discovery is a long, expensive, and complex process, relying heavily on human medicinal chemists, who can spend years searching the vast space of potential therapies. Recent advances in artificial intelligence for chemistry have sought to expedite individual drug discovery tasks; however, there remains a critical need for an intelligent agent that can navigate the drug discovery process. Towards this end, we introduce LIDDIA, an autonomous agent capable of intelligently navigating the drug discovery process in silico. By leveraging the reasoning capabilities of large language models, LIDDIA serves as a low-cost and highly-adaptable tool for autonomous drug discovery. We comprehensively examine LIDDIA , demonstrating that (1) it can generate molecules meeting key pharmaceutical criteria on over 70% of 30 clinically relevant targets, (2) it intelligently balances exploration and exploitation in the chemical space, and (3) it identifies one promising novel candidate on AR/NR3C4, a critical target for both prostate and breast cancers. Code and dataset are available at https://github.com/ninglab/LIDDiA
中文:LIDDIA是一种自主人工智能代理,利用大型语言模型智能引导药物发现过程,在生成符合药物标准的分子和识别关键癌症靶点的新候选物方面表现出色。
English: LIDDIA is an autonomous AI agent that leverages large language models to intelligently navigate the drug discovery process, demonstrating high success in generating molecules meeting pharmaceutical criteria and identifying novel candidates for critical cancer targets.

Authors:Guangzhi Xiong, Qiao Jin, Xiao Wang, Yin Fang, Haolin Liu, Yifan Yang, Fangyuan Chen, Zhixing Song, Dengyu Wang, Minjia Zhang, Zhiyong Lu, Aidong Zhang
Title: RAG-Gym: Systematic Optimization of Language Agents for Retrieval-Augmented Generation
Abstract:
Retrieval-augmented generation (RAG) has shown great promise for knowledge-intensive tasks and recently advanced with agentic RAG, where language agents engage in multi-round interactions with external knowledge sources for adaptive information retrieval. However, existing agentic RAG methods often depend on ad-hoc prompt engineering and lack a unified optimization framework. We introduce RAG-Gym, a comprehensive platform that systematically explores three optimization dimensions: (1) prompt engineering, (2) actor tuning, and (3) critic training. For prompt engineering, we propose Re$^2$Search, a novel agent incorporating reasoning reflection that significantly outperforms standard prompts. In actor tuning, we evaluate three popular post-training algorithms with fine-grained process supervision and identify direct preference optimization as the most effective. We further demonstrate that a trained critic can enhance inference by selecting higher-quality intermediate reasoning steps. Together, these findings lead to the optimized Re$^2$Search++ agent, which surpasses most recent methods like Search-R1 by a relative increase of 3.2% to 11.6% in average F1. Finally, we examine the impact of different reward sources and analyze scaling properties in training and inference, offering practical insights for agentic RAG optimization. The project homepage is available at https://rag-gym.github.io.
Chinese: 摘要介绍了RAG-Gym平台,该平台通过提示工程、行动者调优和评判器训练来优化代理式检索增强生成,最终开发的Re$^2$Search++智能体在性能指标上显著超越了现有最新方法。
English: The abstract introduces RAG-Gym, a platform that optimizes agentic RAG through prompt engineering, actor tuning, and critic training, resulting in the enhanced Re$^2$Search++ agent which significantly outperforms recent methods in performance metrics.

Authors:Jingwang Huang, Jiang Zhong, Qin Lei, Jinpeng Gao, Yuming Yang, Sirui Wang, Peiguang Li, Kaiwen Wei
Title: Latent Distribution Decoupling: A Probabilistic Framework for Uncertainty-Aware Multimodal Emotion Recognition
Abstract:
Multimodal multi-label emotion recognition (MMER) aims to identify the concurrent presence of multiple emotions in multimodal data. Existing studies primarily focus on improving fusion strategies and modeling modality-to-label dependencies. However, they often overlook the impact of \textbf{aleatoric uncertainty}, which is the inherent noise in the multimodal data and hinders the effectiveness of modality fusion by introducing ambiguity into feature representations. To address this issue and effectively model aleatoric uncertainty, this paper proposes Latent emotional Distribution Decomposition with Uncertainty perception (LDDU) framework from a novel perspective of latent emotional space probabilistic modeling. Specifically, we introduce a contrastive disentangled distribution mechanism within the emotion space to model the multimodal data, allowing for the extraction of semantic features and uncertainty. Furthermore, we design an uncertainty-aware fusion multimodal method that accounts for the dispersed distribution of uncertainty and integrates distribution information. Experimental results show that LDDU achieves state-of-the-art performance on the CMU-MOSEI and M$^3$ED datasets, highlighting the importance of uncertainty modeling in MMER. Code is available at https://github.com/201983290498/lddu\_mmer.git.
中文摘要:LDDU框架通过潜在情感空间概率建模和不确定性感知融合方法,有效解决多模态多标签情感识别中的随机不确定性,在基准数据集上取得了最优性能。
English Summary: The LDDU framework addresses aleatoric uncertainty in multimodal multi-label emotion recognition by modeling latent emotional distributions and employing uncertainty-aware fusion, achieving state-of-the-art results on benchmark datasets.

Authors:Guanzheng Chen, Xin Li, Michael Qizhe Shieh, Lidong Bing
Title: LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities through pretraining and alignment. However, superior short-context LLMs may underperform in long-context scenarios due to insufficient long-context alignment. This alignment process remains challenging due to the impracticality of human annotation for extended contexts and the difficulty in balancing short- and long-context performance. To address these challenges, we introduce LongPO, that enables short-context LLMs to self-evolve to excel on long-context tasks by internally transferring short-context capabilities. LongPO harnesses LLMs to learn from self-generated short-to-long preference data, comprising paired responses generated for identical instructions with long-context inputs and their compressed short-context counterparts, respectively. This preference reveals capabilities and potentials of LLMs cultivated during short-context alignment that may be diminished in under-aligned long-context scenarios. Additionally, LongPO incorporates a short-to-long KL constraint to mitigate short-context performance decline during long-context alignment. When applied to Mistral-7B-Instruct-v0.2 from 128K to 512K context lengths, LongPO fully retains short-context performance and largely outperforms naive SFT and DPO in both long- and short-context tasks. Specifically, LongPO-trained models can achieve results on long-context benchmarks comparable to, or even surpassing, those of superior LLMs (e.g., GPT-4-128K) that involve extensive long-context annotation and larger parameter scales. Our code is available at https://github.com/DAMO-NLP-SG/LongPO.
中文: LongPO通过让短上下文大语言模型利用内部能力转移和自生成的偏好数据来自我进化,以胜任长上下文任务,在保持短上下文性能的同时,实现了与GPT-4-128K等先进模型相媲美甚至更优的长上下文表现。
English: LongPO enables short-context LLMs to self-evolve for long-context tasks by leveraging internal capability transfer and self-generated preference data, maintaining short-context performance while achieving superior results comparable to advanced models like GPT-4-128K.

Authors:Xingbo Wang, Janessa Griffith, Daniel A. Adler, Joey Castillo, Tanzeem Choudhury, Fei Wang
Title: Exploring Personalized Health Support through Data-Driven, Theory-Guided LLMs: A Case Study in Sleep Health
Abstract:
Despite the prevalence of sleep-tracking devices, many individuals struggle to translate data into actionable improvements in sleep health. Current methods often provide data-driven suggestions but may not be feasible and adaptive to real-life constraints and individual contexts. We present HealthGuru, a novel large language model-powered chatbot to enhance sleep health through data-driven, theory-guided, and adaptive recommendations with conversational behavior change support. HealthGuru's multi-agent framework integrates wearable device data, contextual information, and a contextual multi-armed bandit model to suggest tailored sleep-enhancing activities. The system facilitates natural conversations while incorporating data-driven insights and theoretical behavior change techniques. Our eight-week in-the-wild deployment study with 16 participants compared HealthGuru to a baseline chatbot. Results show improved metrics like sleep duration and activity scores, higher quality responses, and increased user motivation for behavior change with HealthGuru. We also identify challenges and design considerations for personalization and user engagement in health chatbots.
中文摘要:HealthGuru是一种新型的基于大语言模型的聊天机器人,通过个性化睡眠建议和对话式支持,在实际应用中显著改善了用户睡眠指标并提升了参与度。
English Summary: HealthGuru is a novel LLM-powered chatbot that provides personalized, adaptive sleep recommendations and conversational support, demonstrating improved sleep metrics and user engagement in real-world testing.

Authors:Jaesung Tae, Hamish Ivison, Sachin Kumar, Arman Cohan
Title: TESS 2: A Large-Scale Generalist Diffusion Language Model
Abstract:
We introduce TESS 2, a general instruction-following diffusion language model that outperforms contemporary instruction-tuned diffusion models, as well as matches and sometimes exceeds strong autoregressive (AR) models. We train TESS 2 by first adapting a strong AR model via continued pretraining with the usual cross-entropy as diffusion loss, and then performing further instruction tuning. We find that adaptation training as well as the choice of the base model is crucial for training good instruction-following diffusion models. We further propose reward guidance, a novel and modular inference-time guidance procedure to align model outputs without needing to train the underlying model. Finally, we show that TESS 2 further improves with increased inference-time compute, highlighting the utility of diffusion LMs in having fine-grained controllability over the amount of compute used at inference time. Code and models are available at https://github.com/hamishivi/tess-2.
中文:TESS 2 是一种扩散语言模型,通过适应性训练和创新的奖励引导技术,在遵循指令方面优于同类扩散模型,并能与自回归模型相媲美,同时具备推理计算可控性。
English: TESS 2 is a diffusion language model that excels in following instructions, surpassing similar diffusion models and competing with autoregressive models through adaptation training and a novel reward guidance technique for output alignment.

Authors:Sein Kim, Hongseok Kang, Kibum Kim, Jiwan Kim, Donghyun Kim, Minchul Yang, Kwangjin Oh, Julian McAuley, Chanyoung Park
Title: Lost in Sequence: Do Large Language Models Understand Sequential Recommendation?
Abstract:
Large Language Models (LLMs) have recently emerged as promising tools for recommendation thanks to their advanced textual understanding ability and context-awareness. Despite the current practice of training and evaluating LLM-based recommendation (LLM4Rec) models under a sequential recommendation scenario, we found that whether these models understand the sequential information inherent in users' item interaction sequences has been largely overlooked. In this paper, we first demonstrate through a series of experiments that existing LLM4Rec models do not fully capture sequential information both during training and inference. Then, we propose a simple yet effective LLM-based sequential recommender, called LLM-SRec, a method that enhances the integration of sequential information into LLMs by distilling the user representations extracted from a pre-trained CF-SRec model into LLMs. Our extensive experiments show that LLM-SRec enhances LLMs' ability to understand users' item interaction sequences, ultimately leading to improved recommendation performance. Furthermore, unlike existing LLM4Rec models that require fine-tuning of LLMs, LLM-SRec achieves state-of-the-art performance by training only a few lightweight MLPs, highlighting its practicality in real-world applications. Our code is available at https://github.com/Sein-Kim/LLM-SRec.
中文: 本文发现现有基于大语言模型的推荐系统未能充分理解序列信息,并提出LLM-SRec模型,通过从预训练模型中蒸馏用户表征来增强序列建模能力,仅需训练少量参数即可实现最优性能。
English: This paper reveals that current LLM-based recommendation models inadequately capture sequential information and proposes LLM-SRec, a method that enhances sequence understanding through knowledge distillation from pre-trained models while achieving state-of-the-art performance with minimal trainable parameters.

Authors:Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, Yisong Yue
Title: DataSciBench: An LLM Agent Benchmark for Data Science
Abstract:
This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is constructed based on a more comprehensive and curated collection of natural and challenging prompts for uncertain ground truth and evaluation metrics. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics. This pipeline utilizes and implements an LLM-based self-consistency and human verification strategy to produce accurate GT by leveraging collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose an innovative Task - Function - Code (TFC) framework to assess each code execution outcome based on precisely defined metrics and programmatic rules. Our experimental framework involves testing 6 API-based models, 8 open-source general models, and 9 open-source code generation models using the diverse set of prompts we have gathered. This approach aims to provide a more comprehensive and rigorous evaluation of LLMs in data science, revealing their strengths and weaknesses. Experimental results demonstrate that API-based models outperform open-sourced models on all metrics and Deepseek-Coder-33B-Instruct achieves the highest score among open-sourced models. We release all code and data at https://github.com/THUDM/DataSciBench.
中文: 本文提出了DataSciBench,这是一个通过半自动化流程和任务-函数-代码框架来全面评估大语言模型在数据科学领域能力的新型基准测试,实验表明基于API的模型在所有指标上均优于开源模型。
English: This paper introduces DataSciBench, a novel benchmark designed to comprehensively assess Large Language Models' capabilities in data science through a semi-automated pipeline and a Task-Function-Code framework, revealing that API-based models consistently outperform open-source alternatives.

Authors:Idris Hamoud, Vinkle Srivastav, Muhammad Abdullah Jamal, Didier Mutter, Omid Mohareri, Nicolas Padoy
Title: Multi-view Video-Pose Pretraining for Operating Room Surgical Activity Recognition
Abstract:
Understanding the workflow of surgical procedures in complex operating rooms requires a deep understanding of the interactions between clinicians and their environment. Surgical activity recognition (SAR) is a key computer vision task that detects activities or phases from multi-view camera recordings. Existing SAR models often fail to account for fine-grained clinician movements and multi-view knowledge, or they require calibrated multi-view camera setups and advanced point-cloud processing to obtain better results. In this work, we propose a novel calibration-free multi-view multi-modal pretraining framework called Multiview Pretraining for Video-Pose Surgical Activity Recognition PreViPS, which aligns 2D pose and vision embeddings across camera views. Our model follows CLIP-style dual-encoder architecture: one encoder processes visual features, while the other encodes human pose embeddings. To handle the continuous 2D human pose coordinates, we introduce a tokenized discrete representation to convert the continuous 2D pose coordinates into discrete pose embeddings, thereby enabling efficient integration within the dual-encoder framework. To bridge the gap between these two modalities, we propose several pretraining objectives using cross- and in-modality geometric constraints within the embedding space and incorporating masked pose token prediction strategy to enhance representation learning. Extensive experiments and ablation studies demonstrate improvements over the strong baselines, while data-efficiency experiments on two distinct operating room datasets further highlight the effectiveness of our approach. We highlight the benefits of our approach for surgical activity recognition in both multi-view and single-view settings, showcasing its practical applicability in complex surgical environments. Code will be made available at: https://github.com/CAMMA-public/PreViPS.
Chinese: 本文提出PreViPS,一种免校准的多视角多模态预训练框架,通过跨视角对齐2D姿态与视觉嵌入,解决了现有手术活动识别模型在细粒度动作和多视角知识整合方面的不足。
English: This paper introduces PreViPS, a calibration-free multi-view multi-modal pretraining framework that aligns 2D pose and vision embeddings across camera views to improve surgical activity recognition by addressing limitations in existing models regarding fine-grained movements and multi-view integration.

Authors:Jiahao Liu, Xueshuo Yan, Dongsheng Li, Guangping Zhang, Hansu Gu, Peng Zhang, Tun Lu, Li Shang, Ning Gu
Title: Improving LLM-powered Recommendations with Personalized Information
Abstract:
Due to the lack of explicit reasoning modeling, existing LLM-powered recommendations fail to leverage LLMs' reasoning capabilities effectively. In this paper, we propose a pipeline called CoT-Rec, which integrates two key Chain-of-Thought (CoT) processes -- user preference analysis and item perception analysis -- into LLM-powered recommendations, thereby enhancing the utilization of LLMs' reasoning abilities. CoT-Rec consists of two stages: (1) personalized information extraction, where user preferences and item perception are extracted, and (2) personalized information utilization, where this information is incorporated into the LLM-powered recommendation process. Experimental results demonstrate that CoT-Rec shows potential for improving LLM-powered recommendations. The implementation is publicly available at https://github.com/jhliu0807/CoT-Rec.
中文: 现有基于大语言模型的推荐系统未能充分利用其推理能力,因此我们提出CoT-Rec框架,通过整合思维链过程分析用户偏好和物品认知,有效提升了推荐性能。
English: Current LLM-based recommendations underutilize reasoning capabilities, so we propose CoT-Rec, a pipeline incorporating Chain-of-Thought processes for user preference and item perception analysis to enhance recommendation effectiveness.

Authors:Jiahao Liu, Shengkang Gu, Dongsheng Li, Guangping Zhang, Mingzhe Han, Hansu Gu, Peng Zhang, Tun Lu, Li Shang, Ning Gu
Title: AgentCF++: Memory-enhanced LLM-based Agents for Popularity-aware Cross-domain Recommendations
Abstract:
LLM-based user agents, which simulate user interaction behavior, are emerging as a promising approach to enhancing recommender systems. In real-world scenarios, users' interactions often exhibit cross-domain characteristics and are influenced by others. However, the memory design in current methods causes user agents to introduce significant irrelevant information during decision-making in cross-domain scenarios and makes them unable to recognize the influence of other users' interactions, such as popularity factors. To tackle this issue, we propose a dual-layer memory architecture combined with a two-step fusion mechanism. This design avoids irrelevant information during decision-making while ensuring effective integration of cross-domain preferences. We also introduce the concepts of interest groups and group-shared memory to better capture the influence of popularity factors on users with similar interests. Comprehensive experiments validate the effectiveness of AgentCF++. Our code is available at https://github.com/jhliu0807/AgentCF-plus.
中文摘要:该研究提出的双层记忆架构与两步融合机制解决了现有LLM用户代理的不足,既能有效过滤跨域场景中的无关信息,又通过群体共享记忆模块准确捕捉流行度对相似兴趣用户的影响。
English Summary: The proposed dual-layer memory architecture and two-step fusion mechanism address the limitations of current LLM-based user agents by effectively filtering irrelevant information in cross-domain scenarios and incorporating group-shared memory to capture popularity influences.

Authors:Jiahao Liu, Dongsheng Li, Hansu Gu, Peng Zhang, Tun Lu, Li Shang, Ning Gu
Title: Unbiased Collaborative Filtering with Fair Sampling
Abstract:
Recommender systems leverage extensive user interaction data to model preferences; however, directly modeling these data may introduce biases that disproportionately favor popular items. In this paper, we demonstrate that popularity bias arises from the influence of propensity factors during training. Building on this insight, we propose a fair sampling (FS) method that ensures each user and each item has an equal likelihood of being selected as both positive and negative instances, thereby mitigating the influence of propensity factors. The proposed FS method does not require estimating propensity scores, thus avoiding the risk of failing to fully eliminate popularity bias caused by estimation inaccuracies. Comprehensive experiments demonstrate that the proposed FS method achieves state-of-the-art performance in both point-wise and pair-wise recommendation tasks. The code implementation is available at https://github.com/jhliu0807/Fair-Sampling.
中文摘要:本文发现推荐系统中的流行度偏差源于训练过程中的倾向性因素,并提出一种公平采样方法,无需估计倾向性分数即可确保用户和项目被平等选为正负实例,从而实现了最先进的性能表现。
English Summary: This paper identifies that popularity bias in recommender systems stems from propensity factors during training and introduces a fair sampling method that ensures equal selection probability for users and items without needing propensity score estimation, achieving state-of-the-art performance.

Authors:Zenan Li, Zhaoyu Li, Wen Tang, Xian Zhang, Yuan Yao, Xujie Si, Fan Yang, Kaiyu Yang, Xiaoxing Ma
Title: Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning
Abstract:
Large language models (LLMs) can prove mathematical theorems formally by generating proof steps (\textit{a.k.a.} tactics) within a proof system. However, the space of possible tactics is vast and complex, while the available training data for formal proofs is limited, posing a significant challenge to LLM-based tactic generation. To address this, we introduce a neuro-symbolic tactic generator that synergizes the mathematical intuition learned by LLMs with domain-specific insights encoded by symbolic methods. The key aspect of this integration is identifying which parts of mathematical reasoning are best suited to LLMs and which to symbolic methods. While the high-level idea of neuro-symbolic integration is broadly applicable to various mathematical problems, in this paper, we focus specifically on Olympiad inequalities (Figure~1). We analyze how humans solve these problems and distill the techniques into two types of tactics: (1) scaling, handled by symbolic methods, and (2) rewriting, handled by LLMs. In addition, we combine symbolic tools with LLMs to prune and rank the proof goals for efficient proof search. We evaluate our framework on 161 challenging inequalities from multiple mathematics competitions, achieving state-of-the-art performance and significantly outperforming existing LLM and symbolic approaches without requiring additional training data.
Chinese: 大型语言模型通过神经符号策略生成器,结合其数学直觉与符号方法的领域知识,在无需额外训练数据的情况下,在复杂不等式问题上实现了最优性能。
English: Large language models (LLMs) are enhanced by a neuro-symbolic tactic generator that combines their learned mathematical intuition with symbolic methods' domain insights, achieving state-of-the-art results on challenging inequalities without extra training data.

Authors:Matthew Wood, Mathieu Klop, Maxime Allard
Title: Helix-mRNA: A Hybrid Foundation Model For Full Sequence mRNA Therapeutics
Abstract:
mRNA-based vaccines have become a major focus in the pharmaceutical industry. The coding sequence as well as the Untranslated Regions (UTRs) of an mRNA can strongly influence translation efficiency, stability, degradation, and other factors that collectively determine a vaccine's effectiveness. However, optimizing mRNA sequences for those properties remains a complex challenge. Existing deep learning models often focus solely on coding region optimization, overlooking the UTRs. We present Helix-mRNA, a structured state-space-based and attention hybrid model to address these challenges. In addition to a first pre-training, a second pre-training stage allows us to specialise the model with high-quality data. We employ single nucleotide tokenization of mRNA sequences with codon separation, ensuring prior biological and structural information from the original mRNA sequence is not lost. Our model, Helix-mRNA, outperforms existing methods in analysing both UTRs and coding region properties. It can process sequences 6x longer than current approaches while using only 10% of the parameters of existing foundation models. Its predictive capabilities extend to all mRNA regions. We open-source the model (https://github.com/helicalAI/helical) and model weights (https://huggingface.co/helical-ai/helix-mRNA).
Chinese: Helix-mRNA 是一种混合深度学习模型,能同时优化 mRNA 的编码区和非翻译区,在效率和预测能力上超越现有方法,且参数更少、可处理更长序列。
English: Helix-mRNA is a hybrid deep learning model that optimizes both coding and untranslated regions of mRNA sequences, outperforming existing methods in efficiency and predictive capabilities while using fewer parameters and handling longer sequences.

Authors:Jiaqi Li, Xizhong Guo, Yang Zhao, Lvyang Zhang, Lidong Zhai
Title: Poster: SpiderSim: Multi-Agent Driven Theoretical Cybersecurity Simulation for Industrial Digitalization
Abstract:
Rapid industrial digitalization has created intricate cybersecurity demands that necessitate effective validation methods. While cyber ranges and simulation platforms are widely deployed, they frequently face limitations in scenario diversity and creation efficiency. In this paper, we present SpiderSim, a theoretical cybersecurity simulation platform enabling rapid and lightweight scenario generation for industrial digitalization security research. At its core, our platform introduces three key innovations: a structured framework for unified scenario modeling, a multi-agent collaboration mechanism for automated generation, and modular atomic security capabilities for flexible scenario composition. Extensive implementation trials across multiple industrial digitalization contexts, including marine ranch monitoring systems, validate our platform's capacity for broad scenario coverage with efficient generation processes. Built on solid theoretical foundations and released as open-source software, SpiderSim facilitates broader research and development in automated security testing for industrial digitalization.
中文: SpiderSim是一个理论性网络安全仿真平台,通过统一场景建模、自动化多智能体协作和模块化安全能力三大创新,为工业数字化安全研究提供快速轻量的场景生成方案。
English: SpiderSim is a theoretical cybersecurity simulation platform that enables rapid, lightweight scenario generation for industrial digitalization security research through three key innovations: unified scenario modeling, automated multi-agent collaboration, and modular security capabilities.

Authors:Keqin Peng, Liang Ding, Yuanxin Ouyang, Meng Fang, Yancheng Yuan, Dacheng Tao
Title: Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding
Abstract:
Large language models (LLMs) excel at a range of tasks through in-context learning (ICL), where only a few task examples guide their predictions. However, prior research highlights that LLMs often overlook input-label mapping information in ICL, relying more on their pre-trained knowledge. To address this issue, we introduce In-Context Contrastive Decoding (ICCD), a novel method that emphasizes input-label mapping by contrasting the output distributions between positive and negative in-context examples. Experiments on 7 natural language understanding (NLU) tasks show that our ICCD method brings consistent and significant improvement (up to +1.8 improvement on average) upon 6 different scales of LLMs without requiring additional training. Our approach is versatile, enhancing performance with various demonstration selection methods, demonstrating its broad applicability and effectiveness. The code and scripts are released at https://github.com/Romainpkq/CD_ICL.
中文摘要:提出的上下文对比解码(ICCD)方法通过对比分析强化输入标签映射,无需额外训练即可在多类自然语言理解任务中持续提升大语言模型的性能表现。
English Summary: The proposed In-Context Contrastive Decoding (ICCD) method improves large language models' performance by emphasizing input-label mapping through contrastive analysis, achieving consistent gains across multiple NLU tasks without additional training.

Authors:Taewoo Kim, Yujeong Chae, Hyun-Kurl Jang, Kuk-Jin Yoon
Title: Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields
Abstract:
Video Frame Interpolation (VFI) aims to generate intermediate video frames between consecutive input frames. Since the event cameras are bio-inspired sensors that only encode brightness changes with a micro-second temporal resolution, several works utilized the event camera to enhance the performance of VFI. However, existing methods estimate bidirectional inter-frame motion fields with only events or approximations, which can not consider the complex motion in real-world scenarios. In this paper, we propose a novel event-based VFI framework with cross-modal asymmetric bidirectional motion field estimation. In detail, our EIF-BiOFNet utilizes each valuable characteristic of the events and images for direct estimation of inter-frame motion fields without any approximation methods. Moreover, we develop an interactive attention-based frame synthesis network to efficiently leverage the complementary warping-based and synthesis-based features. Finally, we build a large-scale event-based VFI dataset, ERF-X170FPS, with a high frame rate, extreme motion, and dynamic textures to overcome the limitations of previous event-based VFI datasets. Extensive experimental results validate that our method shows significant performance improvement over the state-of-the-art VFI methods on various datasets. Our project pages are available at: https://github.com/intelpro/CBMNet
中文: 本文提出了一种新颖的基于事件的视频帧插值框架,通过跨模态非对称双向运动场估计和交互式注意力合成网络,在新构建的高帧率数据集上实现了超越现有方法的性能表现。
English: This paper introduces a novel event-based video frame interpolation framework that utilizes cross-modal asymmetric bidirectional motion field estimation and an interactive attention-based synthesis network, achieving superior performance on a newly created high-frame-rate dataset.

Authors:Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, Yu Cheng
Title: MoM: Linear Sequence Modeling with Mixture-of-Memories
Abstract:
Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive downstream tasks. Drawing inspiration from neuroscience, particularly the brain's ability to maintain robust long-term memory while mitigating "memory interference", we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training, while constant-complexity during inference. Our experimental results show that MoM significantly outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.
Chinese: Mixture-of-Memories (MoM) 架构通过采用带路由网络的多个独立记忆状态,显著提升了线性序列模型在记忆密集型任务上的性能,同时保持了线性训练复杂度和常数推理复杂度的优势。
English: The Mixture-of-Memories (MoM) architecture enhances linear sequence models by employing multiple independent memory states with a router network, significantly improving performance on recall-intensive tasks while maintaining linear training and constant inference complexity.

Authors:Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, Yu Cheng
Title: MoM: Linear Sequence Modeling with Mixture-of-Memories
Abstract:
Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive tasks. To address this limitation, we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. MoM serves as a general framework that can be seamlessly combined with diverse memory update mechanisms across linear models. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training, while constant-complexity during inference. Our experimental results show that MoM outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.
Chinese: Mixture-of-Memories (MoM) 架构通过采用带路由网络的多个独立记忆状态,显著提升了线性序列模型在记忆密集型任务上的性能,同时保持了线性训练复杂度和常数推理复杂度的优势。
English: The Mixture-of-Memories (MoM) architecture enhances linear sequence models by employing multiple independent memory states with a router network, significantly improving performance on recall-intensive tasks while maintaining linear training and constant inference complexity.

Authors:Ruida Hu, Chao Peng, Xinchen Wang, Junjielong Xu, Cuiyun Gao
Title: Repo2Run: Automated Building Executable Environment for Code Repository at Scale
Abstract:
Scaling up executable code data is significant for improving language models' software engineering capability. The intricate nature of the process makes it labor-intensive, time-consuming and expert-knowledge-dependent to build a large number of executable code repositories, limiting the scalability of existing work based on running tests. The primary bottleneck lies in the automated building of test environments for different repositories, which is an essential yet underexplored task. To mitigate the gap, we introduce Repo2Run, the first LLM-based agent aiming at automating the building of executable test environments for any repositories at scale. Specifically, given a code repository, Repo2Run iteratively builds the Docker image, runs unit tests based on the feedback of the building, and synthesizes the Dockerfile until the entire pipeline is executed successfully. The resulting Dockerfile can then be used to create Docker container environments for running code and tests. We created a benchmark containing 420 Python repositories with unit tests for evaluation. The results illustrate that Repo2Run achieves an 86.0% success rate, outperforming SWE-agent by 77.0%. The resources of Repo2Run are available at https://github.com/bytedance/Repo2Run.
中文: 扩展可执行代码数据对提升语言模型的软件工程能力至关重要,Repo2Run作为首个基于大语言模型的代理,能自动为代码仓库构建测试环境,成功率达到86.0%,远超现有方法。
English: Scaling executable code data is crucial for enhancing language models' software engineering capabilities, and Repo2Run, an LLM-based agent, automates the building of test environments for repositories, achieving an 86.0% success rate and significantly outperforming existing methods.

Authors:Tim Baumgärtner, Ted Briscoe, Iryna Gurevych
Title: PeerQA: A Scientific Question Answering Dataset from Peer Reviews
Abstract:
We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which contain questions that reviewers raised while thoroughly examining the scientific article. Answers have been annotated by the original authors of each paper. The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP, as well as a subset of other scientific communities like Geoscience and Public Health. PeerQA supports three critical tasks for developing practical QA systems: Evidence retrieval, unanswerable question classification, and answer generation. We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks. Our experiments and analyses reveal the need for decontextualization in document-level retrieval, where we find that even simple decontextualization approaches consistently improve retrieval performance across architectures. On answer generation, PeerQA serves as a challenging benchmark for long-context modeling, as the papers have an average size of 12k tokens. Our code and data is available at https://github.com/UKPLab/peerqa.
中文: PeerQA是一个基于同行评审问题的科学文档级问答数据集,包含作者标注的答案,支持证据检索、不可回答问题分类和答案生成三大任务,并揭示了去语境化对提升检索性能的关键作用。
English: PeerQA is a scientific document-level QA dataset derived from peer review questions with author-annotated answers, supporting evidence retrieval, unanswerable question classification, and answer generation tasks while demonstrating the importance of decontextualization for retrieval performance.

Authors:Rokuto Nagata, Kenji Koide, Yuki Hayakawa, Ryo Suzuki, Kazuma Ikeda, Ozora Sako, Qi Alfred Chen, Takami Sato, Kentaro Yoshioka
Title: SLAMSpoof: Practical LiDAR Spoofing Attacks on Localization Systems Guided by Scan Matching Vulnerability Analysis
Abstract:
Accurate localization is essential for enabling modern full self-driving services. These services heavily rely on map-based traffic information to reduce uncertainties in recognizing lane shapes, traffic light locations, and traffic signs. Achieving this level of reliance on map information requires centimeter-level localization accuracy, which is currently only achievable with LiDAR sensors. However, LiDAR is known to be vulnerable to spoofing attacks that emit malicious lasers against LiDAR to overwrite its measurements. Once localization is compromised, the attack could lead the victim off roads or make them ignore traffic lights. Motivated by these serious safety implications, we design SLAMSpoof, the first practical LiDAR spoofing attack on localization systems for self-driving to assess the actual attack significance on autonomous vehicles. SLAMSpoof can effectively find the effective attack location based on our scan matching vulnerability score (SMVS), a point-wise metric representing the potential vulnerability to spoofing attacks. To evaluate the effectiveness of the attack, we conduct real-world experiments on ground vehicles and confirm its high capability in real-world scenarios, inducing position errors of $\geq$4.2 meters (more than typical lane width) for all 3 popular LiDAR-based localization algorithms. We finally discuss the potential countermeasures of this attack. Code is available at https://github.com/Keio-CSG/slamspoof
中文摘要:SLAMSpoof是针对自动驾驶车辆定位系统的实用激光雷达欺骗攻击,通过扫描匹配漏洞评分发现有效攻击点,能引发超过车道宽度的定位偏差,严重威胁道路安全。
English Summary: SLAMSpoof is a practical LiDAR spoofing attack that exploits localization vulnerabilities in self-driving vehicles, causing position errors exceeding lane width and compromising road safety.

Authors:Yuanyuan Xu, Hanchen Wang, Wenjie Zhang, Lexing Xie, Yin Chen, Flora Salim, Ying Zhang, Justin Gooding, Toby Walsh
Title: AI-Empowered Catalyst Discovery: A Survey from Classical Machine Learning Approaches to Large Language Models
Abstract:
Catalysts are essential for accelerating chemical reactions and enhancing selectivity, which is crucial for the sustainable production of energy, materials, and bioactive compounds. Catalyst discovery is fundamental yet challenging in computational chemistry and has garnered significant attention due to the promising performance of advanced Artificial Intelligence (AI) techniques. The development of Large Language Models (LLMs) notably accelerates progress in the discovery of both homogeneous and heterogeneous catalysts, where their chemical reactions differ significantly in material phases, temperature, dynamics, etc. However, there is currently no comprehensive survey that discusses the progress and latest developments in both areas, particularly with the application of LLM techniques. To address this gap, this paper presents a thorough and systematic survey of AI-empowered catalyst discovery, employing a unified and general categorization for homogeneous and heterogeneous catalysts. We examine the progress of AI-empowered catalyst discovery, highlighting their individual advantages and disadvantages, and discuss the challenges faced in this field. Furthermore, we suggest potential directions for future research from the perspective of computer science. Our goal is to assist researchers in computational chemistry, computer science, and related fields in easily tracking the latest advancements, providing a clear overview and roadmap of this area. We also organize and make accessible relevant resources, including article lists and datasets, in an open repository at https://github.com/LuckyGirl-XU/Awesome-Artificial-Intelligence-Empowered-Catalyst-Discovery.
中文摘要:本文对人工智能驱动的催化剂发现进行全面综述,系统梳理均相与非均相催化剂的研究进展,探讨当前挑战与未来方向,旨在为跨学科研究者提供清晰领域概览与资源支持。
English Summary: This paper provides a comprehensive survey of AI-driven catalyst discovery, systematically reviewing progress in both homogeneous and heterogeneous catalysts while addressing current challenges and future research directions to assist interdisciplinary researchers.

Authors:Zheng Wu, Yiping Xie, Bo Zhao, Jiguang He, Fei Luo, Ning Deng, Zitong Yu
Title: CardiacMamba: A Multimodal RGB-RF Fusion Framework with State Space Models for Remote Physiological Measurement
Abstract:
Heart rate (HR) estimation via remote photoplethysmography (rPPG) offers a non-invasive solution for health monitoring. However, traditional single-modality approaches (RGB or Radio Frequency (RF)) face challenges in balancing robustness and accuracy due to lighting variations, motion artifacts, and skin tone bias. In this paper, we propose CardiacMamba, a multimodal RGB-RF fusion framework that leverages the complementary strengths of both modalities. It introduces the Temporal Difference Mamba Module (TDMM) to capture dynamic changes in RF signals using timing differences between frames, enhancing the extraction of local and global features. Additionally, CardiacMamba employs a Bidirectional SSM for cross-modal alignment and a Channel-wise Fast Fourier Transform (CFFT) to effectively capture and refine the frequency domain characteristics of RGB and RF signals, ultimately improving heart rate estimation accuracy and periodicity detection. Extensive experiments on the EquiPleth dataset demonstrate state-of-the-art performance, achieving marked improvements in accuracy and robustness. CardiacMamba significantly mitigates skin tone bias, reducing performance disparities across demographic groups, and maintains resilience under missing-modality scenarios. By addressing critical challenges in fairness, adaptability, and precision, the framework advances rPPG technology toward reliable real-world deployment in healthcare. The codes are available at: https://github.com/WuZheng42/CardiacMamba.
中文: CardiacMamba是一种多模态RGB-RF融合框架,通过整合互补特征提升心率估计性能,在医疗应用中实现了更高的准确性、鲁棒性并显著降低了肤色偏差。
English: CardiacMamba is a multimodal RGB-RF fusion framework that enhances heart rate estimation by integrating complementary features, achieving superior accuracy, robustness, and reduced skin tone bias in real-world healthcare applications.

Authors:DongGeon Lee, Hwanjo Yu
Title: REFIND at SemEval-2025 Task 3: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models
Abstract:
Hallucinations in large language model (LLM) outputs severely limit their reliability in knowledge-intensive tasks such as question answering. To address this challenge, we introduce REFIND (Retrieval-augmented Factuality hallucINation Detection), a novel framework that detects hallucinated spans within LLM outputs by directly leveraging retrieved documents. As part of the REFIND, we propose the Context Sensitivity Ratio (CSR), a novel metric that quantifies the sensitivity of LLM outputs to retrieved evidence. This innovative approach enables REFIND to efficiently and accurately detect hallucinations, setting it apart from existing methods. In the evaluation, REFIND demonstrated robustness across nine languages, including low-resource settings, and significantly outperformed baseline models, achieving superior IoU scores in identifying hallucinated spans. This work highlights the effectiveness of quantifying context sensitivity for hallucination detection, thereby paving the way for more reliable and trustworthy LLM applications across diverse languages. Our code is available at https://github.com/oneonlee/REFIND.
中文摘要:REFIND框架通过检索文档和创新的语境敏感度比率指标,能有效检测大语言模型输出中的幻觉内容,在多语言环境下表现出强大性能并显著优于基线模型。
English Summary: The REFIND framework effectively detects hallucinations in LLM outputs by using retrieved documents and a novel Context Sensitivity Ratio metric, demonstrating robust performance across multiple languages and outperforming baseline models.

Authors:Ziming Hong, Yongli Xiang, Tongliang Liu
Title: Toward Robust Non-Transferable Learning: A Survey and Benchmark
Abstract:
Over the past decades, researchers have primarily focused on improving the generalization abilities of models, with limited attention given to regulating such generalization. However, the ability of models to generalize to unintended data (e.g., harmful or unauthorized data) can be exploited by malicious adversaries in unforeseen ways, potentially resulting in violations of model ethics. Non-transferable learning (NTL), a task aimed at reshaping the generalization abilities of deep learning models, was proposed to address these challenges. While numerous methods have been proposed in this field, a comprehensive review of existing progress and a thorough analysis of current limitations remain lacking. In this paper, we bridge this gap by presenting the first comprehensive survey on NTL and introducing NTLBench, the first benchmark to evaluate NTL performance and robustness within a unified framework. Specifically, we first introduce the task settings, general framework, and criteria of NTL, followed by a summary of NTL approaches. Furthermore, we emphasize the often-overlooked issue of robustness against various attacks that can destroy the non-transferable mechanism established by NTL. Experiments conducted via NTLBench verify the limitations of existing NTL methods in robustness. Finally, we discuss the practical applications of NTL, along with its future directions and associated challenges.
Chinese Summary: 本文首次对不可迁移学习进行全面综述,提出了首个评估NTL方法性能与鲁棒性的基准NTLBench,同时揭示了现有方法在抗攻击鲁棒性方面的局限性。
English Summary: This paper presents the first comprehensive survey on non-transferable learning (NTL), introducing NTLBench as the inaugural benchmark to evaluate NTL methods' performance and robustness while highlighting their limitations against attacks.

Authors:Yupeng Hou, Jianmo Ni, Zhankui He, Noveen Sachdeva, Wang-Cheng Kang, Ed H. Chi, Julian McAuley, Derek Zhiyuan Cheng
Title: ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation
Abstract:
Generative recommendation (GR) is an emerging paradigm where user actions are tokenized into discrete token patterns and autoregressively generated as predictions. However, existing GR models tokenize each action independently, assigning the same fixed tokens to identical actions across all sequences without considering contextual relationships. This lack of context-awareness can lead to suboptimal performance, as the same action may hold different meanings depending on its surrounding context. To address this issue, we propose ActionPiece to explicitly incorporate context when tokenizing action sequences. In ActionPiece, each action is represented as a set of item features. Given the action sequence corpora, we construct the vocabulary by merging feature patterns as new tokens, based on their co-occurrence frequency both within individual sets and across adjacent sets. Considering the unordered nature of feature sets, we further introduce set permutation regularization, which produces multiple segmentations of action sequences with the same semantics. Our code is available at: https://github.com/google-deepmind/action_piece.
中文摘要:传统生成式推荐模型独立地对动作进行标记化,而提出的ActionPiece方法通过引入项目特征和共现模式来增强上下文感知能力,从而提升标记化准确性。
English Summary: Generative recommendation models traditionally tokenize actions independently, but the proposed ActionPiece method enhances context-awareness by incorporating item features and co-occurrence patterns to improve tokenization accuracy.

Authors:Coleman Hooper, Sehoon Kim, Suhong Moon, Kerem Dilmen, Monishwaran Maheswaran, Nicholas Lee, Michael W. Mahoney, Sophia Shao, Kurt Keutzer, Amir Gholami
Title: ETS: Efficient Tree Search for Inference-Time Scaling
Abstract:
Test-time compute scaling has emerged as a new axis along which to improve model accuracy, where additional computation is used at inference time to allow the model to think longer for more challenging problems. One promising approach for test-time compute scaling is search against a process reward model, where a model generates multiple potential candidates at each step of the search, and these partial trajectories are then scored by a separate reward model in order to guide the search process. The diversity of trajectories in the tree search process affects the accuracy of the search, since increasing diversity promotes more exploration. However, this diversity comes at a cost, as divergent trajectories have less KV sharing, which means they consume more memory and slow down the search process. Previous search methods either do not perform sufficient exploration, or else explore diverse trajectories but have high latency. We address this challenge by proposing Efficient Tree Search (ETS), which promotes KV sharing by pruning redundant trajectories while maintaining necessary diverse trajectories. ETS incorporates a linear programming cost model to promote KV cache sharing by penalizing the number of nodes retained, while incorporating a semantic coverage term into the cost model to ensure that we retain trajectories which are semantically different. We demonstrate how ETS can achieve 1.8$\times$ reduction in average KV cache size during the search process, leading to 1.4$\times$ increased throughput relative to prior state-of-the-art methods, with minimal accuracy degradation and without requiring any custom kernel implementation. Code is available at: https://github.com/SqueezeAILab/ETS.
Chinese: 高效树搜索(ETS)通过剪枝冗余轨迹来优化测试时计算扩展,增强KV缓存共享,从而在保持精度的同时降低内存使用并提升吞吐量。
English: Efficient Tree Search (ETS) is a method that optimizes test-time compute scaling by pruning redundant trajectories to enhance KV cache sharing, thereby reducing memory usage and increasing throughput without significant accuracy loss.

Authors:Yuan Yao, Xiaopu Zhang, Yu Zhang, Jian Jin, Qiang Yang
Title: Noise May Contain Transferable Knowledge: Understanding Semi-supervised Heterogeneous Domain Adaptation from an Empirical Perspective
Abstract:
Semi-supervised heterogeneous domain adaptation (SHDA) addresses learning across domains with distinct feature representations and distributions, where source samples are labeled while most target samples are unlabeled, with only a small fraction labeled. Moreover, there is no one-to-one correspondence between source and target samples. Although various SHDA methods have been developed to tackle this problem, the nature of the knowledge transferred across heterogeneous domains remains unclear. This paper delves into this question from an empirical perspective. We conduct extensive experiments on about 330 SHDA tasks, employing two supervised learning methods and seven representative SHDA methods. Surprisingly, our observations indicate that both the category and feature information of source samples do not significantly impact the performance of the target domain. Additionally, noise drawn from simple distributions, when used as source samples, may contain transferable knowledge. Based on this insight, we perform a series of experiments to uncover the underlying principles of transferable knowledge in SHDA. Specifically, we design a unified Knowledge Transfer Framework (KTF) for SHDA. Based on the KTF, we find that the transferable knowledge in SHDA primarily stems from the transferability and discriminability of the source domain. Consequently, ensuring those properties in source samples, regardless of their origin (e.g., image, text, noise), can enhance the effectiveness of knowledge transfer in SHDA tasks. The codes and datasets are available at https://github.com/yyyaoyuan/SHDA.
中文: 本研究揭示了在半监督异构域自适应中,可迁移知识主要源于源域的可迁移性和可区分性,而非其具体内容,并证明只要确保这些特性,即便是合成噪声也能促进有效的知识迁移。
English: This study reveals that in semi-supervised heterogeneous domain adaptation, transferable knowledge primarily depends on the source domain's transferability and discriminability, rather than its specific content, and demonstrates that even synthetic noise can facilitate effective knowledge transfer when these properties are ensured.

Authors:Guangwei Li, Yuansen Zhang, Yinggui Wang, Shoumeng Yan, Lei Wang, Tao Wei
Title: PRIV-QA: Privacy-Preserving Question Answering for Cloud Large Language Models
Abstract:
The rapid development of large language models (LLMs) is redefining the landscape of human-computer interaction, and their integration into various user-service applications is becoming increasingly prevalent. However, transmitting user data to cloud-based LLMs presents significant risks of data breaches and unauthorized access to personal identification information. In this paper, we propose a privacy preservation pipeline for protecting privacy and sensitive information during interactions between users and LLMs in practical LLM usage scenarios. We construct SensitiveQA, the first privacy open-ended question-answering dataset. It comprises 57k interactions in Chinese and English, encompassing a diverse range of user-sensitive information within the conversations. Our proposed solution employs a multi-stage strategy aimed at preemptively securing user information while simultaneously preserving the response quality of cloud-based LLMs. Experimental validation underscores our method's efficacy in balancing privacy protection with maintaining robust interaction quality. The code and dataset are available at https://github.com/ligw1998/PRIV-QA.
中文: 本文提出了一种隐私保护流程,能在用户与大型语言模型交互时有效防护敏感数据泄露,同时保持云端模型的应答质量,并通过新构建的多语言数据集SensitiveQA验证了其有效性。
English: This paper introduces a privacy preservation pipeline that effectively safeguards sensitive user data during interactions with cloud-based large language models while maintaining response quality, as validated through a newly constructed multilingual dataset called SensitiveQA.

Authors:Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Yang You, Guiming Xie, Xuejian Gong, Kunlong Zhou
Title: Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models
Abstract:
Large Language Models (LLMs) have significantly advanced natural language processing with exceptional task generalization capabilities. Low-Rank Adaption (LoRA) offers a cost-effective fine-tuning solution, freezing the original model parameters and training only lightweight, low-rank adapter matrices. However, the memory footprint of LoRA is largely dominated by the original model parameters. To mitigate this, we propose LoRAM, a memory-efficient LoRA training scheme founded on the intuition that many neurons in over-parameterized LLMs have low training utility but are essential for inference. LoRAM presents a unique twist: it trains on a pruned (small) model to obtain pruned low-rank matrices, which are then recovered and utilized with the original (large) model for inference. Additionally, minimal-cost continual pre-training, performed by the model publishers in advance, aligns the knowledge discrepancy between pruned and original models. Our extensive experiments demonstrate the efficacy of LoRAM across various pruning strategies and downstream tasks. For a model with 70 billion parameters, LoRAM enables training on a GPU with only 20G HBM, replacing an A100-80G GPU for LoRA training and 15 GPUs for full fine-tuning. Specifically, QLoRAM implemented by structured pruning combined with 4-bit quantization, for LLaMA-3.1-70B (LLaMA-2-70B), reduces the parameter storage cost that dominates the memory usage in low-rank matrix training by 15.81$\times$ (16.95$\times$), while achieving dominant performance gains over both the original LLaMA-3.1-70B (LLaMA-2-70B) and LoRA-trained LLaMA-3.1-8B (LLaMA-2-13B). Code is available at https://github.com/junzhang-zj/LoRAM.
大语言模型得益于LoRA的高效微调,但其内存使用受限于原始参数,因此LoRAM提出在剪枝后的小模型上训练,通过恢复矩阵进行推理,以降低内存需求并保持性能。
Large Language Models benefit from LoRA's efficient fine-tuning, but its memory use is constrained by original parameters, so LoRAM introduces training on a pruned model to reduce memory demands while maintaining performance through recovered matrices for inference.

Authors:Wuhan Chen, Zongwei Wang, Min Gao, Xin Xia, Feng Jiang, Junhao Wen
Title: Breaking the Clusters: Uniformity-Optimization for Text-Based Sequential Recommendation
Abstract:
Traditional sequential recommendation (SR) methods heavily rely on explicit item IDs to capture user preferences over time. This reliance introduces critical limitations in cold-start scenarios and domain transfer tasks, where unseen items and new contexts often lack established ID mappings. To overcome these limitations, recent studies have shifted towards leveraging text-only information for recommendation, thereby improving model generalization and adaptability across domains. Although promising, text-based SR faces unique difficulties: items' text descriptions often share semantic similarities that lead to clustered item representations, compromising their uniformity, a property essential for promoting diversity and enhancing generalization in recommendation systems. In this paper, we explore a novel framework to improve the uniformity of item representations in text-based SR. Our analysis reveals that items within a sequence exhibit marked semantic similarity, meaning they are closer in representation than items overall, and that this effect is more pronounced for less popular items, which form tighter clusters compared to their more popular counterparts. Based on these findings, we propose UniT, a framework that employs three pairwise item sampling strategies: Unified General Sampling Strategy, Sequence-Driven Sampling Strategy, and Popularity-Driven Sampling Strategy. Each strategy applies varying degrees of repulsion to selectively adjust the distances between item pairs, thereby refining representation uniformity while considering both sequence context and item popularity. Extensive experiments on multiple real-world datasets demonstrate that our proposed approach outperforms state-of-the-art models, validating the effectiveness of UniT in enhancing both representation uniformity and recommendation accuracy.The source code is available at https://github.com/ccwwhhh/Model-Rec.
中文: 本文提出UniT框架,通过三种成对采样策略改进基于文本的序列推荐中物品表征的均匀性,解决语义聚类问题,从而提升推荐多样性和准确性。
English: This paper introduces UniT, a novel framework that enhances the uniformity of item representations in text-based sequential recommendation by employing three pairwise sampling strategies to address semantic clustering issues, thereby improving both diversity and recommendation accuracy.

Authors:Hyeonjae Gil, Dongjae Lee, Giseop Kim, Ayoung Kim
Title: Ephemerality meets LiDAR-based Lifelong Mapping
Abstract:
Lifelong mapping is crucial for the long-term deployment of robots in dynamic environments. In this paper, we present ELite, an ephemerality-aided LiDAR-based lifelong mapping framework which can seamlessly align multiple session data, remove dynamic objects, and update maps in an end-to-end fashion. Map elements are typically classified as static or dynamic, but cases like parked cars indicate the need for more detailed categories than binary. Central to our approach is the probabilistic modeling of the world into two-stage $\textit{ephemerality}$, which represent the transiency of points in the map within two different time scales. By leveraging the spatiotemporal context encoded in ephemeralities, ELite can accurately infer transient map elements, maintain a reliable up-to-date static map, and improve robustness in aligning the new data in a more fine-grained manner. Extensive real-world experiments on long-term datasets demonstrate the robustness and effectiveness of our system. The source code is publicly available for the robotics community: https://github.com/dongjae0107/ELite.
中文: ELite是一种基于激光雷达的瞬时性辅助终身建图框架,通过两阶段瞬时性概率建模动态分类地图元素,在动态环境中实现可靠的地图更新与数据对齐。
English: ELite is an ephemerality-aided LiDAR-based lifelong mapping framework that dynamically updates maps by classifying elements with probabilistic two-stage ephemerality, enhancing robustness and accuracy in dynamic environments.

Authors:Jialin Ouyang
Title: TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation
Abstract:
Large language models (LLMs) now achieve near-human performance on standard math word problem benchmarks (e.g., GSM8K), yet their true reasoning ability remains disputed. A key concern is that models often produce confident, yet unfounded, answers to unanswerable problems. We introduce TreeCut, a synthetic dataset that systematically generates infinite unanswerable math word problems and their answerable counterparts, by representing each question as a tree and removing chosen necessary conditions. Experiments show TreeCut effectively induce hallucinations in large language models, including GPT-4o and o3-mini, with rates of 64% and 44% in their respective worst-case scenarios under zero-shot setting. Further analysis highlights that deeper or more complex trees, composite item names, and removing necessary condition near the middle of a path all increase the likelihood of hallucinations, underscoring the persistent challenges LLMs face in identifying unanswerable math problems. The dataset generation code and sample data are available at https://github.com/j-bagel/treecut-math.
Chinese Summary: 大型语言模型在面对无解数学题时常会自信地给出错误答案,TreeCut数据集通过系统生成不可解问题,揭示了GPT-4o等模型在零样本条件下最高达64%的幻觉产生率。
English Summary: Large language models frequently provide confident but incorrect answers to unsolvable math problems, as demonstrated by the TreeCut dataset, which reveals hallucination rates up to 64% in models like GPT-4o under specific conditions.

Authors:Ziyuan Liu, Ruifei Zhu, Long Gao, Yuanxiu Zhou, Jingyu Ma, Yuantao Gu
Title: JL1-CD: A New Benchmark for Remote Sensing Change Detection and a Robust Multi-Teacher Knowledge Distillation Framework
Abstract:
Change detection (CD) in remote sensing images plays a vital role in Earth observation. However, the scarcity of high-resolution, comprehensive open-source datasets and the difficulty in achieving robust performance across varying change types remain major challenges. To address these issues, we introduce JL1-CD, a large-scale, sub-meter CD dataset consisting of 5,000 image pairs. We further propose a novel Origin-Partition (O-P) strategy and integrate it into a Multi-Teacher Knowledge Distillation (MTKD) framework to enhance CD performance. The O-P strategy partitions the training set by Change Area Ratio (CAR) and trains specialized teacher models on each subset. The MTKD framework then distills complementary knowledge from these teachers into a single student model, enabling improved detection results across diverse CAR scenarios without additional inference cost. Our MTKD approach demonstrated strong performance in the 2024 ``Jilin-1'' Cup challenge, ranking first in the preliminary and second in the final rounds. Extensive experiments on the JL1-CD and SYSU-CD datasets show that the MTKD framework consistently improves the performance of CD models with various network architectures and parameter sizes, establishing new state-of-the-art results. Code and dataset are available at https://github.com/circleLZY/MTKD-CD.
中文摘要:本文提出了大规模变化检测数据集JL1-CD,并创新性地将源分区策略融入多教师知识蒸馏框架,在不增加推理成本的情况下显著提升了各类变化场景的检测性能。
English Summary: This paper introduces JL1-CD, a large-scale change detection dataset, and proposes a novel Origin-Partition strategy integrated into a Multi-Teacher Knowledge Distillation framework that enhances detection performance across diverse scenarios without increasing inference costs.

Authors:Vishal Dey, Xiao Hu, Xia Ning
Title: GeLLMO: Generalizing Large Language Models for Multi-property Molecule Optimization
Abstract:
Despite recent advancements, most computational methods for molecule optimization are constrained to single- or double-property optimization tasks and suffer from poor scalability and generalizability to novel optimization tasks. Meanwhile, Large Language Models (LLMs) demonstrate remarkable out-of-domain generalizability to novel tasks. To demonstrate LLMs' potential for molecule optimization, we introduce MuMOInstruct, the first high-quality instruction-tuning dataset specifically focused on complex multi-property molecule optimization tasks. Leveraging MuMOInstruct, we develop GeLLMOs, a series of instruction-tuned LLMs for molecule optimization. Extensive evaluations across 5 in-domain and 5 out-of-domain tasks demonstrate that GeLLMOs consistently outperform state-of-the-art baselines. GeLLMOs also exhibit outstanding zero-shot generalization to unseen tasks, significantly outperforming powerful closed-source LLMs. Such strong generalizability demonstrates the tremendous potential of GeLLMOs as foundational models for molecule optimization, thereby tackling novel optimization tasks without resource-intensive retraining. MuMOInstruct, models, and code are accessible through https://github.com/ninglab/GeLLMO.
中文: 本研究推出了首个多属性分子优化指令数据集MuMOInstruct,并开发了GeLLMOs模型,该模型在多项任务中超越现有方法,展现出卓越的零样本泛化能力,为复杂分子优化提供了无需重复训练的高效解决方案。
English: This study introduces MuMOInstruct, a dataset for multi-property molecule optimization, and develops GeLLMOs, an instruction-tuned LLM that outperforms existing methods and demonstrates strong zero-shot generalization to novel tasks, offering a resource-efficient solution for complex optimization challenges.

Authors:Kongcheng Zhang, Qi Yao, Baisheng Lai, Jiaxing Huang, Wenkai Fang, Dacheng Tao, Mingli Song, Shunyu Liu
Title: Reasoning with Reinforced Functional Token Tuning
Abstract:
In this work, we propose Reinforced Functional Token Tuning (RFTT), a novel reinforced fine-tuning framework that empowers Large Language Models (LLMs) with self-play learn-to-reason capabilities. Unlike prior prompt-driven reasoning efforts, RFTT embeds a rich set of learnable functional tokens (e.g., , , ) directly into the model vocabulary, enabling chain-of-thought construction with diverse human-like reasoning behaviors. Specifically, RFTT comprises two phases: (1) supervised fine-tuning performs prompt-driven tree search to obtain self-generated training data annotated with functional tokens, which warms up the model to learn these tokens for reasoning; and (2) online reinforcement learning further allows the model to explore different reasoning pathways through functional token sampling without relying on prompts, thereby facilitating effective self-improvement for functional reasoning. Extensive experiments demonstrate the superiority of the proposed RFTT on mathematical benchmarks, significantly boosting Qwen-2.5-7B-Instruct (70.6% to 79.8%) and LLaMA-3.1-8B-Instruct (32.2% to 60.2%) on the MATH dataset. Moreover, the performance of RFTT consistently improves with more search rollouts at inference time. Our code is available at https://github.com/sastpg/RFTT.
中文: 本文提出强化功能令牌调优(RFTT),这是一种通过将可学习功能令牌嵌入模型词汇表来增强大语言模型自博弈推理能力的强化微调框架,在数学基准测试中实现了显著性能提升。
English: This paper introduces Reinforced Functional Token Tuning (RFTT), a reinforced fine-tuning framework that enhances Large Language Models with self-play reasoning abilities by embedding learnable functional tokens into the model vocabulary, achieving significant performance improvements on mathematical benchmarks.

Authors:Swati Kar, Soumyabrata Dey, Mahesh K Banavar, Shahnewaz Karim Sakib
Title: Fighter Jet Navigation and Combat using Deep Reinforcement Learning with Explainable AI
Abstract:
This paper presents the development of an Artificial Intelligence (AI) based fighter jet agent within a customized Pygame simulation environment, designed to solve multi-objective tasks via deep reinforcement learning (DRL). The jet's primary objectives include efficiently navigating the environment, reaching a target, and selectively engaging or evading an enemy. A reward function balances these goals while optimized hyperparameters enhance learning efficiency. Results show more than 80\% task completion rate, demonstrating effective decision-making. To enhance transparency, the jet's action choices are analyzed by comparing the rewards of the actual chosen action (factual action) with those of alternate actions (counterfactual actions), providing insights into the decision-making rationale. This study illustrates DRL's potential for multi-objective problem-solving with explainable AI. Project page is available at: \href{https://github.com/swatikar95/Autonomous-Fighter-Jet-Navigation-and-Combat}{Project GitHub Link}.
中文: 本研究在Pygame模拟环境中利用深度强化学习开发了AI战斗机代理,通过反事实动作分析增强可解释性,在多目标导航与作战任务中实现了超过80%的完成率。
English: This study develops an AI fighter jet agent using deep reinforcement learning in a Pygame simulation, achieving over 80% task completion in multi-objective navigation and combat while incorporating explainable AI through counterfactual action analysis.

Authors:Yiming Zeng, Wanhao Yu, Zexin Li, Tao Ren, Yu Ma, Jinghan Cao, Xiyan Chen, Tingting Yu
Title: Bridging the Editing Gap in LLMs: FineEdit for Precise and Targeted Text Modifications
Abstract:
Large Language Models (LLMs) have significantly advanced natural language processing, demonstrating strong capabilities in tasks such as text generation, summarization, and reasoning. Recently, their potential for automating precise text editing tasks across specialized domains, such as programming code, LaTeX, and structured database languages, has gained attention. However, current state-of-the-art LLMs still struggle with executing precise, instruction-driven edits, particularly when structural accuracy and strict adherence to domain conventions are required. To address these challenges, we introduce InstrEditBench, an automated benchmark dataset comprising over 30,000 structured editing tasks spanning diverse domains, including Wikipedia articles, LaTeX documents, source code, and database languages. Using this benchmark, we develop FineEdit, a specialized editing model explicitly trained for accurate, context-aware text modifications. Experimental evaluations demonstrate that FineEdit outperforms state-of-the-art models, achieving improvements of approximately 10\% over Gemini models on single-turn edits, up to 30\% over Llama-3.2-3B, and exceeding Mistral-7B-OpenOrca performance by over 40\% on direct editing tasks. FineEdit also effectively generalizes to realistic multi-turn editing scenarios, highlighting its practical applicability. To facilitate further research and reproducibility, we release FineEdit at https://github.com/StuRinDQB/FineEdit} and https://huggingface.co/datasets/YimingZeng/FineEdit_bench.
中文: 大语言模型在精确文本编辑方面存在局限,为此研发的FineEdit模型在多种编辑任务中显著超越现有模型,展现出卓越的性能和实用性。
English: Large Language Models face challenges in precise text editing, leading to the development of FineEdit, a specialized model that significantly outperforms existing models across various domains and editing tasks.

Authors:Shi Yu, Zhiyuan Liu, Chenyan Xiong
Title: Craw4LLM: Efficient Web Crawling for LLM Pretraining
Abstract:
Web crawl is a main source of large language models' (LLMs) pretraining data, but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Craw4LLM, an efficient web crawling method that explores the web graph based on the preference of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler's scheduler, replacing the standard graph connectivity based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Craw4LLM in obtaining high-quality pretraining data. With just 21% URLs crawled, LLMs pretrained on Craw4LLM data reach the same downstream performances of previous crawls, significantly reducing the crawling waste and alleviating the burdens on websites. Our code is publicly available at https://github.com/cxcscmu/Craw4LLM.
Chinese: 本文提出Craw4LLM方法,通过根据网页在LLM预训练中的影响力设定抓取优先级,仅抓取21%的网页即可达到同等模型性能,显著减少了无效抓取。
English: This paper introduces Craw4LLM, an efficient web crawling method that prioritizes webpages based on their influence in LLM pretraining, reducing wasted crawls and achieving comparable model performance with only 21% of URLs crawled.

Authors:Yunpeng Xiao, Youpeng Zhao, Kai Shu
Title: Understanding and Tackling Label Errors in Individual-Level Nature Language Understanding
Abstract:
Natural language understanding (NLU) is a task that enables machines to understand human language. Some tasks, such as stance detection and sentiment analysis, are closely related to individual subjective perspectives, thus termed individual-level NLU. Previously, these tasks are often simplified to text-level NLU tasks, ignoring individual factors. This not only makes inference difficult and unexplainable but often results in a large number of label errors when creating datasets. To address the above limitations, we propose a new NLU annotation guideline based on individual-level factors. Specifically, we incorporate other posts by the same individual and then annotate individual subjective perspectives after considering all individual posts. We use this guideline to expand and re-annotate the stance detection and topic-based sentiment analysis datasets. We find that error rates in the samples were as high as 31.7\% and 23.3\%. We further use large language models to conduct experiments on the re-annotation datasets and find that the large language models perform well on both datasets after adding individual factors. Both GPT-4o and Llama3-70B can achieve an accuracy greater than 87\% on the re-annotation datasets. We also verify the effectiveness of individual factors through ablation studies. We call on future researchers to add individual factors when creating such datasets. Our re-annotation dataset can be found at https://github.com/24yearsoldstudent/Individual-NLU
中文: 本研究针对立场检测和主题情感分析等自然语言理解任务,提出了基于个体层面的标注新方法,发现原数据集的标注错误率高达31.7%和23.3%,并通过实验证明引入个体因素能显著提升大语言模型的准确率至87%以上。
English: This study introduces individual-level annotation guidelines for natural language understanding tasks like stance detection and sentiment analysis, revealing high error rates in existing datasets and demonstrating that incorporating individual factors significantly improves model performance.

Authors:Sangwoong Yoon, Himchan Hwang, Hyeokju Jeong, Dong Kyu Shin, Che-Sang Park, Sehee Kweon, Frank Chongwoo Park
Title: Value Gradient Sampler: Sampling as Sequential Decision Making
Abstract:
We propose the Value Gradient Sampler (VGS), a trainable sampler based on the interpretation of sampling as discrete-time sequential decision-making. VGS generates samples from a given unnormalized density (i.e., energy) by drifting and diffusing randomly initialized particles. In VGS, finding the optimal drift is equivalent to solving an optimal control problem where the cost is the upper bound of the KL divergence between the target density and the samples. We employ value-based dynamic programming to solve this optimal control problem, which gives the gradient of the value function as the optimal drift vector. The connection to sequential decision making allows VGS to leverage extensively studied techniques in reinforcement learning, making VGS a fast, adaptive, and accurate sampler that achieves competitive results in various sampling benchmarks. Furthermore, VGS can replace MCMC in contrastive divergence training of energy-based models. We demonstrate the effectiveness of VGS in training accurate energy-based models in industrial anomaly detection applications.
Chinese: 价值梯度采样器(VGS)是一种可训练的采样方法,将采样视为序列决策过程,通过基于价值的动态规划优化粒子漂移,实现高效精确的采样,在基准测试中表现优异,并能有效训练基于能量的模型,适用于工业异常检测等场景。
English: The Value Gradient Sampler (VGS) is a trainable sampling method that frames sampling as sequential decision-making, using value-based dynamic programming to optimize particle drift for efficient and accurate sampling, achieving competitive results in benchmarks and enabling effective training of energy-based models in applications like anomaly detection.

Authors:Aldo Glielmo, Mitja Devetak, Adriano Meligrana, Sebastian Poledna
Title: BeforeIT.jl: High-Performance Agent-Based Macroeconomics Made Easy
Abstract:
BeforeIT is an open-source software for building and simulating state-of-the-art macroeconomic agent-based models (macro ABMs) based on the recently introduced macro ABM developed in [1] and here referred to as the base model. Written in Julia, it combines extraordinary computational efficiency with user-friendliness and extensibility. We present the main structure of the software, demonstrate its ease of use with illustrative examples, and benchmark its performance. Our benchmarks show that the base model built with BeforeIT is orders of magnitude faster than a Matlab version, and significantly faster than Matlab-generated C code. BeforeIT is designed to facilitate reproducibility, extensibility, and experimentation. As the first open-source, industry-grade software to build macro ABMs of the type of the base model, BeforeIT can significantly foster collaboration and innovation in the field of agent-based macroeconomic modelling. The package, along with its documentation, is freely available at https://github.com/bancaditalia/BeforeIT.jl under the AGPL-3.0.
Chinese: BeforeIT 是一款基于 Julia 的开源软件,用于构建和模拟高性能的宏观经济代理模型,具有卓越的计算效率、易用性和可扩展性,旨在推动该领域的协作与创新。
English: BeforeIT is an open-source Julia-based software for building and simulating high-performance macroeconomic agent-based models, offering exceptional computational speed, user-friendliness, and extensibility to advance collaboration in the field.

Authors:Jake C. Snell, Thomas L. Griffiths
Title: Conformal Prediction as Bayesian Quadrature
Abstract:
As machine learning-based prediction systems are increasingly used in high-stakes situations, it is important to understand how such predictive models will perform upon deployment. Distribution-free uncertainty quantification techniques such as conformal prediction provide guarantees about the loss black-box models will incur even when the details of the models are hidden. However, such methods are based on frequentist probability, which unduly limits their applicability. We revisit the central aspects of conformal prediction from a Bayesian perspective and thereby illuminate the shortcomings of frequentist guarantees. We propose a practical alternative based on Bayesian quadrature that provides interpretable guarantees and offers a richer representation of the likely range of losses to be observed at test time.
中文: 该摘要提出了一种贝叶斯替代方法,取代了频率主义的共形预测技术,为高风险应用中的机器学习模型提供了可解释的保证和更丰富的潜在损失范围描述。
English: This abstract proposes a Bayesian alternative to frequentist conformal prediction methods, offering interpretable guarantees and a richer representation of potential losses for machine learning models in high-stakes applications.

Authors:Junyi Guan, Abhijith Sharma, Chong Tian, Salem Lahlou
Title: On the Privacy Risks of Spiking Neural Networks: A Membership Inference Analysis
Abstract:
Spiking Neural Networks (SNNs) are increasingly explored for their energy efficiency and robustness in real-world applications, yet their privacy risks remain largely unexamined. In this work, we investigate the susceptibility of SNNs to Membership Inference Attacks (MIAs) -- a major privacy threat where an adversary attempts to determine whether a given sample was part of the training dataset. While prior work suggests that SNNs may offer inherent robustness due to their discrete, event-driven nature, we find that its resilience diminishes as latency (T) increases. Furthermore, we introduce an input dropout strategy under black box setting, that significantly enhances membership inference in SNNs. Our findings challenge the assumption that SNNs are inherently more secure, and even though they are expected to be better, our results reveal that SNNs exhibit privacy vulnerabilities that are equally comparable to Artificial Neural Networks (ANNs). Our code is available at https://github.com/sharmaabhijith/MIA_SNN.
中文: 脉冲神经网络(SNN)易受成员推理攻击,其隐私风险随延迟增加而加剧,与人工神经网络(ANN)相当,尽管人们曾假设其具有固有的鲁棒性。
English: Spiking Neural Networks (SNNs) are vulnerable to membership inference attacks, with privacy risks increasing with latency and comparable to those of Artificial Neural Networks (ANNs), despite assumptions of inherent robustness.

Authors:Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu
Title: MoBA: Mixture of Block Attention for Long-Context LLMs
Abstract:
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored. In this work, we propose a solution that adheres to the ``less structure'' principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.
中文摘要:提出的混合块注意力(MoBA)使大型语言模型能够自主决定注意力模式,在长上下文任务中表现优异,同时实现完整与稀疏注意力机制的高效切换。
English Summary: The proposed Mixture of Block Attention (MoBA) enables large language models to autonomously determine attention patterns, achieving superior performance on long-context tasks while efficiently switching between full and sparse attention mechanisms.

Authors:Jiaqi Zhao, Miao Zhang, Ming Wang, Yuzhang Shang, Kaihao Zhang, Weili Guan, Yaowei Wang, Min Zhang
Title: PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models
Abstract:
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. Several existing sub 2-bit post-training quantization (PTQ) methods utilize a mix-precision scheme by leveraging an unstructured fine-grained mask to explicitly distinguish salient weights, while which introduces an extra 1-bit or more per weight. To explore the real limit of PTQ, we propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Specifically, we first introduce a one-dimensional structured mask with negligibly additional 0.0002-bit per weight based on input activations from the perspective of reducing the upper bound of quantization error to allocate corresponding salient weight channels to 4-bit. For non-salient channels binarization, an efficient block-wise scaling factors optimization framework is then presented to take implicit row-wise correlations and angular biases into account. Different from prior works that concentrate on adjusting quantization methodologies, we further propose a novel paradigm called quantization preprocessing, where we argue that transforming the weight distribution of the pretrained model before quantization can alleviate the difficulty in per-channel extremely low-bit PTQ. Extensive experiments indicate our PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization. Codes are available at https://github.com/zjq0455/PTQ1.61.
Chinese: 大语言模型在超低位(低于2位)量化时性能严重下降,而提出的PTQ1.61方法通过结构化掩码和量化预处理优化权重分布,首次实现了1.61位量化的最先进性能。
English: Large Language Models experience significant performance loss with sub 2-bit quantization, but the proposed PTQ1.61 method achieves state-of-the-art 1.61-bit quantization by introducing structured masking and quantization preprocessing to optimize weight distribution.

Authors:Jiaqi Zhao, Ming Wang, Miao Zhang, Yuzhang Shang, Xuebo Liu, Yaowei Wang, Min Zhang, Liqiang Nie
Title: Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis
Abstract:
Post-training Quantization (PTQ) technique has been extensively adopted for large language models (LLMs) compression owing to its efficiency and low resource requirement. However, current research lacks a in-depth analysis of the superior and applicable scenarios of each PTQ strategy. In addition, existing algorithms focus primarily on performance, overlooking the trade-off among model size, performance, and quantization bitwidth. To mitigate these confusions, we provide a novel benchmark for LLMs PTQ in this paper. Firstly, in order to support our benchmark, we propose a comprehensive taxonomy for existing mainstream methods by scrutinizing their computational strategies (e.g., optimization-based, compensation-based, etc.). Then, we conduct extensive experiments with the baseline within each class, covering models with various sizes (7B-70B), bitwidths, training levels (LLaMA1/2/3/3.1), architectures (Mixtral, DeepSeekMoE and Mamba) and modality (LLaVA1.5 and VILA1.5) on a wide range of evaluation metrics.Through comparative analysis on the results, we summarize the superior of each PTQ strategy and modelsize-bitwidth trade-off considering the performance. For example, our benchmark reveals that compensation-based technique demonstrates outstanding cross-architecture robustness and extremely low-bit PTQ for ultra large models should be reexamined. Finally, we further accordingly claim that a practical combination of compensation and other PTQ strategy can achieve SOTA various robustness. We believe that our benchmark will provide valuable recommendations for the deployment of LLMs and future research on PTQ approaches.We conduct an repository for our benchmark at https://github.com/zjq0455/PTQ_Benchmark.
中文: 本文为大型语言模型的后训练量化提出了一套全面的基准测试,通过分析不同模型规模、架构和比特宽度的量化策略,揭示了补偿类方法的优越跨架构鲁棒性,并指出了超低比特量化的局限性。
English: This paper introduces a comprehensive benchmark for post-training quantization (PTQ) of large language models, analyzing various strategies across different model sizes, architectures, and bitwidths to identify optimal approaches and trade-offs, while highlighting the robustness of compensation-based techniques.

Authors:Yuze Zhao, Tianyun Ji, Wenjun Feng, Zhenya Huang, Qi Liu, Zhiding Liu, Yixiao Ma, Kai Zhang, Enhong Chen
Title: Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment
Abstract:
The reasoning abilities are one of the most enigmatic and captivating aspects of large language models (LLMs). Numerous studies are dedicated to exploring and expanding the boundaries of this reasoning capability. However, tasks that embody both reasoning and recall characteristics are often overlooked. In this paper, we introduce such a novel task, code reasoning, to provide a new perspective for the reasoning abilities of LLMs. We summarize three meta-benchmarks based on established forms of logical reasoning, and instantiate these into eight specific benchmark tasks. Our testing on these benchmarks reveals that LLMs continue to struggle with identifying satisfactory reasoning pathways. Additionally, we present a new pathway exploration pipeline inspired by human intricate problem-solving methods. This Reflective Hypothesis Decomposition and Amendment (RHDA) pipeline consists of the following iterative steps: (1) Proposing potential hypotheses based on observations and decomposing them; (2) Utilizing tools to validate hypotheses and reflection outcomes; (3) Revising hypothesis in light of observations. Our approach effectively mitigates logical chain collapses arising from forgetting or hallucination issues in multi-step reasoning, resulting in performance gains of up to $3\times$. Finally, we expanded this pipeline by applying it to simulate complex household tasks in real-world scenarios, specifically in VirtualHome, enhancing the handling of failure cases. We release our code and all of results at https://github.com/TnTWoW/code_reasoning.
中文摘要:本文提出了一种新的代码推理任务来评估大型语言模型的推理能力,并设计了一种反思性假设分解与修正流程,通过减少多步推理中的逻辑链断裂问题显著提升了性能。
English Summary: This paper introduces a novel code reasoning task to assess LLMs' reasoning abilities, proposing a Reflective Hypothesis Decomposition and Amendment pipeline that significantly improves performance by mitigating logical chain collapses in multi-step reasoning.

Authors:Kun Hu, Qicai Chen, Zilong Lu, Wenzhuo Zhang, Bihuan Chen, You Lu, Haowen Jiang, Bingkun Sun, Xin Peng, Wenyun Zhao
Title: A Survey of Fuzzing Open-Source Operating Systems
Abstract:
Vulnerabilities in open-source operating systems (OSs) pose substantial security risks to software systems, making their detection crucial. While fuzzing has been an effective vulnerability detection technique in various domains, OS fuzzing (OSF) faces unique challenges due to OS complexity and multi-layered interaction, and has not been comprehensively reviewed. Therefore, this work systematically surveys the state-of-the-art OSF techniques, categorizes them based on the general fuzzing process, and investigates challenges specific to kernel, file system, driver, and hypervisor fuzzing. Finally, future research directions for OSF are discussed. GitHub: https://github.com/pghk13/Survey-OSF.
中文: 本文系统综述了最新的操作系统模糊测试技术,按流程分类并探讨了内核、文件系统、驱动程序和虚拟机监控程序等组件的特定挑战,同时展望了未来研究方向。
English: This paper systematically reviews state-of-the-art OS fuzzing techniques, categorizing them by process and examining specific challenges across kernel, file system, driver, and hypervisor components while outlining future research directions.

Authors:Shuo Xing, Peiran Li, Yuping Wang, Ruizheng Bai, Yueqi Wang, Chan-Wei Hu, Chengxuan Qian, Huaxiu Yao, Zhengzhong Tu
Title: Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization
Abstract:
The emergence of large Vision Language Models (VLMs) has broadened the scope and capabilities of single-modal Large Language Models (LLMs) by integrating visual modalities, thereby unlocking transformative cross-modal applications in a variety of real-world scenarios. Despite their impressive performance, VLMs are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies. Building on the success of Reinforcement Learning from Human Feedback (RLHF) in aligning LLMs, recent advancements have focused on applying direct preference optimization (DPO) on carefully curated datasets to mitigate these issues. Yet, such approaches typically introduce preference signals in a brute-force manner, neglecting the crucial role of visual information in the alignment process. In this paper, we introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset, effectively incorporating both textual and visual preference signals. We further introduce rDPO, an extension of the standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning. Our experimental results demonstrate that Re-Align not only mitigates hallucinations more effectively than previous methods but also yields significant performance gains in general visual question-answering (VQA) tasks. Moreover, we show that Re-Align maintains robustness and scalability across a wide range of VLM sizes and architectures. This work represents a significant step forward in aligning multimodal LLMs, paving the way for more reliable and effective cross-modal applications. We release all the code in https://github.com/taco-group/Re-Align.
中文: Re-Align框架通过图像检索和扩展的rDPO方法,有效减少视觉语言模型中的跨模态幻觉问题,并显著提升视觉问答任务的性能。
English: The Re-Align framework introduces image retrieval and an extended rDPO method to effectively reduce cross-modal hallucinations in Vision Language Models while enhancing performance in visual question-answering tasks.

Authors:Bencheng Liao, Hongyuan Tao, Qian Zhang, Tianheng Cheng, Yingyue Li, Haoran Yin, Wenyu Liu, Xinggang Wang
Title: Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
Abstract:
Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLM or vision encoders. We propose an seeding strategy to carve Mamba from trained Transformer and a three-stage distillation recipe, which can effectively transfer the knowledge from Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs. Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear and quadratic-complexity VLMs, while mmMamba-hybrid further improves performance significantly, approaching HoVLE's capabilities. At 103K tokens, mmMamba-linear demonstrates 20.6$\times$ speedup and 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves 13.5$\times$ speedup and 60.2% memory savings. Code and models are released at https://github.com/hustvl/mmMamba
中文:mmMamba框架通过渐进式蒸馏将多模态大语言模型转换为线性复杂度架构,在保持竞争力的性能的同时实现了显著的速度提升和内存节省。
English: The mmMamba framework enables efficient conversion of multimodal large language models to linear-complexity architectures through progressive distillation, achieving significant speed improvements and memory savings while maintaining competitive performance.

Authors:Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
Title: Rethinking Diverse Human Preference Learning through Principal Component Analysis
Abstract:
Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment. Our code is available at https://github.com/amandaluof/DRMs.
中文: 本文提出分解奖励模型(DRMs),通过向量表征和主成分分析从二元比较中提取多样化人类偏好,为个性化大语言模型对齐提供可解释且可扩展的替代方案。
English: This paper introduces Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons using vector representations and PCA, providing an interpretable and scalable alternative to traditional reward models for personalized LLM alignment.

Authors:Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Title: SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
Abstract:
Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, leading to cumbersome training and inference pipelines, as well as suboptimal overall generation quality due to error accumulation across stages. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The code is available at https://github.com/LiuZH-19/SongGen.
中文摘要:SongGen是一个开源的单阶段自回归变换器模型,能够通过细粒度音乐属性控制和可选语音克隆实现可控的文本到歌曲生成,并支持混合与双轨两种输出模式。
English Summary: SongGen is an open-source, single-stage autoregressive transformer model that enables controllable text-to-song generation with fine-grained musical attribute control and optional voice cloning, supporting both mixed and dual-track output modes.

Authors:Ekin Celikkan, Timo Kunzmann, Yertay Yeskaliyev, Sibylle Itzerott, Nadja Klein, Martin Herold
Title: WeedsGalore: A Multispectral and Multitemporal UAV-based Dataset for Crop and Weed Segmentation in Agricultural Maize Fields
Abstract:
Weeds are one of the major reasons for crop yield loss but current weeding practices fail to manage weeds in an efficient and targeted manner. Effective weed management is especially important for crops with high worldwide production such as maize, to maximize crop yield for meeting increasing global demands. Advances in near-sensing and computer vision enable the development of new tools for weed management. Specifically, state-of-the-art segmentation models, coupled with novel sensing technologies, can facilitate timely and accurate weeding and monitoring systems. However, learning-based approaches require annotated data and show a lack of generalization to aerial imaging for different crops. We present a novel dataset for semantic and instance segmentation of crops and weeds in agricultural maize fields. The multispectral UAV-based dataset contains images with RGB, red-edge, and near-infrared bands, a large number of plant instances, dense annotations for maize and four weed classes, and is multitemporal. We provide extensive baseline results for both tasks, including probabilistic methods to quantify prediction uncertainty, improve model calibration, and demonstrate the approach's applicability to out-of-distribution data. The results show the effectiveness of the two additional bands compared to RGB only, and better performance in our target domain than models trained on existing datasets. We hope our dataset advances research on methods and operational systems for fine-grained weed identification, enhancing the robustness and applicability of UAV-based weed management. The dataset and code are available at https://github.com/GFZ/weedsgalore
中文: 本研究提出了一种新型多光谱无人机数据集,用于玉米田间作物和杂草的语义与实例分割,通过额外光谱波段和优化模型校准显著提升了杂草管理的准确性和适用性。
English: This study introduces a novel multispectral UAV dataset for semantic and instance segmentation of crops and weeds in maize fields, demonstrating enhanced performance with additional spectral bands and improved model calibration for robust weed management.

Authors:Yuxiang Wei, Yiheng Zheng, Yabo Zhang, Ming Liu, Zhilong Ji, Lei Zhang, Wangmeng Zuo
Title: Personalized Image Generation with Deep Generative Models: A Decade Survey
Abstract:
Recent advancements in generative models have significantly facilitated the development of personalized content creation. Given a small set of images with user-specific concept, personalized image generation allows to create images that incorporate the specified concept and adhere to provided text descriptions. Due to its wide applications in content creation, significant effort has been devoted to this field in recent years. Nonetheless, the technologies used for personalization have evolved alongside the development of generative models, with their distinct and interrelated components. In this survey, we present a comprehensive review of generalized personalized image generation across various generative models, including traditional GANs, contemporary text-to-image diffusion models, and emerging multi-model autoregressive models. We first define a unified framework that standardizes the personalization process across different generative models, encompassing three key components, i.e., inversion spaces, inversion methods, and personalization schemes. This unified framework offers a structured approach to dissecting and comparing personalization techniques across different generative architectures. Building upon this unified framework, we further provide an in-depth analysis of personalization techniques within each generative model, highlighting their unique contributions and innovations. Through comparative analysis, this survey elucidates the current landscape of personalized image generation, identifying commonalities and distinguishing features among existing methods. Finally, we discuss the open challenges in the field and propose potential directions for future research. We keep tracing related works at https://github.com/csyxwei/Awesome-Personalized-Image-Generation.
中文摘要:本综述系统梳理了各类生成模型中的个性化图像生成技术,通过建立统一框架分析核心组件并进行方法对比,同时指出了未来研究方向。
English Summary: This survey comprehensively reviews personalized image generation techniques across various generative models, establishing a unified framework to analyze their key components and comparing methods while identifying future research directions.

Authors:Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne
Title: Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection
Abstract:
Hateful memes have become a significant concern on the Internet, necessitating robust automated detection systems. While Large Multimodal Models (LMMs) have shown promise in hateful meme detection, they face notable challenges like sub-optimal performance and limited out-of-domain generalization capabilities. Recent studies further reveal the limitations of both supervised fine-tuning (SFT) and in-context learning when applied to LMMs in this setting. To address these issues, we propose a robust adaptation framework for hateful meme detection that enhances in-domain accuracy and cross-domain generalization while preserving the general vision-language capabilities of LMMs. Analysis reveals that our approach achieves improved robustness under adversarial attacks compared to SFT models. Experiments on six meme classification datasets show that our approach achieves state-of-the-art performance, outperforming larger agentic systems. Moreover, our method generates higher-quality rationales for explaining hateful content compared to standard SFT, enhancing model interpretability. Code available at https://github.com/JingbiaoMei/RGCL
Chinese: 本文提出了一种鲁棒适应框架,通过提升领域内准确性和跨领域泛化能力来增强仇恨表情包检测,同时保留大型多模态模型的视觉语言能力,实现了最先进的性能并通过高质量解释提升了模型可解释性。
English: This paper introduces a robust adaptation framework that enhances hateful meme detection by improving in-domain accuracy and cross-domain generalization while preserving LMMs' vision-language capabilities, achieving state-of-the-art performance and superior interpretability through high-quality rationales.

Authors:Bosi Wen, Pei Ke, Yufei Sun, Cunxiang Wang, Xiaotao Gu, Jinfeng Zhou, Jie Tang, Hongning Wang, Minlie Huang
Title: HPSS: Heuristic Prompting Strategy Search for LLM Evaluators
Abstract:
Since the adoption of large language models (LLMs) for text evaluation has become increasingly prevalent in the field of natural language processing (NLP), a series of existing works attempt to optimize the prompts for LLM evaluators to improve their alignment with human judgment. However, their efforts are limited to optimizing individual factors of evaluation prompts, such as evaluation criteria or output formats, neglecting the combinatorial impact of multiple factors, which leads to insufficient optimization of the evaluation pipeline. Nevertheless, identifying well-behaved prompting strategies for adjusting multiple factors requires extensive enumeration. To this end, we comprehensively integrate 8 key factors for evaluation prompts and propose a novel automatic prompting strategy optimization method called Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm, HPSS conducts an iterative search to find well-behaved prompting strategies for LLM evaluators. A heuristic function is employed to guide the search process, enhancing the performance of our algorithm. Extensive experiments across four evaluation tasks demonstrate the effectiveness of HPSS, consistently outperforming both human-designed evaluation prompts and existing automatic prompt optimization methods. Our code is available at https://github.com/thu-coai/HPSS.
中文: 研究者提出HPSS方法,通过整合多个提示因素自动优化大语言模型评估策略,有效提升其与人类判断的一致性,在多项评估任务中表现优于现有方法。
English: Researchers propose HPSS, an automatic prompting strategy optimization method that integrates multiple factors to enhance LLM evaluators' alignment with human judgment, outperforming existing approaches across various tasks.

Authors:Avinash Kori, Antonio Rago, Francesca Toni
Title: Free Argumentative Exchanges for Explaining Image Classifiers
Abstract:
Deep learning models are powerful image classifiers but their opacity hinders their trustworthiness. Explanation methods for capturing the reasoning process within these classifiers faithfully and in a clear manner are scarce, due to their sheer complexity and size. We provide a solution for this problem by defining a novel method for explaining the outputs of image classifiers with debates between two agents, each arguing for a particular class. We obtain these debates as concrete instances of Free Argumentative eXchanges (FAXs), a novel argumentation-based multi-agent framework allowing agents to internalise opinions by other agents differently than originally stated. We define two metrics (consensus and persuasion rate) to assess the usefulness of FAXs as argumentative explanations for image classifiers. We then conduct a number of empirical experiments showing that FAXs perform well along these metrics as well as being more faithful to the image classifiers than conventional, non-argumentative explanation methods. All our implementations can be found at https://github.com/koriavinash1/FAX.
中文摘要:本文提出了一种基于辩论的新型方法,通过两个代理之间的辩论来解释图像分类器的输出,实验证明该方法比传统非辩论式解释方法更忠实、更清晰。
English Summary: This paper introduces a novel argumentation-based method using debates between agents to explain image classifier outputs, demonstrating through experiments that it provides more faithful and clear explanations than traditional non-argumentative approaches.

Authors:Rema Daher, Francisco Vasconcelos, Danail Stoyanov
Title: SHADeS: Self-supervised Monocular Depth Estimation Through Non-Lambertian Image Decomposition
Abstract:
Purpose: Visual 3D scene reconstruction can support colonoscopy navigation. It can help in recognising which portions of the colon have been visualised and characterising the size and shape of polyps. This is still a very challenging problem due to complex illumination variations, including abundant specular reflections. We investigate how to effectively decouple light and depth in this problem. Methods: We introduce a self-supervised model that simultaneously characterises the shape and lighting of the visualised colonoscopy scene. Our model estimates shading, albedo, depth, and specularities (SHADeS) from single images. Unlike previous approaches (IID), we use a non-Lambertian model that treats specular reflections as a separate light component. The implementation of our method is available at https://github.com/RemaDaher/SHADeS. Results: We demonstrate on real colonoscopy images (Hyper Kvasir) that previous models for light decomposition (IID) and depth estimation (MonoVIT, ModoDepth2) are negatively affected by specularities. In contrast, SHADeS can simultaneously produce light decomposition and depth maps that are robust to specular regions. We also perform a quantitative comparison on phantom data (C3VD) where we further demonstrate the robustness of our model. Conclusion: Modelling specular reflections improves depth estimation in colonoscopy. We propose an effective self-supervised approach that uses this insight to jointly estimate light decomposition and depth. Light decomposition has the potential to help with other problems, such as place recognition within the colon.
中文摘要:本研究提出的SHADeS自监督模型通过单独处理镜面反射,有效提升了结肠镜三维重建中的深度估计和光照分解的鲁棒性。
English Summary: The study introduces SHADeS, a self-supervised model that improves colonoscopy 3D reconstruction by separately modeling specular reflections to enhance depth estimation and light decomposition robustness.

Authors:Nicolas Talabot, Olivier Clerc, Arda Cinar Demirtas, Hieu Le, Doruk Oner, Pascal Fua
Title: PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization
Abstract:
Accurate 3D shape representation is essential in engineering applications such as design, optimization, and simulation. In practice, engineering workflows require structured, part-based representations, as objects are inherently designed as assemblies of distinct components. However, most existing methods either model shapes holistically or decompose them without predefined part structures, limiting their applicability in real-world design tasks. We propose PartSDF, a supervised implicit representation framework that explicitly models composite shapes with independent, controllable parts while maintaining shape consistency. Thanks to its simple but innovative architecture, PartSDF outperforms both supervised and unsupervised baselines in reconstruction and generation tasks. We further demonstrate its effectiveness as a structured shape prior for engineering applications, enabling precise control over individual components while preserving overall coherence. Code available at https://github.com/cvlab-epfl/PartSDF.
中文: PartSDF是一种监督式隐式表示框架,通过独立可控部件建模复合形状,在重建和生成任务中优于现有方法,并为工程应用提供精确的组件控制能力。
English: PartSDF is a supervised implicit representation framework that models composite shapes with independent, controllable parts, outperforming existing methods in reconstruction and generation tasks while enabling precise component control for engineering applications.

Authors:Nicolas Talabot, Olivier Clerc, Arda Cinar Demirtas, Doruk Oner, Pascal Fua
Title: PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization
Abstract:
Accurate 3D shape representation is essential in engineering applications such as design, optimization, and simulation. In practice, engineering workflows require structured, part-aware representations, as objects are inherently designed as assemblies of distinct components. However, most existing methods either model shapes holistically or decompose them without predefined part structures, limiting their applicability in real-world design tasks. We propose PartSDF, a supervised implicit representation framework that explicitly models composite shapes with independent, controllable parts while maintaining shape consistency. Despite its simple single-decoder architecture, PartSDF outperforms both supervised and unsupervised baselines in reconstruction and generation tasks. We further demonstrate its effectiveness as a structured shape prior for engineering applications, enabling precise control over individual components while preserving overall coherence. Code available at https://github.com/cvlab-epfl/PartSDF.
中文: PartSDF是一种监督式隐式表示框架,通过独立可控部件建模复合形状,在重建和生成任务中优于现有方法,并为工程应用提供精确的组件控制能力。
English: PartSDF is a supervised implicit representation framework that models composite shapes with independent, controllable parts, outperforming existing methods in reconstruction and generation tasks while enabling precise component control for engineering applications.

Authors:Steffen Schneider, Rodrigo González Laiz, Anastasiia Filippova, Markus Frey, Mackenzie Weygandt Mathis
Title: Time-series attribution maps with regularized contrastive learning
Abstract:
Gradient-based attribution methods aim to explain decisions of deep learning models but so far lack identifiability guarantees. Here, we propose a method to generate attribution maps with identifiability guarantees by developing a regularized contrastive learning algorithm trained on time-series data plus a new attribution method called Inverted Neuron Gradient (collectively named xCEBRA). We show theoretically that xCEBRA has favorable properties for identifying the Jacobian matrix of the data generating process. Empirically, we demonstrate robust approximation of zero vs. non-zero entries in the ground-truth attribution map on synthetic datasets, and significant improvements across previous attribution methods based on feature ablation, Shapley values, and other gradient-based methods. Our work constitutes a first example of identifiable inference of time-series attribution maps and opens avenues to a better understanding of time-series data, such as for neural dynamics and decision-processes within neural networks.
中文:提出的xCEBRA方法通过正则化对比学习和反向神经元梯度,为深度学习模型提供可识别的归因图,在时间序列数据分析中展现出理论保证和优于现有方法的实证效果。
English: The proposed xCEBRA method provides identifiable attribution maps for deep learning models through regularized contrastive learning and inverted neuron gradients, demonstrating theoretical guarantees and empirical improvements over existing methods in time-series data analysis.

Authors:Yifan Ji, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Shi Yu, Yishan Li, Zhiyuan Liu, Yu Gu, Ge Yu, Maosong Sun
Title: Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search
Abstract:
Recent dense retrievers usually thrive on the emergency capabilities of Large Language Models (LLMs), using them to encode queries and documents into an embedding space for retrieval. These LLM-based dense retrievers have shown promising performance across various retrieval scenarios. However, relying on a single embedding to represent documents proves less effective in capturing different perspectives of documents for matching. In this paper, we propose Deliberate Thinking based Dense Retriever (DEBATER), which enhances these LLM-based retrievers by enabling them to learn more effective document representations through a step-by-step thinking process. DEBATER introduces the Chain-of-Deliberation mechanism to iteratively optimize document representations using a continuous chain of thought. To consolidate information from various thinking steps, DEBATER also incorporates the Self Distillation mechanism, which identifies the most informative thinking steps and integrates them into a unified text embedding. Experimental results show that DEBATER significantly outperforms existing methods across several retrieval benchmarks, demonstrating superior accuracy and robustness. All codes are available at https://github.com/OpenBMB/DEBATER.
Chinese Summary: 本文提出Debater方法,通过链式思考机制迭代优化文档嵌入表示,并结合自蒸馏技术融合关键思考步骤,在多个检索基准上显著超越了现有方法的性能表现。
English Summary: The paper introduces Debater, a dense retriever that enhances document representation through iterative Chain-of-Deliberation and Self Distillation mechanisms, significantly outperforming existing methods across multiple benchmarks.

Authors:Yifan Ji, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Shi Yu, Yishan Li, Zhiyuan Liu, Yu Gu, Ge Yu, Maosong Sun
Title: Learning Refined Document Representations for Dense Retrieval via Deliberate Thinking
Abstract:
Recent dense retrievers increasingly leverage the robust text understanding capabilities of Large Language Models (LLMs), encoding queries and documents into a shared embedding space for effective retrieval. However, most existing methods represent each document with a single embedding, which is less effective at capturing its multifaceted semantics and thereby limits matching accuracy. In this paper, we propose Deliberate Thinking based Dense Retriever (Debater), a novel approach that enhances document representations by incorporating a step-by-step thinking process. Debater introduces a Chain-of-Deliberation mechanism, which iteratively refines document embeddings through a continuous chain-of-thought. To integrate information from various thinking steps, Debater further employs a Self Distillation mechanism that identifies and fuses the most informative steps into a unified embedding. Experimental results show that Debater significantly outperforms existing methods across several retrieval benchmarks, demonstrating superior accuracy and robustness. All codes and datasets are available at https://github.com/OpenBMB/DEBATER.
Chinese Summary: 本文提出Debater方法,通过链式思考机制迭代优化文档嵌入表示,并结合自蒸馏技术融合关键思考步骤,在多个检索基准上显著超越了现有方法的性能表现。
English Summary: The paper introduces Debater, a dense retriever that enhances document representation through iterative Chain-of-Deliberation and Self Distillation mechanisms, significantly outperforming existing methods across multiple benchmarks.

Authors:Adriana Valentina Costache, Silviu Florin Gheorghe, Eduard Gabriel Poesina, Paul Irofti, Radu Tudor Ionescu
Title: A Survey of Text Classification Under Class Distribution Shift
Abstract:
The basic underlying assumption of machine learning (ML) models is that the training and test data are sampled from the same distribution. However, in daily practice, this assumption is often broken, i.e.~the distribution of the test data changes over time, which hinders the application of conventional ML models. One domain where the distribution shift naturally occurs is text classification, since people always find new topics to discuss. To this end, we survey research articles studying open-set text classification and related tasks. We divide the methods in this area based on the constraints that define the kind of distribution shift and the corresponding problem formulation, i.e.~learning with the Universum, zero-shot learning, and open-set learning. We next discuss the predominant mitigation approaches for each problem setup. Finally, we identify several future work directions, aiming to push the boundaries beyond the state of the art. Interestingly, we find that continual learning can solve many of the issues caused by the shifting class distribution. We maintain a list of relevant papers at https://github.com/Eduard6421/Open-Set-Survey.
中文: 本综述探讨了针对测试数据分布变化而设计的开放集文本分类方法,按问题约束和应对策略对方法进行分类,强调持续学习作为关键解决方案,并指出了未来的研究方向。
English: This survey examines open-set text classification methods that address distribution shifts in test data, categorizing approaches by problem constraints and mitigation strategies while highlighting continual learning as a key solution and identifying future research directions.

Authors:Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, Yonatan Belinkov
Title: Trust Me, I'm Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer
Abstract:
Prior work on large language model (LLM) hallucinations has associated them with model uncertainty or inaccurate knowledge. In this work, we define and investigate a distinct type of hallucination, where a model can consistently answer a question correctly, but a seemingly trivial perturbation, which can happen in real-world settings, causes it to produce a hallucinated response with high certainty. This phenomenon, which we dub CHOKE (Certain Hallucinations Overriding Known Evidence), is particularly concerning in high-stakes domains such as medicine or law, where model certainty is often used as a proxy for reliability. We show that CHOKE examples are consistent across prompts, occur in different models and datasets, and are fundamentally distinct from other hallucinations. This difference leads existing mitigation methods to perform worse on CHOKE examples than on general hallucinations. Finally, we introduce a probing-based mitigation that outperforms existing methods on CHOKE hallucinations. These findings reveal an overlooked aspect of hallucinations, emphasizing the need to understand their origins and improve mitigation strategies to enhance LLM safety. The code is available at https://github.com/technion-cs-nlp/Trust_me_Im_wrong .
中文摘要:本研究提出CHOKE现象,即大型语言模型在轻微输入扰动下会覆盖正确知识产生自信但错误的幻觉响应,这种在高风险领域尤为严重的新型幻觉与常规幻觉存在本质区别,且现有缓解方法对其效果有限。
English Summary: This study identifies CHOKE, a distinct type of LLM hallucination where minor input perturbations cause models to override correct knowledge with confident but wrong responses, particularly problematic in high-stakes domains and resistant to current mitigation methods.

Authors:Andrei Jarca, Florinel Alin Croitoru, Radu Tudor Ionescu
Title: Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text
Abstract:
Masked language modeling has become a widely adopted unsupervised technique to pre-train large language models (LLMs). However, the process of selecting tokens for masking is random, and the percentage of masked tokens is typically fixed for the entire training process. In this paper, we propose to adjust the masking ratio and to decide which tokens to mask based on a novel task-informed anti-curriculum learning scheme. First, we harness task-specific knowledge about useful and harmful tokens in order to determine which tokens to mask. Second, we propose a cyclic decaying masking ratio, which corresponds to an anti-curriculum schedule (from hard to easy). We exemplify our novel task-informed anti-curriculum by masking (TIACBM) approach across three diverse downstream tasks: sentiment analysis, text classification by topic, and authorship attribution. Our findings suggest that TIACBM enhances the ability of the model to focus on key task-relevant features, contributing to statistically significant performance gains across tasks. We release our code at https://github.com/JarcaAndrei/TIACBM.
中文摘要:本文提出了一种基于任务信息的反课程掩码方法,通过动态调整掩码比例和选择策略,使模型能聚焦于任务关键特征,在多项自然语言处理任务中实现了显著性能提升。
English Summary: The paper introduces a task-informed anti-curriculum masking approach that dynamically adjusts masking ratios and token selection, significantly improving model performance across multiple NLP tasks by focusing on task-relevant features.

Authors:Lakshmi Nair, Ian Trase, Mark Kim
Title: Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options
Abstract:
We present a novel reasoning approach called Flow-of-Options (FoO), designed to address intrinsic biases in Large Language Models (LLMs). Flow-of-Options enables LLMs to systematically explore a diverse range of possibilities in their reasoning, as demonstrated by an FoO-based agentic framework developed for autonomously solving Machine Learning (ML) tasks. FoO enforces diversity in LLM solutions through compressed and interpretable task representations, resulting in improvements of 38.2% - 69.2% on standard data science tasks, and 37.4% - 47.9% on therapeutic chemistry tasks, as compared to state-of-the-art baselines. With an overall operation cost under $1 per task, our framework is well-suited for cost-sensitive applications. Going beyond tabular classification and regression, we show the broader applicability of our FoO-based agentic system to tasks such as reinforcement learning and image generation. Our code is open-sourced at: https://github.com/flagshippioneering/Flow-of-Options.
中文摘要:Flow-of-Options(FoO)是一种新颖的推理方法,通过系统探索多样化解决方案来减少大语言模型的内在偏差,在数据科学任务上实现38.2%-69.2%、在治疗化学任务上实现37.4%-47.9%的性能提升,且单任务成本低于1美元。
English Summary: Flow-of-Options (FoO) is a novel reasoning approach that mitigates LLM biases by systematically exploring diverse solution possibilities, achieving performance improvements of 38.2%-69.2% on data science tasks and 37.4%-47.9% on therapeutic chemistry tasks while costing under $1 per task.

Authors:David Genova, Philippe Esling, Tom Hurlin
Title: Keep what you need : extracting efficient subnetworks from large audio representation models
Abstract:
Recently, research on audio foundation models has witnessed notable advances, as illustrated by the ever improving results on complex downstream tasks. Subsequently, those pretrained networks have quickly been used for various audio applications. These improvements have however resulted in a considerable increase both in size and complexity of these models. Along the environmental concerns this issue raises, this prevents the deployment of such networks on consumer-level devices, and precludes their use for real-time applications. Moreover, this appears contradictory with the specificity of the tasks for which these models are used, which are often simpler compared to extracting a rich, multi-purpose representation from any type of audio data. In this paper, we address this issue with a simple, yet effective method to extract lightweight specialist subnetworks from large foundation models. Specifically, we introduce learnable binary masks in-between the layers of a pretrained representation model. When training the end-to-end model on a downstream task, we add a sparsity-inducing loss to the overall objective, hence learning a compact subnetwork specialized on a single task. Importantly, the weights of the foundation model are kept frozen, resulting into low additional training costs. Once trained, the masked computational units can then be removed from the network, implying significant performance gains. We assess our method on three widespread audio foundation models, each based on a different backbone architecture, and illustrate its effectiveness on common audio representation evaluation tasks, as well as its versatility on both speech, music, and general audio. Code for reproducing the results and supporting webpage are available at https://github.com/gnvIRCAM/Audio-representation-trimming
中文摘要:本文提出一种通过可学习二值掩码和稀疏性损失,从大型音频基础模型中提取轻量级专用子网络的简单有效方法,无需重新训练核心模型即可实现高效部署。
English Summary: This paper introduces a simple yet effective method to extract lightweight, task-specific subnetworks from large audio foundation models by using learnable binary masks and a sparsity-inducing loss, enabling efficient deployment without retraining the core model.

Authors:Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li
Title: Soundwave: Less is More for Speech-Text Alignment in LLMs
Abstract:
Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.
中文: Soundwave是一种数据高效的语音大语言模型,通过新颖架构解决了语音与文本的差异,仅用五十分之一的训练数据就超越了Qwen2-Audio,同时保持了对话智能。
English: Soundwave is a data-efficient speech LLM that bridges the speech-text gap with a novel architecture, outperforming Qwen2-Audio using only 1/50th of the training data while maintaining conversational intelligence.

Authors:Emmanuel K. Raptis, Athanasios Ch. Kapoutsis, Elias B. Kosmatopoulos
Title: RobotIQ: Empowering Mobile Robots with Human-Level Planning for Real-World Execution
Abstract:
This paper introduces RobotIQ, a framework that empowers mobile robots with human-level planning capabilities, enabling seamless communication via natural language instructions through any Large Language Model. The proposed framework is designed in the ROS architecture and aims to bridge the gap between humans and robots, enabling robots to comprehend and execute user-expressed text or voice commands. Our research encompasses a wide spectrum of robotic tasks, ranging from fundamental logical, mathematical, and learning reasoning for transferring knowledge in domains like navigation, manipulation, and object localization, enabling the application of learned behaviors from simulated environments to real-world operations. All encapsulated within a modular crafted robot library suite of API-wise control functions, RobotIQ offers a fully functional AI-ROS-based toolset that allows researchers to design and develop their own robotic actions tailored to specific applications and robot configurations. The effectiveness of the proposed system was tested and validated both in simulated and real-world experiments focusing on a home service scenario that included an assistive application designed for elderly people. RobotIQ with an open-source, easy-to-use, and adaptable robotic library suite for any robot can be found at https://github.com/emmarapt/RobotIQ.
中文: 本文提出RobotIQ框架,通过大语言模型将人类级规划能力与自然语言指令相结合,使机器人能在模拟和真实环境中完成从导航到操作等各类任务。
English: This paper presents RobotIQ, a framework that integrates human-level planning with natural language instructions through Large Language Models, enabling robots to perform tasks from navigation to manipulation in both simulated and real-world environments.

Authors:Iury Cleveston, Alana C. Santana, Paula D. P. Costa, Ricardo R. Gudwin, Alexandre S. Simões, Esther L. Colombini
Title: InstructRobot: A Model-Free Framework for Mapping Natural Language Instructions into Robot Motion
Abstract:
The ability to communicate with robots using natural language is a significant step forward in human-robot interaction. However, accurately translating verbal commands into physical actions is promising, but still presents challenges. Current approaches require large datasets to train the models and are limited to robots with a maximum of 6 degrees of freedom. To address these issues, we propose a framework called InstructRobot that maps natural language instructions into robot motion without requiring the construction of large datasets or prior knowledge of the robot's kinematics model. InstructRobot employs a reinforcement learning algorithm that enables joint learning of language representations and inverse kinematics model, simplifying the entire learning process. The proposed framework is validated using a complex robot with 26 revolute joints in object manipulation tasks, demonstrating its robustness and adaptability in realistic environments. The framework can be applied to any task or domain where datasets are scarce and difficult to create, making it an intuitive and accessible solution to the challenges of training robots using linguistic communication. Open source code for the InstructRobot framework and experiments can be accessed at https://github.com/icleveston/InstructRobot.
中文:InstructRobot框架通过强化学习将自然语言指令转化为机器人动作,无需大量数据集或预先了解机器人运动学模型,并在复杂机器人上验证了其在实际环境中的有效性和适应性。
English: The InstructRobot framework enables robots to translate natural language commands into physical actions using reinforcement learning, eliminating the need for large datasets or prior kinematic knowledge and demonstrating effectiveness with complex robots in real-world tasks.

Authors:Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li
Title: S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
Abstract:
Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0\% to 81.6\%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S$^2$R. Our code and data are available at https://github.com/NineAbyss/S2R.
中文: 本文提出S²R框架,通过教导模型在推理过程中自我验证与自我修正来增强大语言模型的推理能力,仅需少量训练数据即可显著提升准确率。
English: This paper introduces S²R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference, achieving significant accuracy improvements with minimal training data.

Authors:Gianluca Guglielmo, Marc Masana
Title: Leveraging Intermediate Representations for Better Out-of-Distribution Detection
Abstract:
In real-world applications, machine learning models must reliably detect Out-of-Distribution (OoD) samples to prevent unsafe decisions. Current OoD detection methods often rely on analyzing the logits or the embeddings of the penultimate layer of a neural network. However, little work has been conducted on the exploitation of the rich information encoded in intermediate layers. To address this, we analyze the discriminative power of intermediate layers and show that they can positively be used for OoD detection. Therefore, we propose to regularize intermediate layers with an energy-based contrastive loss, and by grouping multiple layers in a single aggregated response. We demonstrate that intermediate layer activations improves OoD detection performance by running a comprehensive evaluation across multiple datasets.
中文: 机器学习模型可通过基于能量的对比损失和层聚合方法,有效利用中间层信息来提升分布外样本检测性能,多数据集验证了其有效性。
English: Machine learning models can enhance Out-of-Distribution detection by utilizing intermediate layers' information through energy-based contrastive loss and layer aggregation, as validated across multiple datasets.

Authors:Yanru Sun, Zongxia Xie, Haoyu Xing, Hualong Yu, Qinghua Hu
Title: PPGF: Probability Pattern-Guided Time Series Forecasting
Abstract:
Time series forecasting (TSF) is an essential branch of machine learning with various applications. Most methods for TSF focus on constructing different networks to extract better information and improve performance. However, practical application data contain different internal mechanisms, resulting in a mixture of multiple patterns. That is, the model's ability to fit different patterns is different and generates different errors. In order to solve this problem, we propose an end-to-end framework, namely probability pattern-guided time series forecasting (PPGF). PPGF reformulates the TSF problem as a forecasting task guided by probabilistic pattern classification. Firstly, we propose the grouping strategy to approach forecasting problems as classification and alleviate the impact of data imbalance on classification. Secondly, we predict in the corresponding class interval to guarantee the consistency of classification and forecasting. In addition, True Class Probability (TCP) is introduced to pay more attention to the difficult samples to improve the classification accuracy. Detailedly, PPGF classifies the different patterns to determine which one the target value may belong to and estimates it accurately in the corresponding interval. To demonstrate the effectiveness of the proposed framework, we conduct extensive experiments on real-world datasets, and PPGF achieves significant performance improvements over several baseline methods. Furthermore, the effectiveness of TCP and the necessity of consistency between classification and forecasting are proved in the experiments. All data and codes are available online: https://github.com/syrGitHub/PPGF.
Chinese: 本文提出PPGF框架,将时间序列预测重构为概率模式分类任务,通过分组策略和真实类别概率处理混合数据模式,实验证明其有效提升了预测性能。
English: The authors propose PPGF, an end-to-end framework that reformulates time series forecasting as a probabilistic pattern classification task to handle mixed data patterns and improve accuracy through grouping strategies and true class probability.

Authors:Tanqiu Jiang, Changjiang Li, Fenglong Ma, Ting Wang
Title: RAPID: Retrieval Augmented Training of Differentially Private Diffusion Models
Abstract:
Differentially private diffusion models (DPDMs) harness the remarkable generative capabilities of diffusion models while enforcing differential privacy (DP) for sensitive data. However, existing DPDM training approaches often suffer from significant utility loss, large memory footprint, and expensive inference cost, impeding their practical uses. To overcome such limitations, we present RAPID: Retrieval Augmented PrIvate Diffusion model, a novel approach that integrates retrieval augmented generation (RAG) into DPDM training. Specifically, RAPID leverages available public data to build a knowledge base of sample trajectories; when training the diffusion model on private data, RAPID computes the early sampling steps as queries, retrieves similar trajectories from the knowledge base as surrogates, and focuses on training the later sampling steps in a differentially private manner. Extensive evaluation using benchmark datasets and models demonstrates that, with the same privacy guarantee, RAPID significantly outperforms state-of-the-art approaches by large margins in generative quality, memory footprint, and inference cost, suggesting that retrieval-augmented DP training represents a promising direction for developing future privacy-preserving generative models. The code is available at: https://github.com/TanqiuJiang/RAPID
中文: RAPID提出了一种检索增强的差分隐私扩散模型方法,利用公共数据提升训练效率,在同等隐私保护下显著提高了生成质量并降低了内存和计算成本。
English: RAPID introduces a retrieval-augmented approach to differentially private diffusion models, leveraging public data to enhance training efficiency and significantly improving generative quality while reducing memory and computational costs under the same privacy guarantees.

Authors:Xinlong Chen, Yuanxing Zhang, Chongling Rao, Yushuo Guan, Jiaheng Liu, Fuzheng Zhang, Chengru Song, Qiang Liu, Di Zhang, Tieniu Tan
Title: VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation
Abstract:
The training of controllable text-to-video (T2V) models relies heavily on the alignment between videos and captions, yet little existing research connects video caption evaluation with T2V generation assessment. This paper introduces VidCapBench, a video caption evaluation scheme specifically designed for T2V generation, agnostic to any particular caption format. VidCapBench employs a data annotation pipeline, combining expert model labeling and human refinement, to associate each collected video with key information spanning video aesthetics, content, motion, and physical laws. VidCapBench then partitions these key information attributes into automatically assessable and manually assessable subsets, catering to both the rapid evaluation needs of agile development and the accuracy requirements of thorough validation. By evaluating numerous state-of-the-art captioning models, we demonstrate the superior stability and comprehensiveness of VidCapBench compared to existing video captioning evaluation approaches. Verification with off-the-shelf T2V models reveals a significant positive correlation between scores on VidCapBench and the T2V quality evaluation metrics, indicating that VidCapBench can provide valuable guidance for training T2V models. The project is available at https://github.com/VidCapBench/VidCapBench.
中文摘要:本文提出专为文本到视频生成设计的视频字幕评估框架VidCapBench,通过自动与人工评估相结合的方式全面衡量视频多维度特征,验证其与T2V模型质量指标存在显著正相关性。
English Summary: This paper introduces VidCapBench, a video caption evaluation framework designed for text-to-video generation that combines automated and manual assessment across multiple video attributes, demonstrating strong correlation with T2V model quality metrics.

Authors:Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi
Title: R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs
Abstract:
Recent studies have combined Large Language Models (LLMs) with Knowledge Graphs (KGs) to enhance reasoning, improving inference accuracy without additional training while mitigating hallucination. However, existing frameworks still suffer two practical drawbacks: they must be re-tuned whenever the KG or reasoning task changes, and they depend on a single, high-capacity LLM for reliable (i.e., trustworthy) reasoning. To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that separates reasoning into two roles: an Operator (a low-capacity LLM) that gathers evidence and a Supervisor (a high-capacity LLM) that makes final judgments. This design is cost-efficient for LLM inference while still maintaining strong reasoning accuracy. Additionally, R2-KG employs an Abstention mechanism, generating answers only when sufficient evidence is collected from KG, which significantly enhances reliability. Experiments across five diverse benchmarks show that R2-KG consistently outperforms baselines in both accuracy and reliability, regardless of the inherent capability of LLMs used as the Operator. Further experiments reveal that the single-agent version of R2-KG, equipped with a strict self-consistency strategy, achieves significantly higher-than-baseline reliability with reduced inference cost but increased abstention rate in complex KGs. Our findings establish R2-KG as a flexible and cost-effective solution for KG-based reasoning, reducing reliance on high-capacity LLMs while ensuring trustworthy inference. The code is available at https://github.com/ekrxjwh2009/R2-KG/.
中文摘要:近期研究将大语言模型与知识图谱结合以提升推理准确性并减少幻觉,但存在适应性差和依赖高容量模型的问题,为此提出R2-KG双代理框架,通过角色分工和弃权机制实现高效可靠的推理。
English Summary: Recent research integrates LLMs with Knowledge Graphs to boost reasoning accuracy and reduce hallucinations, yet faces issues with adaptability and reliance on high-capacity models, prompting the development of R2-KG, a dual-agent framework that enhances cost-efficiency and reliability through role separation and an abstention mechanism.

Authors:Fabian Bongratz, Yitong Li, Sama Elbaroudy, Christian Wachinger
Title: 3D Shape-to-Image Brownian Bridge Diffusion for Brain MRI Synthesis from Cortical Surfaces
Abstract:
Despite recent advances in medical image generation, existing methods struggle to produce anatomically plausible 3D structures. In synthetic brain magnetic resonance images (MRIs), characteristic fissures are often missing, and reconstructed cortical surfaces appear scattered rather than densely convoluted. To address this issue, we introduce Cor2Vox, the first diffusion model-based method that translates continuous cortical shape priors to synthetic brain MRIs. To achieve this, we leverage a Brownian bridge process which allows for direct structured mapping between shape contours and medical images. Specifically, we adapt the concept of the Brownian bridge diffusion model to 3D and extend it to embrace various complementary shape representations. Our experiments demonstrate significant improvements in the geometric accuracy of reconstructed structures compared to previous voxel-based approaches. Moreover, Cor2Vox excels in image quality and diversity, yielding high variation in non-target structures like the skull. Finally, we highlight the capability of our approach to simulate cortical atrophy at the sub-voxel level. Our code is available at https://github.com/ai-med/Cor2Vox.
中文:Cor2Vox提出了一种创新的扩散模型,能将皮层形状先验转化为解剖结构精确的3D脑部核磁共振图像,显著提升了几何精度,并能实现亚体素级别的皮层萎缩模拟。
English: Cor2Vox introduces a novel diffusion model that translates cortical shape priors into anatomically accurate 3D brain MRIs, significantly improving geometric precision and enabling sub-voxel simulation of cortical atrophy.

Authors:Shengxiang Gao, Jey Han Lau, Jianzhong Qi
Title: Beyond Seen Data: Improving KBQA Generalization Through Schema-Guided Logical Form Generation
Abstract:
Knowledge base question answering (KBQA) aims to answer user questions in natural language using rich human knowledge stored in large KBs. As current KBQA methods struggle with unseen knowledge base elements at test time,we introduce SG-KBQA: a novel model that injects schema contexts into entity retrieval and logical form generation to tackle this issue. It uses the richer semantics and awareness of the knowledge base structure provided by schema contexts to enhance generalizability. We show that SG-KBQA achieves strong generalizability, outperforming state-of-the-art models on two commonly used benchmark datasets across a variety of test settings. Our source code is available at https://github.com/gaosx2000/SG_KBQA.
中文:SG-KBQA是一种新颖模型,通过将知识库模式上下文融入实体检索和逻辑形式生成,有效提升了知识库问答的泛化能力,在多种测试设置下超越了现有最优方法。
English: SG-KBQA is a novel model that enhances generalizability in knowledge base question answering by incorporating schema contexts into entity retrieval and logical form generation, outperforming state-of-the-art methods on benchmark datasets.

Authors:Yuanfan Li, Zhaohan Zhang, Chengzhengxu Li, Chao Shen, Xiaoming Liu
Title: Iron Sharpens Iron: Defending Against Attacks in Machine-Generated Text Detection with Adversarial Training
Abstract:
Machine-generated Text (MGT) detection is crucial for regulating and attributing online texts. While the existing MGT detectors achieve strong performance, they remain vulnerable to simple perturbations and adversarial attacks. To build an effective defense against malicious perturbations, we view MGT detection from a threat modeling perspective, that is, analyzing the model's vulnerability from an adversary's point of view and exploring effective mitigations. To this end, we introduce an adversarial framework for training a robust MGT detector, named GREedy Adversary PromoTed DefendER (GREATER). The GREATER consists of two key components: an adversary GREATER-A and a detector GREATER-D. The GREATER-D learns to defend against the adversarial attack from GREATER-A and generalizes the defense to other attacks. GREATER-A identifies and perturbs the critical tokens in embedding space, along with greedy search and pruning to generate stealthy and disruptive adversarial examples. Besides, we update the GREATER-A and GREATER-D synchronously, encouraging the GREATER-D to generalize its defense to different attacks and varying attack intensities. Our experimental results across 10 text perturbation strategies and 6 adversarial attacks show that our GREATER-D reduces the Attack Success Rate (ASR) by 0.67% compared with SOTA defense methods while our GREATER-A is demonstrated to be more effective and efficient than SOTA attack approaches. Codes and dataset are available in https://github.com/Liyuuuu111/GREATER.
中文摘要:本研究提出GREATER对抗训练框架,通过同步优化检测器防御能力和对抗样本生成策略,显著提升了机器生成文本检测的鲁棒性,在多种攻击场景下均优于现有方法。
English Summary: The study introduces GREATER, an adversarial training framework that enhances machine-generated text detection by simultaneously strengthening the detector against attacks and refining adversarial examples to improve robustness across various perturbation strategies.

Authors:Haoyuan Wu, Haisheng Zheng, Yuan Pu, Bei Yu
Title: Circuit Representation Learning with Masked Gate Modeling and Verilog-AIG Alignment
Abstract:
Understanding the structure and function of circuits is crucial for electronic design automation (EDA). Circuits can be formulated as And-Inverter graphs (AIGs), enabling efficient implementation of representation learning through graph neural networks (GNNs). Masked modeling paradigms have been proven effective in graph representation learning. However, masking augmentation to original circuits will destroy their logical equivalence, which is unsuitable for circuit representation learning. Moreover, existing masked modeling paradigms often prioritize structural information at the expense of abstract information such as circuit function. To address these limitations, we introduce MGVGA, a novel constrained masked modeling paradigm incorporating masked gate modeling (MGM) and Verilog-AIG alignment (VGA). Specifically, MGM preserves logical equivalence by masking gates in the latent space rather than in the original circuits, subsequently reconstructing the attributes of these masked gates. Meanwhile, large language models (LLMs) have demonstrated an excellent understanding of the Verilog code functionality. Building upon this capability, VGA performs masking operations on original circuits and reconstructs masked gates under the constraints of equivalent Verilog codes, enabling GNNs to learn circuit functions from LLMs. We evaluate MGVGA on various logic synthesis tasks for EDA and show the superior performance of MGVGA compared to previous state-of-the-art methods. Our code is available at https://github.com/wuhy68/MGVGA.
中文: 本文提出MGVGA,一种新颖的约束掩码建模范式,通过在潜在空间进行掩码门建模并结合Verilog-AIG对齐来保持电路表示学习中的逻辑等价性,在EDA任务中展现出优越性能。
English: The paper introduces MGVGA, a novel constrained masked modeling paradigm that preserves logical equivalence in circuit representation learning by combining masked gate modeling in latent space and Verilog-AIG alignment, demonstrating superior performance in EDA tasks.

Authors:Thierry Judge, Olivier Bernard, Woo-Jin Cho Kim, Alberto Gomez, Arian Beqiri, Agisilaos Chartsias, Pierre-Marc Jodoin
Title: Uncertainty Propagation for Echocardiography Clinical Metric Estimation via Contour Sampling
Abstract:
Echocardiography plays a fundamental role in the extraction of important clinical parameters (e.g. left ventricular volume and ejection fraction) required to determine the presence and severity of heart-related conditions. When deploying automated techniques for computing these parameters, uncertainty estimation is crucial for assessing their utility. Since clinical parameters are usually derived from segmentation maps, there is no clear path for converting pixel-wise uncertainty values into uncertainty estimates in the downstream clinical metric calculation. In this work, we propose a novel uncertainty estimation method based on contouring rather than segmentation. Our method explicitly predicts contour location uncertainty from which contour samples can be drawn. Finally, the sampled contours can be used to propagate uncertainty to clinical metrics. Our proposed method not only provides accurate uncertainty estimations for the task of contouring but also for the downstream clinical metrics on two cardiac ultrasound datasets. Code is available at: https://github.com/ThierryJudge/contouring-uncertainty.
中文: 本研究提出了一种基于轮廓检测的新型不确定性估计方法,能准确预测轮廓位置的不确定性并将其传递至临床指标,在心脏超声数据集上优于基于分割的方法。
English: This study introduces a novel uncertainty estimation method based on contouring, which accurately predicts contour location uncertainty and propagates it to clinical metrics, outperforming segmentation-based approaches on cardiac ultrasound datasets.

Authors:Timon Winter, Stanislav Frolov, Brian Bernhard Moser, Andreas Dengel
Title: Spherical Dense Text-to-Image Synthesis
Abstract:
Recent advancements in text-to-image (T2I) have improved synthesis results, but challenges remain in layout control and generating omnidirectional panoramic images. Dense T2I (DT2I) and spherical T2I (ST2I) models address these issues, but so far no unified approach exists. Trivial approaches, like prompting a DT2I model to generate panoramas can not generate proper spherical distortions and seamless transitions at the borders. Our work shows that spherical dense text-to-image (SDT2I) can be achieved by integrating training-free DT2I approaches into finetuned panorama models. Specifically, we propose MultiStitchDiffusion (MSTD) and MultiPanFusion (MPF) by integrating MultiDiffusion into StitchDiffusion and PanFusion, respectively. Since no benchmark for SDT2I exists, we further construct Dense-Synthetic-View (DSynView), a new synthetic dataset containing spherical layouts to evaluate our models. Our results show that MSTD outperforms MPF across image quality as well as prompt- and layout adherence. MultiPanFusion generates more diverse images but struggles to synthesize flawless foreground objects. We propose bootstrap-coupling and turning off equirectangular perspective-projection attention in the foreground as an improvement of MPF. Link to code https://github.com/sdt2i/spherical-dense-text-to-image
中文: 本研究提出了MultiStitchDiffusion和MultiPanFusion两种统一方法,通过整合现有模型实现球形密集文本到图像生成,解决了布局控制和全景无缝合成的难题,其中MSTD在图像质量与内容贴合度方面表现更优。
English: This work introduces MultiStitchDiffusion and MultiPanFusion as unified approaches for spherical dense text-to-image generation, addressing layout control and seamless panoramic synthesis through integration with existing models, with MSTD showing superior performance in quality and adherence metrics.

Authors:Jianping Li, Zhongyuan Liu, Xinhang Xu, Jinxin Liu, Shenghai Yuan, Fang Xu, Lihua Xie
Title: LiMo-Calib: On-Site Fast LiDAR-Motor Calibration for Quadruped Robot-Based Panoramic 3D Sensing System
Abstract:
Conventional single LiDAR systems are inherently constrained by their limited field of view (FoV), leading to blind spots and incomplete environmental awareness, particularly on robotic platforms with strict payload limitations. Integrating a motorized LiDAR offers a practical solution by significantly expanding the sensor's FoV and enabling adaptive panoramic 3D sensing. However, the high-frequency vibrations of the quadruped robot introduce calibration challenges, causing variations in the LiDAR-motor transformation that degrade sensing accuracy. Existing calibration methods that use artificial targets or dense feature extraction lack feasibility for on-site applications and real-time implementation. To overcome these limitations, we propose LiMo-Calib, an efficient on-site calibration method that eliminates the need for external targets by leveraging geometric features directly from raw LiDAR scans. LiMo-Calib optimizes feature selection based on normal distribution to accelerate convergence while maintaining accuracy and incorporates a reweighting mechanism that evaluates local plane fitting quality to enhance robustness. We integrate and validate the proposed method on a motorized LiDAR system mounted on a quadruped robot, demonstrating significant improvements in calibration efficiency and 3D sensing accuracy, making LiMo-Calib well-suited for real-world robotic applications. We further demonstrate the accuracy improvements of the LIO on the panoramic 3D sensing system using the calibrated parameters. The code will be available at: https://github.com/kafeiyin00/LiMo-Calib.
中文: LiMo-Calib是一种高效的现场校准方法,无需外部标定物,直接利用原始激光雷达扫描的几何特征,显著提高了四足机器人上旋转激光雷达系统的校准效率和三维感知精度。
English: LiMo-Calib is an efficient on-site calibration method that eliminates the need for external targets by leveraging geometric features from raw LiDAR scans, significantly improving calibration efficiency and 3D sensing accuracy for motorized LiDAR systems on quadruped robots.

Authors:Oğuzhan Canpolat, A. Giray Yağlıkçı, Geraldo F. Oliveira, Ataberk Olgun, Nisa Bostancı, İsmail Emir Yüksel, Haocong Luo, Oğuz Ergin, Onur Mutlu
Title: Chronus: Understanding and Securing the Cutting-Edge Industry Solutions to DRAM Read Disturbance
Abstract:
We 1) present the first rigorous security, performance, energy, and cost analyses of the state-of-the-art on-DRAM-die read disturbance mitigation method, Per Row Activation Counting (PRAC) and 2) propose Chronus, a new mechanism that addresses PRAC's two major weaknesses. Our analysis shows that PRAC's system performance overhead on benign applications is non-negligible for modern DRAM chips and prohibitively large for future DRAM chips that are more vulnerable to read disturbance. We identify two weaknesses of PRAC that cause these overheads. First, PRAC increases critical DRAM access latency parameters due to the additional time required to increment activation counters. Second, PRAC performs a constant number of preventive refreshes at a time, making it vulnerable to an adversarial access pattern, known as the wave attack, and consequently requiring it to be configured for significantly smaller activation thresholds. To address PRAC's two weaknesses, we propose a new on-DRAM-die RowHammer mitigation mechanism, Chronus. Chronus 1) updates row activation counters concurrently while serving accesses by separating counters from the data and 2) prevents the wave attack by dynamically controlling the number of preventive refreshes performed. Our performance analysis shows that Chronus's system performance overhead is near-zero for modern DRAM chips and very low for future DRAM chips. Chronus outperforms three variants of PRAC and three other state-of-the-art read disturbance solutions. We discuss Chronus's and PRAC's implications for future systems and foreshadow future research directions. To aid future research, we open-source our Chronus implementation at https://github.com/CMU-SAFARI/Chronus.
Chinese: 本研究批判性分析了PRAC在现代及未来DRAM芯片中的显著性能缺陷,并提出新型机制Chronus通过并行计数器更新和动态刷新控制克服这些弱点,实现了近乎零的性能开销。
English: This study critically analyzes PRAC's significant performance limitations in modern and future DRAM chips and introduces Chronus, a novel mechanism that overcomes these weaknesses through concurrent counter updates and dynamic refresh control, achieving near-zero overhead.

Authors:Zhiyuan Liu, Yanchen Luo, Han Huang, Enzhi Zhang, Sihang Li, Junfeng Fang, Yaorui Shi, Xiang Wang, Kenji Kawaguchi, Tat-Seng Chua
Title: NExT-Mol: 3D Diffusion Meets 1D Language Modeling for 3D Molecule Generation
Abstract:
3D molecule generation is crucial for drug discovery and material design. While prior efforts focus on 3D diffusion models for their benefits in modeling continuous 3D conformers, they overlook the advantages of 1D SELFIES-based Language Models (LMs), which can generate 100% valid molecules and leverage the billion-scale 1D molecule datasets. To combine these advantages for 3D molecule generation, we propose a foundation model -- NExT-Mol: 3D Diffusion Meets 1D Language Modeling for 3D Molecule Generation. NExT-Mol uses an extensively pretrained molecule LM for 1D molecule generation, and subsequently predicts the generated molecule's 3D conformers with a 3D diffusion model. We enhance NExT-Mol's performance by scaling up the LM's model size, refining the diffusion neural architecture, and applying 1D to 3D transfer learning. Notably, our 1D molecule LM significantly outperforms baselines in distributional similarity while ensuring validity, and our 3D diffusion model achieves leading performances in conformer prediction. Given these improvements in 1D and 3D modeling, NExT-Mol achieves a 26% relative improvement in 3D FCD for de novo 3D generation on GEOM-DRUGS, and a 13% average relative gain for conditional 3D generation on QM9-2014. Our codes and pretrained checkpoints are available at https://github.com/acharkq/NExT-Mol.
中文:NExT-Mol模型结合预训练的一维语言模型生成有效分子与三维扩散模型预测构象,在基准数据集上实现了从头生成和条件性三维分子生成的显著提升。
English: The NExT-Mol model integrates a pretrained 1D language model for generating valid molecules with a 3D diffusion model for predicting conformers, achieving significant improvements in both de novo and conditional 3D molecule generation on benchmark datasets.

Authors:Mingyang Sun, Pengxiang Ding, Weinan Zhang, Donglin Wang
Title: Score-Based Diffusion Policy Compatible with Reinforcement Learning via Optimal Transport
Abstract:
Diffusion policies have shown promise in learning complex behaviors from demonstrations, particularly for tasks requiring precise control and long-term planning. However, they face challenges in robustness when encountering distribution shifts. This paper explores improving diffusion-based imitation learning models through online interactions with the environment. We propose OTPR (Optimal Transport-guided score-based diffusion Policy for Reinforcement learning fine-tuning), a novel method that integrates diffusion policies with RL using optimal transport theory. OTPR leverages the Q-function as a transport cost and views the policy as an optimal transport map, enabling efficient and stable fine-tuning. Moreover, we introduce masked optimal transport to guide state-action matching using expert keypoints and a compatibility-based resampling strategy to enhance training stability. Experiments on three simulation tasks demonstrate OTPR's superior performance and robustness compared to existing methods, especially in complex and sparse-reward environments. In sum, OTPR provides an effective framework for combining IL and RL, achieving versatile and reliable policy learning. The code will be released at https://github.com/Sunmmyy/OTPR.git.
中文: 本文提出OTPR方法,通过最优传输理论将扩散策略与强化学习相结合,有效提升了模仿学习的鲁棒性和性能,尤其在复杂和稀疏奖励环境中表现优异。
English: This paper introduces OTPR, a novel method that integrates diffusion policies with reinforcement learning using optimal transport theory to enhance robustness and performance in imitation learning, particularly in complex and sparse-reward environments.

Authors:Tanzhe Li, Caoshuo Li, Jiayi Lyu, Hongjuan Pei, Baochang Zhang, Taisong Jin, Rongrong Ji
Title: DAMamba: Vision State Space Model with Dynamic Adaptive Scan
Abstract:
State space models (SSMs) have recently garnered significant attention in computer vision. However, due to the unique characteristics of image data, adapting SSMs from natural language processing to computer vision has not outperformed the state-of-the-art convolutional neural networks (CNNs) and Vision Transformers (ViTs). Existing vision SSMs primarily leverage manually designed scans to flatten image patches into sequences locally or globally. This approach disrupts the original semantic spatial adjacency of the image and lacks flexibility, making it difficult to capture complex image structures. To address this limitation, we propose Dynamic Adaptive Scan (DAS), a data-driven method that adaptively allocates scanning orders and regions. This enables more flexible modeling capabilities while maintaining linear computational complexity and global modeling capacity. Based on DAS, we further propose the vision backbone DAMamba, which significantly outperforms current state-of-the-art vision Mamba models in vision tasks such as image classification, object detection, instance segmentation, and semantic segmentation. Notably, it surpasses some of the latest state-of-the-art CNNs and ViTs. Code will be available at https://github.com/ltzovo/DAMamba.
中文: 提出的动态自适应扫描(DAS)方法自适应分配扫描顺序和区域,克服了现有视觉状态空间模型的局限,并基于此开发了DAMamba视觉主干网络,在多项视觉任务中显著超越了当前最先进模型。
English: The proposed Dynamic Adaptive Scan (DAS) method adaptively allocates scanning orders and regions to overcome the limitations of existing vision SSMs, leading to the development of DAMamba, a vision backbone that outperforms current state-of-the-art models in various vision tasks.

Authors:Lu Yang, Jiajia Li, En Ci, Lefei Zhang, Zuchao Li, Ping Wang
Title: Label Drop for Multi-Aspect Relation Modeling in Universal Information Extraction
Abstract:
Universal Information Extraction (UIE) has garnered significant attention due to its ability to address model explosion problems effectively. Extractive UIE can achieve strong performance using a relatively small model, making it widely adopted. Extractive UIEs generally rely on task instructions for different tasks, including single-target instructions and multiple-target instructions. Single-target instruction UIE enables the extraction of only one type of relation at a time, limiting its ability to model correlations between relations and thus restricting its capability to extract complex relations. While multiple-target instruction UIE allows for the extraction of multiple relations simultaneously, the inclusion of irrelevant relations introduces decision complexity and impacts extraction accuracy. Therefore, for multi-relation extraction, we propose LDNet, which incorporates multi-aspect relation modeling and a label drop mechanism. By assigning different relations to different levels for understanding and decision-making, we reduce decision confusion. Additionally, the label drop mechanism effectively mitigates the impact of irrelevant relations. Experiments show that LDNet outperforms or achieves competitive performance with state-of-the-art systems on 9 tasks, 33 datasets, in both single-modal and multi-modal, few-shot and zero-shot settings.\footnote{https://github.com/Lu-Yang666/LDNet}
中文: 提出的LDNet通过多角度关系建模和标签丢弃机制,有效减少决策混淆并降低无关关系影响,在多种任务和场景下展现出优于或媲美先进系统的性能。
English: LDNet is proposed to enhance multi-relation extraction by employing multi-aspect relation modeling and a label drop mechanism, which reduces decision confusion and mitigates irrelevant relation impacts, demonstrating superior or competitive performance across diverse tasks and settings.

Authors:Yuhan Li, Xinni Zhang, Linhao Luo, Heng Chang, Yuxiang Ren, Irwin King, Jia Li
Title: G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation
Abstract:
Explainable recommendation has demonstrated significant advantages in informing users about the logic behind recommendations, thereby increasing system transparency, effectiveness, and trustworthiness. To provide personalized and interpretable explanations, existing works often combine the generation capabilities of large language models (LLMs) with collaborative filtering (CF) information. CF information extracted from the user-item interaction graph captures the user behaviors and preferences, which is crucial for providing informative explanations. However, due to the complexity of graph structure, effectively extracting the CF information from graphs still remains a challenge. Moreover, existing methods often struggle with the integration of extracted CF information with LLMs due to its implicit representation and the modality gap between graph structures and natural language explanations. To address these challenges, we propose G-Refer, a framework using graph retrieval-augmented large language models (LLMs) for explainable recommendation. Specifically, we first employ a hybrid graph retrieval mechanism to retrieve explicit CF signals from both structural and semantic perspectives. The retrieved CF information is explicitly formulated as human-understandable text by the proposed graph translation and accounts for the explanations generated by LLMs. To bridge the modality gap, we introduce knowledge pruning and retrieval-augmented fine-tuning to enhance the ability of LLMs to process and utilize the retrieved CF information to generate explanations. Extensive experiments show that G-Refer achieves superior performance compared with existing methods in both explainability and stability. Codes and data are available at https://github.com/Yuhan1i/G-Refer.
中文: G-Refer框架通过混合图检索机制提取显式协同过滤信号,并借助知识剪枝和微调技术将其与大语言模型融合,在可解释性和稳定性方面实现了优越性能。
English: The G-Refer framework enhances explainable recommendations by using a hybrid graph retrieval mechanism to extract explicit collaborative filtering signals and integrating them with large language models through knowledge pruning and fine-tuning, achieving superior performance in explainability and stability.

Authors:Juefeng Xiao, Tianqi Xiang, Zhigang Tu
Title: Adaptive Prototype Model for Attribute-based Multi-label Few-shot Action Recognition
Abstract:
In real-world action recognition systems, incorporating more attributes helps achieve a more comprehensive understanding of human behavior. However, using a single model to simultaneously recognize multiple attributes can lead to a decrease in accuracy. In this work, we propose a novel method i.e. Adaptive Attribute Prototype Model (AAPM) for human action recognition, which captures rich action-relevant attribute information and strikes a balance between accuracy and robustness. Firstly, we introduce the Text-Constrain Module (TCM) to incorporate textual information from potential labels, and constrain the construction of different attributes prototype representations. In addition, we explore the Attribute Assignment Method (AAM) to address the issue of training bias and increase robustness during the training process.Furthermore, we construct a new video dataset with attribute-based multi-label called Multi-Kinetics for evaluation, which contains various attribute labels (e.g. action, scene, object, etc.) related to human behavior. Extensive experiments demonstrate that our AAPM achieves the state-of-the-art performance in both attribute-based multi-label few-shot action recognition and single-label few-shot action recognition. The project and dataset are available at an anonymous account https://github.com/theAAPM/AAPM
中文: 自适应属性原型模型(AAPM)通过引入文本约束模块和属性分配方法,有效提升了多属性动作识别的准确性和鲁棒性,在新建的Multi-Kinetics数据集上实现了多标签和单标签小样本识别的最优性能。
English: The Adaptive Attribute Prototype Model (AAPM) enhances human action recognition by integrating textual constraints and addressing training bias, achieving state-of-the-art performance in both multi-label and single-label few-shot tasks, as validated on the new Multi-Kinetics dataset.

Authors:Minghao Fu, Guo-Hua Wang, Liangfu Cao, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
Title: CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation
Abstract:
Diffusion models have emerged as a dominant approach for text-to-image generation. Key components such as the human preference alignment and classifier-free guidance play a crucial role in ensuring generation quality. However, their independent application in current text-to-image models continues to face significant challenges in achieving strong text-image alignment, high generation quality, and consistency with human aesthetic standards. In this work, we for the first time, explore facilitating the collaboration of human performance alignment and test-time sampling to unlock the potential of text-to-image models. Consequently, we introduce CHATS (Combining Human-Aligned optimization and Test-time Sampling), a novel generative framework that separately models the preferred and dispreferred distributions and employs a proxy-prompt-based sampling strategy to utilize the useful information contained in both distributions. We observe that CHATS exhibits exceptional data efficiency, achieving strong performance with only a small, high-quality funetuning dataset. Extensive experiments demonstrate that CHATS surpasses traditional preference alignment methods, setting new state-of-the-art across various standard benchmarks.
Chinese: 本文提出CHATS框架,通过结合人类偏好对齐与测试时采样,分别建模偏好与非偏好分布,以少量高质量数据实现了文本到图像生成的卓越效果,在多项基准测试中创下新纪录。
English: This paper introduces CHATS, a novel framework that combines human preference alignment with test-time sampling to enhance text-to-image generation by modeling preferred and dispreferred distributions, achieving state-of-the-art performance with high data efficiency.

Authors:Pengyu Zhu, Zhenhong Zhou, Yuanhe Zhang, Shilinlu Yan, Kun Wang, Sen Su
Title: DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent
Abstract:
As LLM-based agents become increasingly prevalent, backdoors can be implanted into agents through user queries or environment feedback, raising critical concerns regarding safety vulnerabilities. However, backdoor attacks are typically detectable by safety audits that analyze the reasoning process of agents. To this end, we propose a novel backdoor implantation strategy called \textbf{Dynamically Encrypted Multi-Backdoor Implantation Attack}. Specifically, we introduce dynamic encryption, which maps the backdoor into benign content, effectively circumventing safety audits. To enhance stealthiness, we further decompose the backdoor into multiple sub-backdoor fragments. Based on these advancements, backdoors are allowed to bypass safety audits significantly. Additionally, we present AgentBackdoorEval, a dataset designed for the comprehensive evaluation of agent backdoor attacks. Experimental results across multiple datasets demonstrate that our method achieves an attack success rate nearing 100\% while maintaining a detection rate of 0\%, illustrating its effectiveness in evading safety audits. Our findings highlight the limitations of existing safety mechanisms in detecting advanced attacks, underscoring the urgent need for more robust defenses against backdoor threats. Code and data are available at https://github.com/whfeLingYu/DemonAgent.
中文摘要:该研究提出了一种名为动态加密多后门植入的新型攻击方法,通过将后门加密为良性内容并分割成多个片段来规避安全审计,实现了接近100%的攻击成功率与零检测率。
English Summary: The study introduces a novel backdoor attack method called Dynamically Encrypted Multi-Backdoor Implantation that evades safety audits by encrypting backdoors as benign content and splitting them into fragments, achieving near-perfect attack success with zero detection rates.

Authors:Chao Yang, Yong Fan, Cheng Lu, Minghao Yuan, Zhijing Yang
Title: GVTNet: Graph Vision Transformer For Face Super-Resolution
Abstract:
Recent advances in face super-resolution research have utilized the Transformer architecture. This method processes the input image into a series of small patches. However, because of the strong correlation between different facial components in facial images. When it comes to super-resolution of low-resolution images, existing algorithms cannot handle the relationships between patches well, resulting in distorted facial components in the super-resolution results. To solve the problem, we propose a transformer architecture based on graph neural networks called graph vision transformer network. We treat each patch as a graph node and establish an adjacency matrix based on the information between patches. In this way, the patch only interacts between neighboring patches, further processing the relationship of facial components. Quantitative and visualization experiments have underscored the superiority of our algorithm over state-of-the-art techniques. Through detailed comparisons, we have demonstrated that our algorithm possesses more advanced super-resolution capabilities, particularly in enhancing facial components. The PyTorch code is available at https://github.com/continueyang/GVTNet
中文摘要:提出的图视觉Transformer网络通过将图像块视为图节点并利用邻接矩阵优化块间交互,有效解决了超分辨率中面部组件失真的问题,展现出优于现有技术的性能。
English Summary: The proposed Graph Vision Transformer Network addresses facial component distortion in super-resolution by treating image patches as graph nodes and using an adjacency matrix to optimize patch interactions, demonstrating superior performance over existing methods.

Authors:Kaiyang Wan, Honglin Mu, Rui Hao, Haoran Luo, Tianle Gu, Xiuying Chen
Title: A Cognitive Writing Perspective for Constrained Long-Form Text Generation
Abstract:
Like humans, Large Language Models (LLMs) struggle to generate high-quality long-form text that adheres to strict requirements in a single pass. This challenge is unsurprising, as successful human writing, according to the Cognitive Writing Theory, is a complex cognitive process involving iterative planning, translating, reviewing, and monitoring. Motivated by these cognitive principles, we aim to equip LLMs with human-like cognitive writing capabilities through CogWriter, a novel training-free framework that transforms LLM constrained long-form text generation into a systematic cognitive writing paradigm. Our framework consists of two key modules: (1) a Planning Agent that performs hierarchical planning to decompose the task, and (2) multiple Generation Agents that execute these plans in parallel. The system maintains quality via continuous monitoring and reviewing mechanisms, which evaluate outputs against specified requirements and trigger necessary revisions. CogWriter demonstrates exceptional performance on LongGenBench, a benchmark for complex constrained long-form text generation. Even when using Qwen-2.5-14B as its backbone, CogWriter surpasses GPT-4o by 22% in complex instruction completion accuracy while reliably generating texts exceeding 10,000 words. We hope this cognitive science-inspired approach provides a paradigm for LLM writing advancements: \href{https://github.com/KaiyangWan/CogWriter}{CogWriter}.
中文: CogWriter提出了一种无需训练的框架,通过模拟人类认知写作过程,结合分层规划、并行生成和持续监控,使大语言模型在复杂长文本生成中表现出色,其准确性和文本长度均大幅超越GPT-4o。
English: CogWriter introduces a training-free framework that mimics human cognitive writing processes, enabling LLMs to excel in complex long-form text generation by integrating hierarchical planning, parallel generation, and continuous monitoring, significantly outperforming GPT-4o in accuracy and length.

Authors:Chao Yang, Yong Fan, Qichao Zhang, Cheng Lu, Zhijing Yang
Title: DeltaDiff: Reality-Driven Diffusion with AnchorResiduals for Faithful SR
Abstract:
Recently, the transfer application of diffusion models in super-resolu-tion tasks has faced the problem ofdecreased fidelity. Due to the inherent randomsampling characteristics ofdiffusion models, direct application in super-resolu-tion tasks can result in generated details deviating from the true distribution ofhigh-resolution images. To address this, we propose DeltaDiff, a novel frame.work that constrains the difusion process, its essence is to establish a determin-istic mapping path between HR and LR, rather than the random noise disturbanceprocess oftraditional difusion models. Theoretical analysis demonstrates a 25%reduction in diffusion entropy in the residual space compared to pixel-space diffiusion, effectively suppressing irrelevant noise interference. The experimentalresults show that our method surpasses state-of-the-art models and generates re-sults with better fidelity. This work establishes a new low-rank constrained par-adigm for applying diffusion models to image reconstruction tasks, balancingstochastic generation with structural fidelity. Our code and model are publiclyavailable at https://github.com/continueyang/DeltaDiff .
中文:DeltaDiff通过建立高分辨率与低分辨率图像间的确定性映射路径,将扩散熵降低25%,在超分辨率任务中有效抑制噪声干扰,以超越现有最优模型的保真度生成结果。
English: DeltaDiff introduces a deterministic mapping between high- and low-resolution images to reduce diffusion entropy by 25%, surpassing state-of-the-art models in fidelity by constraining noise interference in super-resolution tasks.

Authors:Weikai Lu, Hao Peng, Huiping Zhuang, Cen Chen, Ziqian Zeng
Title: SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings
Abstract:
Multimodal Large Language Models (MLLMs) have serious security vulnerabilities.While safety alignment using multimodal datasets consisting of text and data of additional modalities can effectively enhance MLLM's security, it is costly to construct these datasets. Existing low-resource security alignment methods, including textual alignment, have been found to struggle with the security risks posed by additional modalities. To address this, we propose Synthetic Embedding augmented safety Alignment (SEA), which optimizes embeddings of additional modality through gradient updates to expand textual datasets. This enables multimodal safety alignment training even when only textual data is available. Extensive experiments on image, video, and audio-based MLLMs demonstrate that SEA can synthesize a high-quality embedding on a single RTX3090 GPU within 24 seconds. SEA significantly improves the security of MLLMs when faced with threats from additional modalities. To assess the security risks introduced by video and audio, we also introduced a new benchmark called VA-SafetyBench. High attack success rates across multiple MLLMs validate its challenge. Our code and data will be available at https://github.com/ZeroNLP/SEA.
Chinese: 提出的合成嵌入增强安全对齐(SEA)方法通过仅使用文本数据优化多模态嵌入,有效提升多模态大语言模型的安全性,能以较低计算成本应对跨模态威胁。
English: The proposed Synthetic Embedding augmented safety Alignment (SEA) method enhances multimodal large language models' security by optimizing embeddings for additional modalities using only textual data, effectively countering cross-modal threats while being computationally efficient.

Authors:Yong Zhao, Kai Xu, Zhengqiu Zhu, Yue Hu, Zhiheng Zheng, Yingfeng Chen, Yatai Ji, Chen Gao, Yong Li, Jincai Huang
Title: CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space
Abstract:
Embodied Question Answering (EQA) has primarily focused on indoor environments, leaving the complexities of urban settings-spanning environment, action, and perception-largely unexplored. To bridge this gap, we introduce CityEQA, a new task where an embodied agent answers open-vocabulary questions through active exploration in dynamic city spaces. To support this task, we present CityEQA-EC, the first benchmark dataset featuring 1,412 human-annotated tasks across six categories, grounded in a realistic 3D urban simulator. Moreover, we propose Planner-Manager-Actor (PMA), a novel agent tailored for CityEQA. PMA enables long-horizon planning and hierarchical task execution: the Planner breaks down the question answering into sub-tasks, the Manager maintains an object-centric cognitive map for spatial reasoning during the process control, and the specialized Actors handle navigation, exploration, and collection sub-tasks. Experiments demonstrate that PMA achieves 60.7% of human-level answering accuracy, significantly outperforming competitive baselines. While promising, the performance gap compared to humans highlights the need for enhanced visual reasoning in CityEQA. This work paves the way for future advancements in urban spatial intelligence. Dataset and code are available at https://github.com/BiluYong/CityEQA.git.
中文: 本文提出了CityEQA这一城市环境具身问答新任务,配套发布了CityEQA-EC基准数据集和规划者-管理者-执行者智能体,该模型达到了60.7%的人类回答准确率,但在视觉推理方面仍需提升。
English: This paper introduces CityEQA, a novel task for embodied question answering in dynamic urban environments, supported by the CityEQA-EC benchmark dataset and a Planner-Manager-Actor agent that achieves 60.7% human-level accuracy but requires improved visual reasoning.

Authors:Yunjie Tian, Qixiang Ye, David Doermann
Title: YOLOv12: Attention-Centric Real-Time Object Detectors
Abstract:
Enhancing the network architecture of the YOLO framework has been crucial for a long time, but has focused on CNN-based improvements despite the proven superiority of attention mechanisms in modeling capabilities. This is because attention-based models cannot match the speed of CNN-based models. This paper proposes an attention-centric YOLO framework, namely YOLOv12, that matches the speed of previous CNN-based ones while harnessing the performance benefits of attention mechanisms. YOLOv12 surpasses all popular real-time object detectors in accuracy with competitive speed. For example, YOLOv12-N achieves 40.6% mAP with an inference latency of 1.64 ms on a T4 GPU, outperforming advanced YOLOv10-N / YOLOv11-N by 2.1%/1.2% mAP with a comparable speed. This advantage extends to other model scales. YOLOv12 also surpasses end-to-end real-time detectors that improve DETR, such as RT-DETR / RT-DETRv2: YOLOv12-S beats RT-DETR-R18 / RT-DETRv2-R18 while running 42% faster, using only 36% of the computation and 45% of the parameters. More comparisons are shown in Figure 1.
中文: YOLOv12提出了一种以注意力为核心的框架,在保持竞争性速度的同时实现了更高的精度,在效率和性能上均超越了基于CNN和端到端的实时检测器。
English: YOLOv12 introduces an attention-centric framework that achieves superior accuracy with competitive speed, outperforming both CNN-based and end-to-end real-time detectors in efficiency and performance.

Authors:Tiancheng Gu, Kaicheng Yang, Chaoyi Zhang, Yin Xie, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng
Title: RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm
Abstract:
After pre-training on extensive image-text pairs, Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of multimodal interleaved documents remains underutilized for contrastive vision-language representation learning. To fully leverage these unpaired documents, we initially establish a Real-World Data Extraction pipeline to extract high-quality images and texts. Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. To further enhance fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations, we construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M. We compare our dataset with other widely used datasets of equivalent scale for CLIP training. Models pre-trained on RealSyn consistently achieve state-of-the-art performance across various downstream tasks, including linear probe, zero-shot transfer, zero-shot robustness, and zero-shot retrieval. Furthermore, extensive experiments confirm that RealSyn significantly enhances contrastive vision-language representation learning and demonstrates robust scalability. To facilitate future research, the RealSyn dataset and pretrained model weights are released at https://github.com/deepglint/RealSyn.
中文: RealSyn数据集通过分层检索方法提取高质量图像并关联语义相关文本,结合合成文本生成增强视觉信息,使CLIP模型在多种下游任务中实现最优性能,并展现出强大的可扩展性。
English: The RealSyn dataset, constructed by extracting and associating high-quality images with semantically relevant texts through a hierarchical retrieval method and enhanced with synthetic text generation, enables CLIP models to achieve state-of-the-art performance across various downstream tasks.

Authors:Liangying Shao, Yanfu Yan, Denys Poshyvanyk, Jinsong Su
Title: UniGenCoder: Merging Seq2Seq and Seq2Tree Paradigms for Unified Code Generation
Abstract:
Deep learning-based code generation has completely transformed the way developers write programs today. Existing approaches to code generation have focused either on the Sequence-to-Sequence paradigm, which generates target code as a sequence of tokens, or the Sequence-to-Tree paradigm, which outputs code as a sequence of actions. While these two paradigms are intuitively complementary, their combination has not been previously explored. By comparing the code generated under these two paradigms, we find that integrating them holds significant potential. In this paper, we propose UniGenCoder for code-related generation tasks, which consists of a shared encoder, a shared decoder with a minimal set of additional parameters to unify two paradigms, and a selector that dynamically chooses optimal paradigm for each instance. Also, during the model training, we first perform the multi-task learning and distillation strategies to facilitate knowledge transfer between two paradigms, and then leverage contrastive learning to train the selector. Experimental results on the text-to-code and code-to-code generation tasks demonstrate the effectiveness of our proposed model. We release our code at https://github.com/DeepLearnXMU/UniGenCoder.
Chinese: 本文提出UniGenCoder模型,通过共享编码器和解码器结合动态选择器,首次统一了序列到序列与序列到树两种代码生成范式,并采用多任务学习和对比学习策略,在文本到代码和代码到代码任务中验证了其有效性。
English: This paper introduces UniGenCoder, a novel model that unifies the Sequence-to-Sequence and Sequence-to-Tree paradigms for code generation, employing multi-task learning, distillation, and contrastive learning to enhance performance across text-to-code and code-to-code tasks.

Authors:Xiang He, Dongcheng Zhao, Yiting Dong, Guobin Shen, Xin Yang, Yi Zeng
Title: Enhancing Audio-Visual Spiking Neural Networks through Semantic-Alignment and Cross-Modal Residual Learning
Abstract:
Humans interpret and perceive the world by integrating sensory information from multiple modalities, such as vision and hearing. Spiking Neural Networks (SNNs), as brain-inspired computational models, exhibit unique advantages in emulating the brain's information processing mechanisms. However, existing SNN models primarily focus on unimodal processing and lack efficient cross-modal information fusion, thereby limiting their effectiveness in real-world multimodal scenarios. To address this challenge, we propose a semantic-alignment cross-modal residual learning (S-CMRL) framework, a Transformer-based multimodal SNN architecture designed for effective audio-visual integration. S-CMRL leverages a spatiotemporal spiking attention mechanism to extract complementary features across modalities, and incorporates a cross-modal residual learning strategy to enhance feature integration. Additionally, a semantic alignment optimization mechanism is introduced to align cross-modal features within a shared semantic space, improving their consistency and complementarity. Extensive experiments on three benchmark datasets CREMA-D, UrbanSound8K-AV, and MNISTDVS-NTIDIGITS demonstrate that S-CMRL significantly outperforms existing multimodal SNN methods, achieving the state-of-the-art performance. The code is publicly available at https://github.com/Brain-Cog-Lab/S-CMRL.
中文:提出的S-CMRL框架采用基于Transformer的脉冲神经网络,通过跨模态残差学习和语义对齐机制有效整合视听信息,在多个基准数据集上实现了最先进的性能。
English: The proposed S-CMRL framework introduces a Transformer-based spiking neural network with cross-modal residual learning and semantic alignment to effectively integrate audio-visual information, achieving state-of-the-art performance on multiple benchmark datasets.

Authors:Xiaoqian Liu, Ke Wang, Yongbin Li, Yuchuan Wu, Wentao Ma, Aobo Kong, Fei Huang, Jianbin Jiao, Junge Zhang
Title: EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning
Abstract:
Large Language Models (LLMs) have shown impressive reasoning capabilities in well-defined problems with clear solutions, such as mathematics and coding. However, they still struggle with complex real-world scenarios like business negotiations, which require strategic reasoning-an ability to navigate dynamic environments and align long-term goals amidst uncertainty. Existing methods for strategic reasoning face challenges in adaptability, scalability, and transferring strategies to new contexts. To address these issues, we propose explicit policy optimization (EPO) for strategic reasoning, featuring an LLM that provides strategies in open-ended action space and can be plugged into arbitrary LLM agents to motivate goal-directed behavior. To improve adaptability and policy transferability, we train the strategic reasoning model via multi-turn reinforcement learning (RL),utilizing process rewards and iterative self-play. Experiments across social and physical domains demonstrate EPO's ability of long-term goal alignment through enhanced strategic reasoning, achieving state-of-the-art performance on social dialogue and web navigation tasks. Our findings reveal various collaborative reasoning mechanisms emergent in EPO and its effectiveness in generating novel strategies, underscoring its potential for strategic reasoning in real-world applications. Code and data are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/EPO.
中文摘要:提出的显式策略优化(EPO)方法通过多轮强化学习增强大语言模型的战略推理能力,使其在社交对话和网页导航等复杂现实场景中展现出卓越的适应性和顶尖性能。
English Summary: The proposed Explicit Policy Optimization (EPO) method enhances LLMs' strategic reasoning through multi-turn reinforcement learning, enabling superior adaptability and state-of-the-art performance in complex real-world scenarios like social dialogue and web navigation.

Authors:Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken
Title: EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking
Abstract:
As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs. Unlike prior code generation benchmarks, this task directly tests a model's ability to reason about program semantics. EquiBench consists of 2400 program pairs across four languages and six categories. These pairs are generated through program analysis, compiler scheduling, and superoptimization, ensuring high-confidence labels, nontrivial difficulty, and full automation. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline. Further analysis reveals that models often rely on syntactic similarity rather than exhibiting robust reasoning about program semantics, highlighting current limitations. Our code and dataset are publicly available at https://github.com/Anjiang-Wei/equibench
中文: EquiBench作为通过等价性检查评估大语言模型程序语义理解能力的新基准,揭示了模型常依赖语法相似性而非深层语义推理的局限性。
English: EquiBench is a novel benchmark that assesses large language models' understanding of program semantics through equivalence checking, revealing their limited reasoning capabilities as they often rely on syntactic cues rather than deep semantic analysis.

Authors:Lei Wang, Zheqing Zhang, Xu Chen
Title: Investigating and Extending Homans' Social Exchange Theory with Large Language Model based Agents
Abstract:
Homans' Social Exchange Theory (SET) is widely recognized as a basic framework for understanding the formation and emergence of human civilizations and social structures. In social science, this theory is typically studied based on simple simulation experiments or real-world human studies, both of which either lack realism or are too expensive to control. In artificial intelligence, recent advances in large language models (LLMs) have shown promising capabilities in simulating human behaviors. Inspired by these insights, we adopt an interdisciplinary research perspective and propose using LLM-based agents to study Homans' SET. Specifically, we construct a virtual society composed of three LLM agents and have them engage in a social exchange game to observe their behaviors. Through extensive experiments, we found that Homans' SET is well validated in our agent society, demonstrating the consistency between the agent and human behaviors. Building on this foundation, we intentionally alter the settings of the agent society to extend the traditional Homans' SET, making it more comprehensive and detailed. To the best of our knowledge, this paper marks the first step in studying Homans' SET with LLM-based agents. More importantly, it introduces a novel and feasible research paradigm that bridges the fields of social science and computer science through LLM-based agents. Code is available at https://github.com/Paitesanshi/SET.
中文摘要:本研究首次采用大语言模型智能体验证并拓展了霍曼斯社会交换理论,不仅证明了智能体与人类行为的一致性,更开创了连接社会科学与计算机科学的新型跨学科研究范式。
English Summary: This study pioneers the use of large language model (LLM) agents to validate and extend Homans' Social Exchange Theory, demonstrating behavioral consistency between artificial agents and humans while establishing a novel interdisciplinary research paradigm.

Authors:Gang Yang, Miao Wang, Quan Zhou, Jiangchuan Li
Title: YUNet: Improved YOLOv11 Network for Skyline Detection
Abstract:
Skyline detection plays an important role in geolocalizaion, flight control, visual navigation, port security, etc. The appearance of the sky and non-sky areas are variable, because of different weather or illumination environment, which brings challenges to skyline detection. In this research, we proposed the YUNet algorithm, which improved the YOLOv11 architecture to segment the sky region and extract the skyline in complicated and variable circumstances. To improve the ability of multi-scale and large range contextual feature fusion, the YOLOv11 architecture is extended as an UNet-like architecture, consisting of an encoder, neck and decoder submodule. The encoder extracts the multi-scale features from the given images. The neck makes fusion of these multi-scale features. The decoder applies the fused features to complete the prediction rebuilding. To validate the proposed approach, the YUNet was tested on Skyfinder and CH1 datasets for segmentation and skyline detection respectively. Our test shows that the IoU of YUnet segmentation can reach 0.9858, and the average error of YUnet skyline detection is just 1.36 pixels. The implementation is published at https://github.com/kuazhangxiaoai/SkylineDet-YOLOv11Seg.git.
中文: YUNet算法基于YOLOv11改进,采用类似UNet的结构,在复杂多变环境下能有效分割天空区域并提取天际线,分割精度高且检测误差极小。
English: The YUNet algorithm, an enhanced version of YOLOv11 with a UNet-like structure, effectively segments sky regions and detects skylines in diverse conditions, achieving high accuracy in segmentation and minimal error in detection.

Authors:Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Title: Multi-Attribute Steering of Language Models via Targeted Intervention
Abstract:
Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM's parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. MAT-Steer learns steering vectors using an alignment objective that shifts the model's internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two distinct settings: (i) on question answering (QA) tasks where we balance attributes like truthfulness, bias, and toxicity; (ii) on generative tasks where we simultaneously improve attributes like helpfulness, correctness, and coherence. MAT-Steer outperforms existing ITI and parameter-efficient fine-tuning approaches across both task types (e.g., 3% average accuracy gain across QA tasks and 55.82% win rate against the best ITI baseline).
中文: MAT-Steer是一种新颖的引导框架,通过稀疏正交的引导向量对多属性进行选择性干预以减少冲突,在问答和生成任务中均优于现有方法。
English: MAT-Steer is a novel framework that enables selective intervention on multiple attributes in large language models by learning sparse, orthogonal steering vectors to reduce conflicts, outperforming existing methods in both QA and generative tasks.

Authors:Ahmed F. AbouElhamayed, Jordan Dotzel, Yash Akhauri, Chi-Chih Chang, Sameh Gobriel, J. Pablo Muñoz, Vui Seng Chua, Nilesh Jain, Mohamed S. Abdelfattah
Title: SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs
Abstract:
Large language models have high compute, latency, and memory requirements. While specialized accelerators such as GPUs and TPUs typically run these workloads, CPUs are more widely available and consume less energy. Accelerating LLMs with CPUs enables broader AI access at a lower cost and power consumption. This acceleration potential for CPUs is especially relevant during the memory-bound decoding stage of LLM inference, which processes one token at a time and is becoming increasingly utilized with reasoning models. We utilize Advanced Matrix Extensions (AMX) support on the latest Intel CPUs together with unstructured sparsity to achieve a $1.42 \times$ reduction in end-to-end latency compared to the current PyTorch implementation by applying our technique in linear layers. We provide a set of open-source customized sparse kernels that can speed up any PyTorch model by automatically replacing all linear layers with our custom sparse implementation. Furthermore, we demonstrate for the first time the use of unstructured sparsity in the attention computation achieving a $1.14 \times$ speedup over the current systems without compromising accuracy. Code: https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SparAMX
中文: 本研究利用英特尔CPU的高级矩阵扩展和非结构化稀疏技术,在线性层和注意力计算中均实现了延迟显著降低,同时保持模型精度。
English: This work accelerates large language models on Intel CPUs using Advanced Matrix Extensions and unstructured sparsity, achieving significant latency reductions in both linear and attention layers while maintaining accuracy.

Authors:Jiaqi Wang, Yuhang Zhou, Zhixiong Zhang, Qiguang Chen, Yongqiang Chen, James Cheng
Title: DivIL: Unveiling and Addressing Over-Invariance for Out-of- Distribution Generalization
Abstract:
Out-of-distribution generalization is a common problem that expects the model to perform well in the different distributions even far from the train data. A popular approach to addressing this issue is invariant learning (IL), in which the model is compiled to focus on invariant features instead of spurious features by adding strong constraints during training. However, there are some potential pitfalls of strong invariant constraints. Due to the limited number of diverse environments and over-regularization in the feature space, it may lead to a loss of important details in the invariant features while alleviating the spurious correlations, namely the over-invariance, which can also degrade the generalization performance. We theoretically define the over-invariance and observe that this issue occurs in various classic IL methods. To alleviate this issue, we propose a simple approach Diverse Invariant Learning (DivIL) by adding the unsupervised contrastive learning and the random masking mechanism compensatory for the invariant constraints, which can be applied to various IL methods. Furthermore, we conduct experiments across multiple modalities across 12 datasets and 6 classic models, verifying our over-invariance insight and the effectiveness of our DivIL framework. Our code is available at https://github.com/kokolerk/DivIL.
中文: 该研究指出过度不变性是导致重要特征细节丢失并削弱泛化性能的潜在问题,提出了通过对比学习和随机掩码机制来补偿约束的DivIL框架,并在多个数据集和模型上验证了其有效性。
English: The study identifies over-invariance as a pitfall in invariant learning methods that can degrade generalization by causing loss of important feature details, and proposes the DivIL framework to mitigate this issue through contrastive learning and random masking, validated across multiple datasets and models.

Authors:Riting Xia, Huibo Liu, Anchen Li, Xueyan Liu, Yan Zhang, Chunxu Zhang, Bo Yang
Title: Incomplete Graph Learning: A Comprehensive Survey
Abstract:
Graph learning is a prevalent field that operates on ubiquitous graph data. Effective graph learning methods can extract valuable information from graphs. However, these methods are non-robust and affected by missing attributes in graphs, resulting in sub-optimal outcomes. This has led to the emergence of incomplete graph learning, which aims to process and learn from incomplete graphs to achieve more accurate and representative results. In this paper, we conducted a comprehensive review of the literature on incomplete graph learning. Initially, we categorize incomplete graphs and provide precise definitions of relevant concepts, terminologies, and techniques, thereby establishing a solid understanding for readers. Subsequently, we classify incomplete graph learning methods according to the types of incompleteness: (1) attribute-incomplete graph learning methods, (2) attribute-missing graph learning methods, and (3) hybrid-absent graph learning methods. By systematically classifying and summarizing incomplete graph learning methods, we highlight the commonalities and differences among existing approaches, aiding readers in selecting methods and laying the groundwork for further advancements. In addition, we summarize the datasets, incomplete processing modes, evaluation metrics, and application domains used by the current methods. Lastly, we discuss the current challenges and propose future directions for incomplete graph learning, with the aim of stimulating further innovations in this crucial field. To our knowledge, this is the first review dedicated to incomplete graph learning, aiming to offer valuable insights for researchers in related fields.We developed an online resource to follow relevant research based on this review, available at https://github.com/cherry-a11y/Incomplete-graph-learning.git
中文摘要:本文首次系统综述了不完整图学习领域,通过分类方法、总结应用并展望未来方向,为该领域研究提供重要参考。
English Summary: This paper presents the first comprehensive review of incomplete graph learning, categorizing methods by incompleteness types and discussing challenges to advance this emerging field.

Authors:Batu El, Deepro Choudhury, Pietro Liò, Chaitanya K. Joshi
Title: Towards Mechanistic Interpretability of Graph Transformers via Attention Graphs
Abstract:
We introduce Attention Graphs, a new tool for mechanistic interpretability of Graph Neural Networks (GNNs) and Graph Transformers based on the mathematical equivalence between message passing in GNNs and the self-attention mechanism in Transformers. Attention Graphs aggregate attention matrices across Transformer layers and heads to describe how information flows among input nodes. Through experiments on homophilous and heterophilous node classification tasks, we analyze Attention Graphs from a network science perspective and find that: (1) When Graph Transformers are allowed to learn the optimal graph structure using all-to-all attention among input nodes, the Attention Graphs learned by the model do not tend to correlate with the input/original graph structure; and (2) For heterophilous graphs, different Graph Transformer variants can achieve similar performance while utilising distinct information flow patterns. Open source code: https://github.com/batu-el/understanding-inductive-biases-of-gnns
中文:我们引入了注意力图作为图神经网络和图变换器的机制解释工具,通过聚合注意力矩阵分析信息流动,发现学习到的图结构与输入图相关性弱,且在异配性图中不同变换器变体可通过不同信息流模式实现相近性能。
English: Attention Graphs are introduced as a tool for interpreting Graph Neural Networks and Graph Transformers by analyzing aggregated attention matrices to reveal information flow, with findings showing learned structures often diverge from input graphs and diverse patterns yield similar performance in heterophilous tasks.

Authors:Petar Steinberg, Juliusz Ziomek, Matej Jusup, Ilija Bogunovic
Title: Mean-Field Bayesian Optimisation
Abstract:
We address the problem of optimising the average payoff for a large number of cooperating agents, where the payoff function is unknown and treated as a black box. While standard Bayesian Optimisation (BO) methods struggle with the scalability required for high-dimensional input spaces, we demonstrate how leveraging the mean-field assumption on the black-box function can transform BO into an efficient and scalable solution. Specifically, we introduce MF-GP-UCB, a novel efficient algorithm designed to optimise agent payoffs in this setting. Our theoretical analysis establishes a regret bound for MF-GP-UCB that is independent of the number of agents, contrasting sharply with the exponential dependence observed when naive BO methods are applied. We evaluate our algorithm on a diverse set of tasks, including real-world problems, such as optimising the location of public bikes for a bike-sharing programme, distributing taxi fleets, and selecting refuelling ports for maritime vessels. Empirical results demonstrate that MF-GP-UCB significantly outperforms existing benchmarks, offering substantial improvements in performance and scalability, constituting a promising solution for mean-field, black-box optimisation. The code is available at https://github.com/petarsteinberg/MF-BO.
中文: 我们提出MF-GP-UCB算法,通过利用平均场假设实现高维黑盒优化的可扩展性,在理论层面获得与智能体数量无关的遗憾界,并在共享单车调度等实际应用中显著超越现有基准方法。
English: We introduce MF-GP-UCB, a scalable Bayesian optimization algorithm that leverages mean-field assumptions to efficiently optimize agent payoffs in high-dimensional black-box settings, achieving regret bounds independent of agent count and outperforming benchmarks in real-world applications.

Authors:Yinghao Shuai, Ran Yu, Yuantao Chen, Zijian Jiang, Xiaowei Song, Nan Wang, Jv Zheng, Jianzhu Ma, Meng Yang, Zhicheng Wang, Wenbo Ding, Hao Zhao
Title: PUGS: Zero-shot Physical Understanding with Gaussian Splatting
Abstract:
Current robotic systems can understand the categories and poses of objects well. But understanding physical properties like mass, friction, and hardness, in the wild, remains challenging. We propose a new method that reconstructs 3D objects using the Gaussian splatting representation and predicts various physical properties in a zero-shot manner. We propose two techniques during the reconstruction phase: a geometry-aware regularization loss function to improve the shape quality and a region-aware feature contrastive loss function to promote region affinity. Two other new techniques are designed during inference: a feature-based property propagation module and a volume integration module tailored for the Gaussian representation. Our framework is named as zero-shot physical understanding with Gaussian splatting, or PUGS. PUGS achieves new state-of-the-art results on the standard benchmark of ABO-500 mass prediction. We provide extensive quantitative ablations and qualitative visualization to demonstrate the mechanism of our designs. We show the proposed methodology can help address challenging real-world grasping tasks. Our codes, data, and models are available at https://github.com/EverNorif/PUGS
中文:PUGS框架采用高斯溅射表示重建三维物体,并以零样本方式预测质量、摩擦等物理属性,在标准基准测试中创下最新记录,同时有效提升了机器人抓取任务的现实应用能力。
English: The proposed PUGS framework uses Gaussian splatting to reconstruct 3D objects and predict physical properties like mass and friction in a zero-shot manner, achieving state-of-the-art results and enhancing real-world robotic grasping tasks.

Authors:Jake Vasilakes, Chrysoula Zerva, Sophia Ananiadou
Title: Subjective Logic Encodings
Abstract:
Many existing approaches for learning from labeled data assume the existence of gold-standard labels. According to these approaches, inter-annotator disagreement is seen as noise to be removed, either through refinement of annotation guidelines, label adjudication, or label filtering. However, annotator disagreement can rarely be totally eradicated, especially on more subjective tasks such as sentiment analysis or hate speech detection where disagreement is natural. Therefore, a new approach to learning from labeled data, called data perspectivism, seeks to leverage inter-annotator disagreement to learn models that stay true to the inherent uncertainty of the task by treating annotations as opinions of the annotators, rather than gold-standard facts. Despite this conceptual grounding, existing methods under data perspectivism are limited to using disagreement as the sole source of annotation uncertainty. To expand the possibilities of data perspectivism, we introduce Subjective Logic Encodings (SLEs), a flexible framework for constructing classification targets that explicitly encodes annotations as opinions of the annotators. Based on Subjective Logic Theory, SLEs encode labels as Dirichlet distributions and provide principled methods for encoding and aggregating various types of annotation uncertainty -- annotator confidence, reliability, and disagreement -- into the targets. We show that SLEs are a generalization of other types of label encodings as well as how to estimate models to predict SLEs using a distribution matching objective.
中文: 数据透视主义将标注者分歧视为有价值信息而非噪声,而提出的主观逻辑编码框架通过将多种标注不确定性纳入分类目标,扩展了这一方法。
English: Data perspectivism treats annotator disagreement as valuable information rather than noise, and the proposed Subjective Logic Encodings framework expands this approach by incorporating multiple sources of annotation uncertainty into classification targets.

Authors:Jiayu Zhang, Zhiyu Zhu, Xinyi Wang, Silin Liao, Zhibo Jin, Flora D. Salim, Huaming Chen
Title: PAR-AdvGAN: Improving Adversarial Attack Capability with Progressive Auto-Regression AdvGAN
Abstract:
Deep neural networks have demonstrated remarkable performance across various domains. However, they are vulnerable to adversarial examples, which can lead to erroneous predictions. Generative Adversarial Networks (GANs) can leverage the generators and discriminators model to quickly produce high-quality adversarial examples. Since both modules train in a competitive and simultaneous manner, GAN-based algorithms like AdvGAN can generate adversarial examples with better transferability compared to traditional methods. However, the generation of perturbations is usually limited to a single iteration, preventing these examples from fully exploiting the potential of the methods. To tackle this issue, we introduce a novel approach named Progressive Auto-Regression AdvGAN (PAR-AdvGAN). It incorporates an auto-regressive iteration mechanism within a progressive generation network to craft adversarial examples with enhanced attack capability. We thoroughly evaluate our PAR-AdvGAN method with a large-scale experiment, demonstrating its superior performance over various state-of-the-art black-box adversarial attacks, as well as the original AdvGAN.Moreover, PAR-AdvGAN significantly accelerates the adversarial example generation, i.e., achieving the speeds of up to 335.5 frames per second on Inception-v3 model, outperforming the gradient-based transferable attack algorithms. Our code is available at: https://github.com/LMBTough/PAR
Chinese: 提出的PAR-AdvGAN方法在渐进式生成网络中引入自回归迭代机制,相比现有方法能生成攻击能力更强、速度更快的对抗样本。
English: The proposed PAR-AdvGAN method introduces an auto-regressive iteration mechanism within a progressive generation network to create adversarial examples with superior attack capability and faster generation speed compared to existing approaches.

Authors:Norman Mu, Jonathan Lu, Michael Lavery, David Wagner
Title: A Closer Look at System Prompt Robustness
Abstract:
System prompts have emerged as a critical control surface for specifying the behavior of LLMs in chat and agent settings. Developers depend on system prompts to specify important context, output format, personalities, guardrails, content policies, and safety countermeasures, all of which require models to robustly adhere to the system prompt, especially when facing conflicting or adversarial user inputs. In practice, models often forget to consider relevant guardrails or fail to resolve conflicting demands between the system and the user. In this work, we study various methods for improving system prompt robustness by creating realistic new evaluation and fine-tuning datasets based on prompts collected from from OpenAI's GPT Store and HuggingFace's HuggingChat. Our experiments assessing models with a panel of new and existing benchmarks show that performance can be considerably improved with realistic fine-tuning data, as well as inference-time interventions such as classifier-free guidance. Finally, we analyze the results of recently released reasoning models from OpenAI and DeepSeek, which show exciting but uneven improvements on the benchmarks we study. Overall, current techniques fall short of ensuring system prompt robustness and further study is warranted.
中文: 系统提示对于控制大语言模型行为至关重要,但模型在面临冲突输入时常常无法遵循,研究通过微调和推理干预提升了鲁棒性,虽取得进展但效果仍不稳定。
English: System prompts are essential for controlling LLM behavior, yet models often fail to adhere to them under conflicting inputs, prompting research into improved robustness through fine-tuning and inference interventions that show promising but inconsistent results.

Authors:Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan
Title: MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections
Abstract:
We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches with static and shared connection weights, MUDD generates connection weights dynamically depending on hidden states at each sequence position and for each decoupled input stream (the query, key, value or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, achieving the performance of Transformers trained with 1.8X-2.4X compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining ppl and downstream tasks and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code in JAX and PyTorch and pre-trained models are available at https://github.com/Caiyun-AI/MUDDFormer .
中文:MUDD连接提出了一种动态方法,通过生成位置特定和输入流依赖的权重来增强Transformer中的跨层信息流,以极少的参数和计算量显著提升了语言建模性能。
English: MUDD connections introduce a dynamic method to enhance cross-layer information flow in Transformers by generating position-specific and input-stream-dependent weights, significantly improving performance in language modeling with minimal added parameters and computation.

Authors:Zhicong Tang, Jianmin Bao, Dong Chen, Baining Guo
Title: Diffusion Models without Classifier-free Guidance
Abstract:
This paper presents Model-guidance (MG), a novel objective for training diffusion model that addresses and removes of the commonly used Classifier-free guidance (CFG). Our innovative approach transcends the standard modeling of solely data distribution to incorporating the posterior probability of conditions. The proposed technique originates from the idea of CFG and is easy yet effective, making it a plug-and-play module for existing models. Our method significantly accelerates the training process, doubles the inference speed, and achieve exceptional quality that parallel and even surpass concurrent diffusion models with CFG. Extensive experiments demonstrate the effectiveness, efficiency, scalability on different models and datasets. Finally, we establish state-of-the-art performance on ImageNet 256 benchmarks with an FID of 1.34. Our code is available at https://github.com/tzco/Diffusion-wo-CFG.
中文: 本文提出模型引导(MG)这一创新目标,通过引入条件后验概率取代了常用的无分类器引导(CFG),在加速训练和推理的同时实现了更优的图像生成质量,并在ImageNet 256基准测试中取得了当前最佳性能。
English: This paper introduces Model-guidance (MG), a novel training objective for diffusion models that replaces Classifier-free guidance (CFG) by incorporating conditional posterior probabilities, achieving faster training, doubled inference speed, and superior image quality with state-of-the-art results on benchmarks.

Authors:Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, Zhuang Liu
Title: Idiosyncrasies in Large Language Models
Abstract:
In this work, we unveil and study idiosyncrasies in Large Language Models (LLMs) -- unique patterns in their outputs that can be used to distinguish the models. To do so, we consider a simple classification task: given a particular text output, the objective is to predict the source LLM that generates the text. We evaluate this synthetic task across various groups of LLMs and find that simply fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on held-out validation data in the five-way classification problem involving ChatGPT, Claude, Grok, Gemini, and DeepSeek. Our further investigation reveals that these idiosyncrasies are rooted in word-level distributions. These patterns persist even when the texts are rewritten, translated, or summarized by an external LLM, suggesting that they are also encoded in the semantic content. Additionally, we leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies. Finally, we discuss the broader implications of our findings, including training on synthetic data, inferring model similarity, and robust evaluation of LLMs. Code is available at https://github.com/locuslab/llm-idiosyncrasies.
中文: 本研究揭示了大型语言模型输出中的独特模式,能够准确识别生成来源,并发现这些特征在文本改写后依然存在且编码于语义内容中。
English: This study identifies unique patterns in Large Language Models' outputs that enable accurate source model classification, revealing these idiosyncrasies persist through text modifications and are embedded in semantic content.

Authors:Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui
Title: HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
Abstract:
The remarkable success of the autoregressive paradigm has made significant advancement in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take the homologous data as input to curate homologous preference data of both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: https://github.com/Gen-Verse/HermesFlow
Chinese: 本研究提出了HermesFlow框架,通过同源偏好数据和迭代优化,有效弥合了多模态大语言模型中理解与生成能力之间的差距,在统一这两种能力方面展现出卓越性能。
English: The study introduces HermesFlow, a framework that effectively bridges the gap between understanding and generation in Multimodal Large Language Models using homologous preference data and iterative optimization, demonstrating superior performance in aligning these capabilities.

Authors:Ye Tian, Ling Yang, Xinchen Zhang, Yunhai Tong, Mengdi Wang, Bin Cui
Title: Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening
Abstract:
We propose Diffusion-Sharpening, a fine-tuning approach that enhances downstream alignment by optimizing sampling trajectories. Existing RL-based fine-tuning methods focus on single training timesteps and neglect trajectory-level alignment, while recent sampling trajectory optimization methods incur significant inference NFE costs. Diffusion-Sharpening overcomes this by using a path integral framework to select optimal trajectories during training, leveraging reward feedback, and amortizing inference costs. Our method demonstrates superior training efficiency with faster convergence, and best inference efficiency without requiring additional NFEs. Extensive experiments show that Diffusion-Sharpening outperforms RL-based fine-tuning methods (e.g., Diffusion-DPO) and sampling trajectory optimization methods (e.g., Inference Scaling) across diverse metrics including text alignment, compositional capabilities, and human preferences, offering a scalable and efficient solution for future diffusion model fine-tuning. Code: https://github.com/Gen-Verse/Diffusion-Sharpening
中文: Diffusion-Sharpening 是一种基于路径积分框架优化采样轨迹的微调方法,在提升对齐效果和效率的同时,无需额外计算成本,显著优于现有方法。
English: Diffusion-Sharpening is a fine-tuning method that optimizes sampling trajectories using a path integral framework to enhance alignment and efficiency, outperforming existing approaches in both training convergence and inference without extra computational cost.

Authors:Jinyan Su, Jennifer Healey, Preslav Nakov, Claire Cardie
Title: Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control
Abstract:
Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to mitigate large language model (LLM) hallucinations by incorporating external knowledge retrieval. However, existing RAG frameworks often apply retrieval indiscriminately,leading to inefficiencies-over-retrieving when unnecessary or failing to retrieve iteratively when required for complex reasoning. Recent adaptive retrieval strategies, though adaptively navigates these retrieval strategies, predict only based on query complexity and lacks user-driven flexibility, making them infeasible for diverse user application needs. In this paper, we introduce a novel user-controllable RAG framework that enables dynamic adjustment of the accuracy-cost trade-off. Our approach leverages two classifiers: one trained to prioritize accuracy and another to prioritize retrieval efficiency. Via an interpretable control parameter $α$, users can seamlessly navigate between minimal-cost retrieval and high-accuracy retrieval based on their specific requirements. We empirically demonstrate that our approach effectively balances accuracy, retrieval cost, and user controllability, making it a practical and adaptable solution for real-world applications. Code is available at https://github.com/JinyanSu1/Flare-Aug.
中文: 本文提出了一种用户可控的RAG框架,通过可解释参数动态平衡精度与检索成本,为实际应用提供了灵活的解决方案。
English: This paper introduces a user-controllable RAG framework that dynamically balances accuracy and retrieval costs through an interpretable parameter, offering a practical solution for real-world applications.

Authors:Robert Reischke
Title: pylevin: efficient numerical integration of integrals containing up to three Bessel functions
Abstract:
Integrals involving highly oscillatory Bessel functions are notoriously challenging to compute using conventional integration techniques. While several methods are available, they predominantly cater to integrals with at most a single Bessel function, resulting in specialised yet highly optimised solutions. Here we present pylevin, a Python package to efficiently compute integrals containing up to three Bessel functions of arbitrary order and arguments. The implementation makes use of Levin's method and allows for accurate and fast integration of these highly oscillatory integrals. In benchmarking pylevin against existing software for single Bessel function integrals, we find its speed comparable, usually within a factor of two, to specialised packages such as FFTLog. Furthermore, when dealing with integrals containing two or three Bessel functions, pylevin delivers performance up to four orders of magnitude faster than standard adaptive quadrature methods, while also exhibiting better stability for large Bessel function arguments. pylevin is available from source via github or directly from PyPi.
中文:pylevin Python 包采用莱文方法高效计算包含最多三个贝塞尔函数的高振荡积分,在单函数积分中与专用工具速度相当,在多函数积分中比标准方法快上万倍且稳定性更优。
English: The pylevin Python package efficiently computes highly oscillatory integrals with up to three Bessel functions using Levin's method, achieving comparable speed to specialized tools for single functions and dramatically outperforming standard methods for multiple functions.

Authors:Sayantan Adak, Pauras Mangesh Meher, Paramita Das, Animesh Mukherjee
Title: REVERSUM: A Multi-staged Retrieval-Augmented Generation Method to Enhance Wikipedia Tail Biographies through Personal Narratives
Abstract:
Wikipedia is an invaluable resource for factual information about a wide range of entities. However, the quality of articles on less-known entities often lags behind that of the well-known ones. This study proposes a novel approach to enhancing Wikipedia's B and C category biography articles by leveraging personal narratives such as autobiographies and biographies. By utilizing a multi-staged retrieval-augmented generation technique -- REVerSum -- we aim to enrich the informational content of these lesser-known articles. Our study reveals that personal narratives can significantly improve the quality of Wikipedia articles, providing a rich source of reliable information that has been underutilized in previous studies. Based on crowd-based evaluation, REVerSum generated content outperforms the best performing baseline by 17% in terms of integrability to the original Wikipedia article and 28.5\% in terms of informativeness. Code and Data are available at: https://github.com/sayantan11995/wikipedia_enrichment
中文: 本研究提出REVerSum方法,通过整合个人叙事来丰富维基百科B类和C类人物传记条目,使其可整合性提升17%,信息量增加28.5%,显著优于现有基线。
English: This study introduces REVerSum, a retrieval-augmented generation method that enhances Wikipedia's B and C category biography articles by incorporating personal narratives, significantly improving their integrability by 17% and informativeness by 28.5% over baselines.

Authors:Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
Title: SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs
Abstract:
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks by generating intermediate reasoning steps. However, most existing approaches focus on hard token decoding, which constrains reasoning within the discrete vocabulary space and may not always be optimal. While recent efforts explore continuous-space reasoning, they often require full-model fine-tuning and suffer from catastrophic forgetting, limiting their applicability to state-of-the-art LLMs that already perform well in zero-shot settings with a proper instruction. To address this challenge, we propose a novel approach for continuous-space reasoning that does not require modifying the LLM. Specifically, we employ a lightweight fixed assistant model to speculatively generate instance-specific soft thought tokens as the initial chain of thoughts, which are then mapped into the LLM's representation space via a trainable projection module. Experimental results on five reasoning benchmarks demonstrate that our method enhances LLM reasoning performance through supervised, parameter-efficient fine-tuning. Source code is available at https://github.com/xuyige/SoftCoT.
中文: 提出的SoftCoT方法通过轻量级助手生成连续的软思考标记,并将其映射到大型语言模型的表示空间中,无需修改模型即可通过高效微调提升推理性能。
English: The proposed SoftCoT method enhances large language models' reasoning by using a lightweight assistant to generate continuous soft thought tokens, which are then projected into the model's space for efficient fine-tuning without full-model modifications.

Authors:Florian Sestak, Artur Toshev, Andreas Fürst, Günter Klambauer, Andreas Mayr, Johannes Brandstetter
Title: LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities
Abstract:
Generative models are spearheading recent progress in deep learning, showcasing strong promise for trajectory sampling in dynamical systems as well. However, whereas latent space modeling paradigms have transformed image and video generation, similar approaches are more difficult for most dynamical systems. Such systems -- from chemical molecule structures to collective human behavior -- are described by interactions of entities, making them inherently linked to connectivity patterns, entity conservation, and the traceability of entities over time. Our approach, LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked Entities), bridges the gap between: (1) keeping the traceability of individual entities in a latent system representation, and (2) leveraging the efficiency and scalability of recent advances in image and video generation, where pre-trained encoder and decoder enable generative modeling directly in latent space. The core idea of LaM-SLidE is the introduction of identifier representations (IDs) that enable the retrieval of entity properties and entity composition from latent system representations, thus fostering traceability. Experimentally, across different domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy, and generalizability. Code is available at https://github.com/ml-jku/LaM-SLidE .
中文: LaM-SLidE通过引入标识符表示,在潜在空间中实现可追踪的实体建模,弥合了动态系统需求与图像/视频生成技术效率之间的差距,并在多个领域展现出卓越的速度、准确性和泛化能力。
English: LaM-SLidE introduces identifier representations to enable traceable entity modeling in latent space, bridging the gap between dynamical system requirements and the efficiency of image/video generation techniques while demonstrating superior speed, accuracy, and generalizability across domains.

Authors:Samuel Miserendino, Michele Wang, Tejal Patwardhan, Johannes Heidecke
Title: SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
Abstract:
We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at \$1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks--ranging from \$50 bug fixes to \$32,000 feature implementations--and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split, SWE-Lancer Diamond (https://github.com/openai/SWELancer-Benchmark). By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.
中文: SWE-Lancer是一个包含1400多个自由职业软件工程任务的基准,总价值100万美元,用于评估AI模型在技术和管理方面的能力,目前模型仍难以解决大部分任务,并开源资源以促进未来研究。
English: SWE-Lancer is a benchmark of over 1,400 freelance software engineering tasks valued at $1 million, evaluating both technical and managerial capabilities of AI models, which currently struggle to solve most tasks, with open-source resources provided for future research.

Authors:Wujiang Xu, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, Yongfeng Zhang
Title: A-MEM: Agentic Memory for LLM Agents
Abstract:
While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution - as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show superior improvement against existing SOTA baselines. The source code for evaluating performance is available at https://github.com/WujiangXu/A-mem, while the source code of the agentic memory system is available at https://github.com/WujiangXu/A-mem-sys.
中文: 本文提出了一种基于Zettelkasten方法的智能记忆系统,通过动态索引和链接构建互联知识网络,使LLM代理能够实现记忆的持续演进,在实验中展现出优于现有方法的性能。
English: This paper introduces an agentic memory system for LLM agents that dynamically organizes and interconnects memories using Zettelkasten principles, enabling continuous evolution and superior performance over existing methods.

Authors:Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang
Title: A-MEM: Agentic Memory for LLM Agents
Abstract:
While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution - as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show superior improvement against existing SOTA baselines. The source code for evaluating performance is available at https://github.com/WujiangXu/A-mem, while the source code of the agentic memory system is available at https://github.com/WujiangXu/A-mem-sys.
中文: 本文提出了一种基于Zettelkasten方法的智能记忆系统,通过动态索引和链接构建互联知识网络,使LLM代理能够实现记忆的持续演进,在实验中展现出优于现有方法的性能。
English: This paper introduces an agentic memory system for LLM agents that dynamically organizes and interconnects memories using Zettelkasten principles, enabling continuous evolution and superior performance over existing methods.

Authors:Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
Title: APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
Abstract:
While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate attention mechanisms, still fall short of delivering optimal inference efficiency. This hinders scaling the inputs to longer sequences and processing long-context queries in a timely manner. To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by reducing compute and enhancing parallelism simultaneously. APB introduces a communication mechanism for essential key-value pairs within a sequence parallelism framework, enabling a faster inference speed while maintaining task performance. We implement APB by incorporating a tailored FlashAttn kernel alongside optimized distribution strategies, supporting diverse models and parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively, without any observable task performance degradation. We provide the implementation and experiment code of APB in https://github.com/thunlp/APB.
中文: APB是一种高效的长上下文推理框架,通过多主机近似注意力和优化并行机制显著提升预填充速度,在保持任务性能的同时实现高达9.2倍的加速效果。
English: APB is an efficient long-context inference framework that accelerates prefill speed through multi-host approximate attention and optimized parallelism, achieving up to 9.2x faster inference without performance loss.

Authors:Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, Wenjie Li
Title: TokenSkip: Controllable Chain-of-Thought Compression in LLMs
Abstract:
Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning capabilities of large language models (LLMs). Recent advancements, such as OpenAI's o1 and DeepSeek-R1, suggest that scaling up the length of CoT sequences during inference could further boost LLM reasoning performance. However, due to the autoregressive nature of LLM decoding, longer CoT outputs lead to a linear increase in inference latency, adversely affecting user experience, particularly when the CoT exceeds 10,000 tokens. To address this limitation, we analyze the semantic importance of tokens within CoT outputs and reveal that their contributions to reasoning vary. Building on this insight, we propose TokenSkip, a simple yet effective approach that enables LLMs to selectively skip less important tokens, allowing for controllable CoT compression. Extensive experiments across various models and tasks demonstrate the effectiveness of TokenSkip in reducing CoT token usage while preserving strong reasoning performance. Notably, when applied to Qwen2.5-14B-Instruct, TokenSkip reduces reasoning tokens by 40% (from 313 to 181) on GSM8K, with less than a 0.4% performance drop. We release our code and checkpoints in https://github.com/hemingkx/TokenSkip.
中文: TokenSkip 是一种创新方法,通过选择性跳过推理链中重要性较低的标记来实现可控压缩,在保持各类大语言模型任务性能的同时显著降低推理延迟。
English: TokenSkip is an innovative method that selectively compresses Chain-of-Thought sequences by skipping less important tokens, significantly reducing inference latency while maintaining reasoning performance across various LLMs and tasks.

Authors:Jiayang Zhang, Xianyuan Liu, Wei Wu, Sina Tabakhi, Wenrui Fan, Shuo Zhou, Kang Lan Tee, Tuck Seng Wong, Haiping Lu
Title: Classifying the Stoichiometry of Virus-like Particles with Interpretable Machine Learning
Abstract:
Virus-like particles (VLPs) are valuable for vaccine development due to their immune-triggering properties. Understanding their stoichiometry, the number of protein subunits to form a VLP, is critical for vaccine optimisation. However, current experimental methods to determine stoichiometry are time-consuming and require highly purified proteins. To efficiently classify stoichiometry classes in proteins, we curate a new dataset and propose an interpretable, data-driven pipeline leveraging linear machine learning models. We also explore the impact of feature encoding on model performance and interpretability, as well as methods to identify key protein sequence features influencing classification. The evaluation of our pipeline demonstrates that it can classify stoichiometry while revealing protein features that possibly influence VLP assembly. The data and code used in this work are publicly available at https://github.com/Shef-AIRE/StoicIML.
Chinese: 本研究提出了一种基于可解释机器学习的流程,用于高效分类病毒样颗粒(VLPs)的化学计量比,这对疫苗优化至关重要,同时识别出影响VLPs组装的关键蛋白质特征。
English: This study introduces an interpretable, machine learning-based pipeline for efficiently classifying the stoichiometry of virus-like particles (VLPs), which is crucial for vaccine optimization, while also identifying key protein features influencing VLP assembly.

Authors:Qi Zhao, Hongyu Yang, Qi Song, Xinwei Yao, Xiangyang Li
Title: KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in various complex tasks, yet they still suffer from hallucinations. By incorporating and exploring external knowledge, such as knowledge graphs(KGs), LLM's ability to provide factual answers has been enhanced. This approach carries significant practical implications. However, existing methods suffer from three key limitations: insufficient mining of LLMs' internal knowledge, constrained generation of interpretable reasoning paths, and unclear fusion of internal and external knowledge. Therefore, we propose KnowPath, a knowledge-enhanced large model framework driven by the collaboration of internal and external knowledge. It relies on the internal knowledge of the LLM to guide the exploration of interpretable directed subgraphs in external knowledge graphs, better integrating the two knowledge sources for more accurate reasoning. Extensive experiments on multiple real-world datasets demonstrate the effectiveness of KnowPath. Our code and data are available at https://github.com/tize-72/KnowPath.
Chinese Summary: 通过整合外部知识图谱,大语言模型能够减少幻觉并提高事实准确性,KnowPath框架通过有效结合内外知识实现更优推理,实验证明了其有效性。
English Summary: Large language models can be enhanced by integrating external knowledge graphs to reduce hallucinations and improve factual accuracy, as demonstrated by the proposed KnowPath framework which effectively combines internal and external knowledge for better reasoning.

Authors:Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo
Title: Atom of Thoughts for Markov LLM Test-Time Scaling
Abstract:
Large Language Models (LLMs) achieve superior performance through training-time scaling, and test-time scaling further enhances their capabilities by conducting effective reasoning during inference. However, as the scale of reasoning increases, existing test-time scaling methods suffer from accumulated historical information, which not only wastes computational resources but also interferes with effective reasoning. To address this issue, we observe that complex reasoning can be achieved by solving a series of independent and self-contained subquestions. These subquestions are essentially \textit{atomic questions}, exhibiting the memoryless property similar to Markov processes. Based on this observation, we propose Atom of Thoughts (\our), where each state transition consists of decomposing the current question into a dependency-based directed acyclic graph and contracting its subquestions, forming a simplified question that maintains answer equivalence with the original problem. This answer preservation enables the iterative \textit{decomposition-contraction} process to naturally form a meaningful Markov reasoning process. Furthermore, these atomic states can be seamlessly integrated into existing test-time scaling methods, enabling \our to serve as a plug-in enhancement for improving reasoning capabilities. Experiments across six benchmarks demonstrate the effectiveness of \our both as a standalone framework and a plug-in enhancement. Notably, on HotpotQA, when applied to gpt-4o-mini, \our achieves an \textbf{80.6\%} F1 score, surpassing o3-mini by \textbf{3.4\%} and DeepSeek-R1 by \textbf{10.6\%}. The code is available at \href{https://github.com/qixucen/atom}{https://github.com/qixucen/atom}.
Chinese: 提出的Atom of Thoughts (AoT)方法通过将复杂问题分解为原子子问题并压缩为简化形式,有效提升大语言模型的推理能力,在多个基准测试中作为独立框架和插件增强均表现出卓越性能。
English: The proposed Atom of Thoughts (AoT) method enhances reasoning in Large Language Models by decomposing complex questions into atomic subquestions and contracting them into simplified forms, achieving superior performance across benchmarks as both a standalone framework and plug-in enhancement.

Authors:Yinan Chen, Jiangning Zhang, Yali Bi, Xiaobin Hu, Teng Hu, Zhucun Xue, Ran Yi, Yong Liu, Ying Tai
Title: Image Inversion: A Survey from GANs to Diffusion and Beyond
Abstract:
Image inversion is a fundamental task in generative models, aiming to map images back to their latent representations to enable downstream applications such as editing, restoration, and style transfer. This paper provides a comprehensive review of the latest advancements in image inversion techniques, focusing on two main paradigms: Generative Adversarial Network (GAN) inversion and diffusion model inversion. We categorize these techniques based on their optimization methods. For GAN inversion, we systematically classify existing methods into encoder-based approaches, latent optimization approaches, and hybrid approaches, analyzing their theoretical foundations, technical innovations, and practical trade-offs. For diffusion model inversion, we explore training-free strategies, fine-tuning methods, and the design of additional trainable modules, highlighting their unique advantages and limitations. Additionally, we discuss several popular downstream applications and emerging applications beyond image tasks, identifying current challenges and future research directions. By synthesizing the latest developments, this paper aims to provide researchers and practitioners with a valuable reference resource, promoting further advancements in the field of image inversion. We keep track of the latest works at https://github.com/RyanChenYN/ImageInversion
中文: 本文系统综述了图像反转技术的最新进展,重点对GAN反转和扩散模型反转的方法进行分类分析,探讨其应用与挑战,旨在为研究者提供全面的参考资源。
English: This paper offers a comprehensive review of image inversion techniques, categorizing GAN and diffusion model methods by optimization approaches and analyzing their applications, challenges, and future directions to serve as a key reference for researchers.

Authors:Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, Jianchang Wu, Jiangjie Zhen, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Hongyuan Wang, Kang An, Wei Ji, Wen Li, Xuan Wen, Xiangwen Kong, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Junjing Guo, Jiashuai Liu, Jiahong Liu, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Liang Zhao, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingliang Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Ran Sun, Shuai Shuai, Shaoliang Pang, Shiliang Yang, Shuli Gao, Shanshan Yuan, Siqi Liu, Shihong Deng, Shilei Jiang, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wuxun Xie, Weipeng Ming, Wenqing He, Wen Sun, Xin Han, Xin Huang, Xiaomin Deng, Xiaojia Liu, Xin Wu, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yangzhen Ma, Yanming Xu, Yaoyu Wang, Yaqiang Shi, Yilei Wang, Yizhuang Zhou, Yinmin Zhong, Yang Zhang, Yaoben Wei, Yu Luo, Yuanwei Lu, Yuhe Yin, Yuchu Luo, Yuanhao Ding, Yuting Yan, Yaqi Dai, Yuxiang Yang, Zhe Xie, Zheng Ge, Zheng Sun, Zhewei Huang, Zhichao Chang, Zhisheng Guan, Zidong Yang, Zili Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu
Title: Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Abstract:
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, shows 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
中文: 本文提出首个生产就绪的开源方案Step-Audio,通过1300亿参数统一语音文本模型、生成式数据引擎、动态控制系统和增强认知架构,在人工评估中实现最优性能,显著提升开源多模态技术发展。
English: This paper introduces Step-Audio, a production-ready open-source solution featuring a unified 130B-parameter speech-text model, generative data engine, dynamic control system, and enhanced cognitive architecture that achieves state-of-the-art performance in human evaluations.

Authors:Theresia Veronika Rampisela, Tuukka Ruotsalo, Maria Maistro, Christina Lioma
Title: Joint Evaluation of Fairness and Relevance in Recommender Systems with Pareto Frontier
Abstract:
Fairness and relevance are two important aspects of recommender systems (RSs). Typically, they are evaluated either (i) separately by individual measures of fairness and relevance, or (ii) jointly using a single measure that accounts for fairness with respect to relevance. However, approach (i) often does not provide a reliable joint estimate of the goodness of the models, as it has two different best models: one for fairness and another for relevance. Approach (ii) is also problematic because these measures tend to be ad-hoc and do not relate well to traditional relevance measures, like NDCG. Motivated by this, we present a new approach for jointly evaluating fairness and relevance in RSs: Distance to Pareto Frontier (DPFR). Given some user-item interaction data, we compute their Pareto frontier for a pair of existing relevance and fairness measures, and then use the distance from the frontier as a measure of the jointly achievable fairness and relevance. Our approach is modular and intuitive as it can be computed with existing measures. Experiments with 4 RS models, 3 re-ranking strategies, and 6 datasets show that existing metrics have inconsistent associations with our Pareto-optimal solution, making DPFR a more robust and theoretically well-founded joint measure for assessing fairness and relevance. Our code: https://github.com/theresiavr/DPFR-recsys-evaluation
中文摘要:本文提出了一种名为“至帕累托前沿距离”的新联合评估方法,通过计算与现有公平性和相关性指标的帕累托前沿距离,为推荐系统提供了更稳健的双重评估框架。
English Summary: This paper introduces a novel joint evaluation method called Distance to Pareto Frontier (DPFR) that robustly measures both fairness and relevance in recommender systems by calculating proximity to their Pareto frontier using existing metrics.

Authors:Xuefeng Li, Haoyang Zou, Pengfei Liu
Title: LIMR: Less is More for RL Scaling
Abstract:
In this paper, we ask: what truly determines the effectiveness of RL training data for enhancing language models' reasoning capabilities? While recent advances like o1, Deepseek R1, and Kimi1.5 demonstrate RL's potential, the lack of transparency about training data requirements has hindered systematic progress. Starting directly from base models without distillation, we challenge the assumption that scaling up RL training data inherently improves performance. we demonstrate that a strategically selected subset of just 1,389 samples can outperform the full 8,523-sample dataset. We introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples based on their alignment with model learning trajectories, enabling efficient resource utilization and scalable implementation. Our method achieves comparable or even superior performance using only 1,389 samples versus the full 8,523 samples dataset. Notably, while recent data-efficient approaches (e.g., LIMO and s1) show promise with 32B-scale models, we find it significantly underperforms at 7B-scale through supervised fine-tuning (SFT). In contrast, our RL-based LIMR achieves 16.7% higher accuracy on AIME24 and outperforms LIMO and s1 by 13.0% and 22.2% on MATH500. These results fundamentally reshape our understanding of RL scaling in LLMs, demonstrating that precise sample selection, rather than data scale, may be the key to unlocking enhanced reasoning capabilities. For reproducible research and future innovation, we are open-sourcing LIMR, including implementation of LIM, training and evaluation code, curated datasets, and trained models at https://github.com/GAIR-NLP/LIMR.
中文: 本研究表明,通过"学习影响度量"方法进行策略性样本选择比单纯扩大数据规模更能提升语言模型的推理能力,仅用1,389个样本就超越了完整数据集的性能表现。
English: This study reveals that strategic sample selection through Learning Impact Measurement (LIM) is more crucial than data volume for enhancing language models' reasoning, achieving superior performance with only 1,389 samples compared to full datasets.

Authors:Chen Xu, Zhirui Deng, Clara Rus, Xiaopeng Ye, Yuanna Liu, Jun Xu, Zhicheng Dou, Ji-Rong Wen, Maarten de Rijke
Title: FairDiverse: A Comprehensive Toolkit for Fair and Diverse Information Retrieval Algorithms
Abstract:
In modern information retrieval (IR). achieving more than just accuracy is essential to sustaining a healthy ecosystem, especially when addressing fairness and diversity considerations. To meet these needs, various datasets, algorithms, and evaluation frameworks have been introduced. However, these algorithms are often tested across diverse metrics, datasets, and experimental setups, leading to inconsistencies and difficulties in direct comparisons. This highlights the need for a comprehensive IR toolkit that enables standardized evaluation of fairness- and diversity-aware algorithms across different IR tasks. To address this challenge, we present FairDiverse, an open-source and standardized toolkit. FairDiverse offers a framework for integrating fair and diverse methods, including pre-processing, in-processing, and post-processing techniques, at different stages of the IR pipeline. The toolkit supports the evaluation of 28 fairness and diversity algorithms across 16 base models, covering two core IR tasks (search and recommendation) thereby establishing a comprehensive benchmark. Moreover, FairDiverse is highly extensible, providing multiple APIs that empower IR researchers to swiftly develop and evaluate their own fairness and diversity aware models, while ensuring fair comparisons with existing baselines. The project is open-sourced and available on https://github.com/XuChen0427/FairDiverse.
中文: 该摘要介绍了FairDiverse这一开源工具包,旨在标准化信息检索中公平性与多样性算法的评估,解决现有测试方法的不一致问题,并支持多种检索任务。
English: This abstract introduces FairDiverse, an open-source toolkit designed to standardize the evaluation of fairness and diversity algorithms in information retrieval, addressing inconsistencies in current testing methods and supporting multiple IR tasks.

Authors:Shao Zhang, Xihuai Wang, Wenhao Zhang, Chaoran Li, Junru Song, Tingyu Li, Lin Qiu, Xuezhi Cao, Xunliang Cai, Wen Yao, Weinan Zhang, Xinbing Wang, Ying Wen
Title: Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration
Abstract:
Agents built on large language models (LLMs) have excelled in turn-by-turn human-AI collaboration but struggle with simultaneous tasks requiring real-time interaction. Latency issues and the challenge of inferring variable human strategies hinder their ability to make autonomous decisions without explicit instructions. Through experiments with current independent System 1 and System 2 methods, we validate the necessity of using Dual Process Theory (DPT) in real-time tasks. We propose DPT-Agent, a novel language agent framework that integrates System 1 and System 2 for efficient real-time simultaneous human-AI collaboration. DPT-Agent's System 1 uses a Finite-state Machine (FSM) and code-as-policy for fast, intuitive, and controllable decision-making. DPT-Agent's System 2 integrates Theory of Mind (ToM) and asynchronous reflection to infer human intentions and perform reasoning-based autonomous decisions. We demonstrate the effectiveness of DPT-Agent through further experiments with rule-based agents and human collaborators, showing significant improvements over mainstream LLM-based frameworks. DPT-Agent can effectively help LLMs convert correct slow thinking and reasoning into executable actions, thereby improving performance. To the best of our knowledge, DPT-Agent is the first language agent framework that achieves successful real-time simultaneous human-AI collaboration autonomously. Code of DPT-Agent can be found in https://github.com/sjtu-marl/DPT-Agent.
中文摘要:DPT-Agent是一种新型语言智能体框架,通过整合快速反应的系统1和基于推理的系统2,实现了自主的实时人机协作,克服了当前基于大语言模型的智能体在延迟和策略推断方面的局限。
English Summary: DPT-Agent is a novel language agent framework that integrates fast System 1 and reasoning-based System 2 processes to enable autonomous real-time human-AI collaboration, overcoming latency and strategy inference limitations of current LLM-based agents.

Authors:Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei
Title: Bitnet.cpp: Efficient Edge Inference for Ternary LLMs
Abstract:
The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference, both enabling high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, we expand TL to element-wise lookup table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper , offering a sophisticated solution for the efficient and practical deployment of edge LLMs.
中文摘要:Bitnet.cpp为三元大语言模型推出了优化的推理系统,采用创新的混合精度矩阵乘法库,在实现无损边缘部署的同时,相比全精度基线最高可提升6.25倍推理速度。
English Summary: Bitnet.cpp introduces an optimized inference system for ternary large language models, featuring a novel mixed-precision matrix multiplication library that achieves up to 6.25x speed improvement over full-precision baselines while enabling lossless edge deployment.

Authors:Hanbin Wang, Xiaoxuan Zhou, Zhipeng Xu, Keyuan Cheng, Yuxin Zuo, Kai Tian, Jingwei Song, Junting Lu, Wenhui Hu, Xueyang Liu
Title: Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities
Abstract:
This paper introduces Code-Vision, a benchmark designed to evaluate the logical understanding and code generation capabilities of Multimodal Large Language Models (MLLMs). It challenges MLLMs to generate a correct program that fulfills specific functionality requirements based on a given flowchart, which visually represents the desired algorithm or process. Code-Vision comprises three subsets: HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' coding abilities across basic programming, algorithmic, and mathematical problem-solving domains. Our experiments evaluate 12 MLLMs on Code-Vision. Experimental results demonstrate that there is a large performance difference between proprietary and open-source models. On Hard problems, GPT-4o can achieve 79.3% pass@1, but the best open-source model only achieves 15%. Further experiments reveal that Code-Vision can pose unique challenges compared to other multimodal reasoning benchmarks MMCode and MathVista. We also explore the reason for the poor performance of the open-source models. All data and codes are available at https://github.com/wanghanbinpanda/CodeVision.
中文: 本文介绍了Code-Vision基准,通过要求多模态大语言模型根据流程图生成功能程序来评估其逻辑理解和代码生成能力,实验显示专有模型(如GPT-4o)在复杂任务上显著优于开源模型,性能差距悬殊。
English: This paper presents Code-Vision, a benchmark for assessing MLLMs' logical reasoning and code generation by requiring them to produce functional programs from flowcharts, revealing a significant performance gap where proprietary models like GPT-4o vastly outperform open-source ones, especially on complex tasks.

Authors:Weilin Lin, Nanjun Zhou, Yanyun Wang, Jianze Li, Hui Xiong, Li Liu
Title: BackdoorDM: A Comprehensive Benchmark for Backdoor Learning on Diffusion Model
Abstract:
Backdoor learning is a critical research topic for understanding the vulnerabilities of deep neural networks. While the diffusion model (DM) has been broadly deployed in public over the past few years, the understanding of its backdoor vulnerability is still in its infancy compared to the extensive studies in discriminative models. Recently, many different backdoor attack and defense methods have been proposed for DMs, but a comprehensive benchmark for backdoor learning on DMs is still lacking. This absence makes it difficult to conduct fair comparisons and thorough evaluations of the existing approaches, thus hindering future research progress. To address this issue, we propose \textit{BackdoorDM}, the first comprehensive benchmark designed for backdoor learning on DMs. It comprises nine state-of-the-art (SOTA) attack methods, four SOTA defense strategies, and three useful visualization analysis tools. We first systematically classify and formulate the existing literature in a unified framework, focusing on three different backdoor attack types and five backdoor target types, which are restricted to a single type in discriminative models. Then, we systematically summarize the evaluation metrics for each type and propose a unified backdoor evaluation method based on multimodal large language model (MLLM). Finally, we conduct a comprehensive evaluation and highlight several important conclusions. We believe that BackdoorDM will help overcome current barriers and contribute to building a trustworthy artificial intelligence generated content (AIGC) community. The codes are released in https://github.com/linweiii/BackdoorDM.
中文: 本文提出了首个针对扩散模型后门学习的综合基准BackdoorDM,整合了多种攻击防御方法与评估工具,旨在推动可信人工智能生成内容的发展。
English: This paper introduces BackdoorDM, the first comprehensive benchmark for backdoor learning in diffusion models, integrating multiple attack and defense methods with evaluation tools to advance trustworthy AI-generated content.

Authors:Xuan Ren, Qi Chen, Lingqiao Liu
Title: Efficient Response Generation Strategy Selection for Fine-Tuning Large Language Models Through Self-Aligned Perplexity
Abstract:
Fine-tuning large language models (LLMs) typically relies on producing large sets of input-output pairs. Yet for a given question, there can be many valid outputs. In practice, these outputs are often derived by distilling knowledge from teacher models, and they can vary depending on the specific teacher model or prompting strategy employed. Recent findings show that how these training outputs are generated can significantly affect the performance of the fine-tuned model, raising an important question: how do we pick the best data generation method from among numerous possibilities? Rather than exhaustively training and evaluating on each candidate, this paper proposes a scalable approximate method that assesses a small subset of generated data to estimate its suitability for a specific target LLM. Our central idea is that effective outputs should be familiar to the target LLM. While previous work measures familiarity with perplexity, we find that perplexity might be suboptimal in characterizing familiarity through empirical analyses and practical observations. To address this, we introduce self-aligned perplexity, a novel metric capturing how closely candidate outputs adhere to the target LLM's own style and reasoning patterns. In this way, we can identify the most effective generation strategy on a small sample, then apply it to produce the complete training set. We demonstrate that training on data generated by the chosen method yields significant improvements across diverse reasoning-focused benchmarks, particularly in cases where different candidate methods lead to highly divergent training outcomes. Our implementation is publicly available at https://github.com/XuanRen4470/SPPL.
中文摘要:本文提出一种可扩展方法,通过自对齐困惑度评估少量生成数据来优选微调大语言模型的最佳数据生成策略,在多项推理基准测试中显著提升模型性能。
English Summary: This paper introduces a scalable method using self-aligned perplexity to efficiently select the best data generation strategy for fine-tuning LLMs by evaluating small data samples, which significantly improves model performance across reasoning benchmarks.

Authors:Zengkui Sun, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
Title: Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation
Abstract:
The widespread deployment of Large Language Models (LLMs) is hindered by the high computational demands, making knowledge distillation (KD) crucial for developing compact smaller ones. However, the conventional KD methods endure the distribution mismatch issue between the teacher and student models, leading to the poor performance of distillation. For instance, the widely-used KL-based methods suffer the mode-averaging and mode-collapsing problems, since the mismatched probabitliy distribution between both models. Previous studies mainly optimize this issue via different distance calculations towards the distribution of both models. Unfortunately, the distribution mismatch issue still exists in the early stage of the distillation. Hence, to reduce the impact of distribution mismatch, we propose a simple yet efficient method, named Warmup-Distill, which aligns the distillation of the student to that of the teacher in advance of distillation. Specifically, we first detect the distribution of the student model in practical scenarios with its internal knowledge, and then modify the knowledge with low probability via the teacher as the checker. Consequently, Warmup-Distill aligns the internal student's knowledge to that of the teacher, which expands the distribution of the student with the teacher's, and assists the student model to learn better in the subsequent distillation. Experiments on the seven benchmarks demonstrate that Warmup-Distill could provide a warmup student more suitable for distillation, which outperforms the vanilla student by as least +0.4 averaged score among all benchmarks. Noteably, with the assistance of Warmup-Distill, the distillation on the math task could yield a further improvement, at most +1.9% accuracy.
中文摘要:大语言模型面临计算需求高的挑战,而提出的Warmup-Distill方法通过预先对齐师生模型的知识分布,有效解决了知识蒸馏中的分布不匹配问题,在多个基准测试中实现了性能提升。
English Summary: Large language models face computational challenges, but the proposed Warmup-Distill method effectively addresses distribution mismatch between teacher and student models by pre-aligning their knowledge, resulting in improved performance across multiple benchmarks.

Authors:Yuqi Pang, Bowen Yang, Haoqin Tu, Yun Cao, Zeyu Zhang
Title: Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning
Abstract:
Although Large Language Models (LLMs) excel in reasoning and generation for language tasks, they are not specifically designed for multimodal challenges. Training Multimodal Large Language Models (MLLMs), however, is resource-intensive and constrained by various training limitations. In this paper, we propose the Modular-based Visual Contrastive Decoding (MVCD) framework to move this obstacle. Our framework leverages LLMs' In-Context Learning (ICL) capability and the proposed visual contrastive-example decoding (CED), specifically tailored for this framework, without requiring any additional training. By converting visual signals into text and focusing on contrastive output distributions during decoding, we can highlight the new information introduced by contextual examples, explore their connections, and avoid over-reliance on prior encoded knowledge. MVCD enhances LLMs' visual perception to make it see and reason over the input visuals. To demonstrate MVCD's effectiveness, we conduct experiments with four LLMs across five question answering datasets. Our results not only show consistent improvement in model accuracy but well explain the effective components inside our decoding strategy. Our code will be available at https://github.com/Pbhgit/MVCD.
中文: 提出的模块化视觉对比解码(MVCD)框架通过将视觉信号转化为文本并采用对比解码,无需额外训练即可增强大语言模型的视觉推理能力,在多个数据集上提升了准确性。
English: The proposed Modular-based Visual Contrastive Decoding (MVCD) framework enhances LLMs' visual reasoning without additional training by converting visuals to text and using contrastive decoding, improving accuracy across multiple datasets.

Authors:Yahya Can Tuğrul, A. Giray Yağlıkçı, İsmail Emir Yüksel, Ataberk Olgun, Oğuzhan Canpolat, Nisa Bostancı, Mohammad Sadrosadati, Oğuz Ergin, Onur Mutlu
Title: Understanding RowHammer Under Reduced Refresh Latency: Experimental Analysis of Real DRAM Chips and Implications on Future Solutions
Abstract:
RowHammer is a major read disturbance mechanism in DRAM where repeatedly accessing (hammering) a row of DRAM cells (DRAM row) induces bitflips in physically nearby DRAM rows (victim rows). To ensure robust DRAM operation, state-of-the-art mitigation mechanisms restore the charge in potential victim rows (i.e., they perform preventive refresh or charge restoration). With newer DRAM chip generations, these mechanisms perform preventive refresh more aggressively and cause larger performance, energy, or area overheads. Therefore, it is essential to develop a better understanding and in-depth insights into the preventive refresh to secure real DRAM chips at low cost. In this paper, our goal is to mitigate RowHammer at low cost by understanding the impact of reduced preventive refresh latency on RowHammer. To this end, we present the first rigorous experimental study on the interactions between refresh latency and RowHammer characteristics in real DRAM chips. Our experimental characterization using 388 real DDR4 DRAM chips from three major manufacturers demonstrates that a preventive refresh latency can be significantly reduced (by 64%). To investigate the impact of reduced preventive refresh latency on system performance and energy efficiency, we reduce the preventive refresh latency and adjust the aggressiveness of existing RowHammer solutions by developing a new mechanism, Partial Charge Restoration for Aggressive Mitigation (PaCRAM). Our results show that PaCRAM reduces the performance and energy overheads induced by five state-of-the-art RowHammer mitigation mechanisms with small additional area overhead. Thus, PaCRAM introduces a novel perspective into addressing RowHammer vulnerability at low cost by leveraging our experimental observations. To aid future research, we open-source our PaCRAM implementation at https://github.com/CMU-SAFARI/PaCRAM.
中文: RowHammer是一种DRAM漏洞,反复访问某行会导致相邻行发生比特翻转,本文提出的PaCRAM机制通过将预防性刷新延迟降低64%,有效减少了现有防护方案带来的性能和能耗开销。
English: RowHammer is a DRAM vulnerability where repeatedly accessing a row causes bitflips in adjacent rows, and this paper introduces PaCRAM, a mechanism that reduces preventive refresh latency by 64% to lower performance and energy overheads of existing mitigations.

Authors:Shuai Lyu, Haoran Luo, Ripeng Li, Zhonghong Ou, Jiangfeng Sun, Yang Qin, Xiaoran Shang, Meina Song, Yifan Zhu
Title: SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL
Abstract:
Text-to-SQL (Text2SQL) aims to map natural language questions to executable SQL queries. Although large language models (LLMs) have driven significant progress, current approaches struggle with poor transferability to open-source LLMs, limited robustness against logic and function errors in complex queries, and inefficiencies in structured search. We introduce SQL-o1, a self-reward-driven heuristic search framework built on an agent-based architecture to enhance model reasoning capabilities. SQL-o1 leverages Monte Carlo Tree Search (MCTS) for structured, multi-step exploration, and incorporates a dynamic pruning strategy to accelerate inference without sacrificing accuracy. On the Spider and Bird benchmarks, SQL-o1 achieves a +10.8 execution accuracy improvement on the complex Bird dataset, surpassing even GPT-4-based models. Notably, it exhibits strong few-shot generalization and robust cross-model transferability across open-source LLMs. Our code is available at:https://github.com/ShuaiLyu0110/SQL-o1.
中文: SQL-o1是一个基于自奖励启发式搜索的框架,通过蒙特卡洛树搜索和动态剪枝策略增强模型推理能力,在复杂数据集上实现显著准确率提升,并展现出优秀的少样本泛化能力和跨模型迁移性。
English: SQL-o1 is a self-reward-driven heuristic search framework that enhances reasoning capabilities through Monte Carlo Tree Search and dynamic pruning, achieving significant accuracy improvements on complex datasets and demonstrating strong generalization and transferability across open-source LLMs.

Authors:Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather
Title: LLM Agents Making Agent Tools
Abstract:
Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks by dynamically utilising external software components. However, these tools must be implemented in advance by human developers, hindering the applicability of LLM agents in domains demanding large numbers of highly specialised tools, like in life sciences and medicine. Motivated by the growing trend of scientific studies accompanied by public code repositories, we propose ToolMaker, an agentic framework that autonomously transforms papers with code into LLM-compatible tools. Given a GitHub URL and short task description, ToolMaker autonomously installs dependencies and generates code to perform the task, using a closed-loop self-correction mechanism for debugging. To evaluate our approach, we introduce a benchmark comprising 15 complex computational tasks spanning various domains with over 100 unit tests to assess correctness and robustness. Our method correctly implements 80% of the tasks, substantially outperforming current state-of-the-art software engineering agents. ToolMaker therefore is a step towards fully autonomous agent-based scientific workflows. Our code and benchmark are publicly available at https://github.com/KatherLab/ToolMaker.
中文: ToolMaker 是一个自主框架,能够将附带代码库的研究论文转化为大语言模型兼容的工具,使大语言模型无需人工干预即可创建专业软件组件,在复杂计算任务中显著优于现有智能体。
English: ToolMaker is an autonomous framework that converts research papers with code repositories into LLM-compatible tools, enabling large language models to create specialized software components without human intervention and significantly outperforming existing agents in complex computational tasks.

Authors:Guangya Yu, Yanhao Li, Zongying Jiang, Yuxiong Jin, Li Dai, Yupian Lin, Ruihui Hou, Weiyan Zhang, Yongqi Fan, Qi Ye, Jingping Liu, Tong Ruan
Title: CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation
Abstract:
Medical quality control indicators are essential to assess the qualifications of healthcare institutions for medical services. With the impressive performance of large language models (LLMs) like GPT-4 in the medical field, leveraging these technologies for the Medical Quality Control Indicator Calculation (MQCIC) presents a promising approach. In this work, (1) we introduce a real-world task MQCIC and propose an open-source Chinese electronic medical records (EMRs)-based dataset (CMQCIC-Bench) comprising 785 instances and 76 indicators. (2) We propose a semi-automatic method to enhance the rule representation. Then we propose the Clinical Facts-based Inferential Rule (CF-IR) method that disentangles the clinical fact verification and inferential rule reasoning actions. (3) We conduct comprehensive experiments on 20 representative LLMs, covering general and medical models. Our findings reveal that CF-IR outperforms Chain-of-Thought methods in MQCIC tasks. (4) We conduct an error analysis and investigate the capabilities of clinical fact verification and inferential rule reasoning, providing insights to improve performance in the MQCIC further. The dataset and code is available in this repository https://github.com/YuY-2001/C-MQCIC.
中文摘要:本研究提出了一种基于临床事实的推理规则方法,在中文电子病历的医疗质控指标计算中优于思维链方法,并通过在20个大语言模型上的全面实验验证了其有效性。
English Summary: This study introduces a clinical fact-based inferential rule method that outperforms chain-of-thought approaches in medical quality control indicator calculations using Chinese electronic medical records, supported by comprehensive experiments on 20 large language models.

Authors:Yuncheng Hua, Lizhen Qu, Zhuang Li, Hao Xue, Flora D. Salim, Gholamreza Haffari
Title: RIDE: Enhancing Large Language Model Alignment through Restyled In-Context Learning Demonstration Exemplars
Abstract:
Alignment tuning is crucial for ensuring large language models (LLMs) behave ethically and helpfully. Current alignment approaches require high-quality annotations and significant training resources. This paper proposes a low-cost, tuning-free method using in-context learning (ICL) to enhance LLM alignment. Through an analysis of high-quality ICL demos, we identified style as a key factor influencing LLM alignment capabilities and explicitly restyled ICL exemplars based on this stylistic framework. Additionally, we combined the restyled demos to achieve a balance between the two conflicting aspects of LLM alignment--factuality and safety. We packaged the restyled examples as prompts to trigger few-shot learning, improving LLM alignment. Compared to the best baseline approach, with an average score of 5.00 as the maximum, our method achieves a maximum 0.10 increase on the Alpaca task (from 4.50 to 4.60), a 0.22 enhancement on the Just-eval benchmark (from 4.34 to 4.56), and a maximum improvement of 0.32 (from 3.53 to 3.85) on the MT-Bench dataset. We release the code and data at https://github.com/AnonymousCode-ComputerScience/RIDE.
中文: 本文提出了一种低成本、无需调优的方法,通过情境学习重构示例风格来平衡事实性与安全性,从而提升大语言模型的对齐效果,并在多个基准测试中取得显著提升。
English: This paper introduces a low-cost, tuning-free method using in-context learning to enhance LLM alignment by restyling exemplars to balance factuality and safety, achieving significant improvements across multiple benchmarks.

Authors:Marco ComunitÃ, Christian J. Steinmetz, Joshua D. Reiss
Title: NablAFx: A Framework for Differentiable Black-box and Gray-box Modeling of Audio Effects
Abstract:
We present NablAFx, an open-source framework developed to support research in differentiable black-box and gray-box modeling of audio effects. Built in PyTorch, NablAFx offers a versatile ecosystem to configure, train, evaluate, and compare various architectural approaches. It includes classes to manage model architectures, datasets, and training, along with features to compute and log losses, metrics and media, and plotting functions to facilitate detailed analysis. It incorporates implementations of established black-box architectures and conditioning methods, as well as differentiable DSP blocks and controllers, enabling the creation of both parametric and non-parametric gray-box signal chains. The code is accessible at https://github.com/mcomunita/nablafx.
中文:NablAFx 是一个基于 PyTorch 的开源框架,用于支持音频效果的可微分黑盒与灰盒建模研究,集成了多种架构、数据集管理及分析工具,便于模型训练与比较。
English: NablAFx is an open-source PyTorch framework designed for differentiable black-box and gray-box modeling of audio effects, providing tools for configuration, training, and analysis with pre-implemented architectures and DSP blocks.

Authors:Zikang Liu, Longteng Guo, Yepeng Tang, Tongtian Yue, Junxian Cai, Kai Ma, Qingbin Liu, Xi Chen, Jing Liu
Title: VRoPE: Rotary Position Embedding for Video Large Language Models
Abstract:
Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs), but extending it to video remains a challenge due to the intricate spatiotemporal structure of video frames. Existing adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations: positional bias in attention distribution and disruptions in video-text transitions. To overcome these issues, we propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs. Specifically, we introduce a more balanced encoding strategy that mitigates attention biases, ensuring a more uniform distribution of spatial focus. Additionally, our approach restructures positional indices to ensure a smooth transition between video and text tokens. Extensive experiments on different models demonstrate that VRoPE consistently outperforms previous RoPE variants, achieving significant improvements in video understanding, temporal reasoning, and retrieval tasks. Code will be available at https://github.com/johncaged/VRoPE.
中文: VRoPE是一种专为视频大语言模型设计的新型位置编码方法,通过平衡注意力分布和确保视频与文本标记间的平滑过渡,有效克服了现有方法的局限性,在视频理解任务中表现出显著优势。
English: VRoPE is a novel positional encoding method for Video-LLMs that addresses limitations in existing RoPE adaptations by balancing attention distribution and ensuring smooth video-text transitions, leading to superior performance in video understanding tasks.

Authors:Linjie Mu, Zhongzhen Huang, Shengqian Qin, Yakun Zhu, Shaoting Zhang, Xiaofan Zhang
Title: MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression
Abstract:
Large vision-language models (LVLMs) have shown great promise in medical applications, particularly in visual question answering (MedVQA) and diagnosis from medical images. However, existing datasets and models often fail to consider critical aspects of medical diagnostics, such as the integration of historical records and the analysis of disease progression over time. In this paper, we introduce MMXU (Multimodal and MultiX-ray Understanding), a novel dataset for MedVQA that focuses on identifying changes in specific regions between two patient visits. Unlike previous datasets that primarily address single-image questions, MMXU enables multi-image questions, incorporating both current and historical patient data. We demonstrate the limitations of current LVLMs in identifying disease progression on MMXU-\textit{test}, even those that perform well on traditional benchmarks. To address this, we propose a MedRecord-Augmented Generation (MAG) approach, incorporating both global and regional historical records. Our experiments show that integrating historical records significantly enhances diagnostic accuracy by at least 20\%, bridging the gap between current LVLMs and human expert performance. Additionally, we fine-tune models with MAG on MMXU-\textit{dev}, which demonstrates notable improvements. We hope this work could illuminate the avenue of advancing the use of LVLMs in medical diagnostics by emphasizing the importance of historical context in interpreting medical images. Our dataset is released at github: https://github.com/linjiemu/MMXU.
中文: 本文提出了用于医学视觉问答的新型数据集MMXU,通过整合历史患者记录实现多图像分析,并提出医疗记录增强生成方法,结合全局和局部历史数据将诊断准确率提升至少20%。
English: This paper introduces MMXU, a novel dataset for medical visual question answering that enables multi-image analysis by incorporating historical patient records, and proposes a MedRecord-Augmented Generation (MAG) approach that improves diagnostic accuracy by at least 20% by integrating both global and regional historical data.

Authors:Habib Larian, Faramarz Safi-Esfahani
Title: InTec: integrated things-edge computing: a framework for distributing machine learning pipelines in edge AI systems
Abstract:
With the rapid expansion of the Internet of Things (IoT), sensors, smartphones, and wearables have become integral to daily life, powering smart applications in home automation, healthcare, and intelligent transportation. However, these advancements face significant challenges due to latency and bandwidth constraints imposed by traditional cloud based machine learning (ML) frameworks. The need for innovative solutions is evident as cloud computing struggles with increased latency and network congestion. Previous attempts to offload parts of the ML pipeline to edge and cloud layers have yet to fully resolve these issues, often worsening system response times and network congestion due to the computational limitations of edge devices. In response to these challenges, this study introduces the InTec (Integrated Things Edge Computing) framework, a groundbreaking innovation in IoT architecture. Unlike existing methods, InTec fully leverages the potential of a three tier architecture by strategically distributing ML tasks across the Things, Edge, and Cloud layers. This comprehensive approach enables real time data processing at the point of data generation, significantly reducing latency, optimizing network traffic, and enhancing system reliability. InTec effectiveness is validated through empirical evaluation using the MHEALTH dataset for human motion detection in smart homes, demonstrating notable improvements in key metrics: an 81.56 percent reduction in response time, a 10.92 percent decrease in network traffic, a 9.82 percent improvement in throughput, a 21.86 percent reduction in edge energy consumption, and a 25.83 percent reduction in cloud energy consumption. These advancements establish InTec as a new benchmark for scalable, responsive, and energy efficient IoT applications, demonstrating its potential to revolutionize how the ML pipeline is integrated into Edge AI (EI) systems.
Chinese: InTec框架通过将机器学习任务策略性地分配到物端、边缘和云端三层,有效解决了物联网的延迟和带宽问题,显著提升了响应速度、网络流量及能源效率。
English: The InTec framework addresses IoT latency and bandwidth challenges by strategically distributing machine learning tasks across Things, Edge, and Cloud layers, achieving significant improvements in response time, network traffic, and energy efficiency.

Authors:Dariush Lotfi, Mohammad-Ali Nikouei Mahani, Mohamad Koohi-Moghadam, Kyongtae Ty Bae
Title: Safeguarding AI in Medical Imaging: Post-Hoc Out-of-Distribution Detection with Normalizing Flows
Abstract:
In AI-driven medical imaging, the failure to detect out-of-distribution (OOD) data poses a severe risk to clinical reliability, potentially leading to critical diagnostic errors. Current OOD detection methods often demand impractical retraining or modifications to pre-trained models, hindering their adoption in regulated clinical environments. To address this challenge, we propose a post-hoc normalizing flow-based approach that seamlessly integrates with existing pre-trained models without altering their weights. Our evaluation used a novel in-house built dataset, MedOOD, meticulously curated to simulate clinically relevant distributional shifts, alongside the MedMNIST benchmark dataset. On our in-house MedOOD dataset, our method achieved an AUROC of 84.61%, outperforming state-of-the-art methods like ViM (80.65%) and MDS (80.87%). Similarly, on MedMNIST, it reached an exceptional AUROC of 93.8%, surpassing leading approaches such as ViM (88.08%) and ReAct (87.05%). This superior performance, coupled with its post-hoc integration capability, positions our method as a vital safeguard for enhancing safety in medical imaging workflows. The model and code to build OOD datasets are publicly accessible at https://github.com/dlotfi/MedOODFlow.
中文摘要:本研究提出了一种后验归一化流方法,无需重新训练即可与预训练的医学影像模型无缝集成,在MedOOD和MedMNIST数据集上分别实现了84.61%和93.8%的AUROC评分,显著提升了分布外检测性能。
English Summary: This study introduces a post-hoc normalizing flow-based method that integrates with pre-trained medical imaging models without retraining, achieving superior OOD detection performance with AUROC scores of 84.61% on MedOOD and 93.8% on MedMNIST datasets.

Authors:Xiaoyi Dong, Jian Cheng, Xi Sheryl Zhang
Title: Maximum Entropy Reinforcement Learning with Diffusion Policy
Abstract:
The Soft Actor-Critic (SAC) algorithm with a Gaussian policy has become a mainstream implementation for realizing the Maximum Entropy Reinforcement Learning (MaxEnt RL) objective, which incorporates entropy maximization to encourage exploration and enhance policy robustness. While the Gaussian policy performs well on simpler tasks, its exploration capacity and potential performance in complex multi-goal RL environments are limited by its inherent unimodality. In this paper, we employ the diffusion model, a powerful generative model capable of capturing complex multimodal distributions, as the policy representation to fulfill the MaxEnt RL objective, developing a method named MaxEnt RL with Diffusion Policy (MaxEntDP). Our method enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Experimental results on Mujoco benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at https://github.com/diffusionyes/MaxEntDP.
高斯策略的软演员-评论家算法被扩散模型取代,形成名为MaxEntDP的新方法,通过捕捉多模态分布增强了复杂环境中的探索能力和性能表现。
The Soft Actor-Critic algorithm with a Gaussian policy is enhanced by replacing it with a diffusion model, named MaxEntDP, which improves exploration and performance in complex environments by capturing multimodal distributions.

Authors:Leyi Pan, Aiwei Liu, Shiyu Huang, Yijian Lu, Xuming Hu, Lijie Wen, Irwin King, Philip S. Yu
Title: Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?
Abstract:
The radioactive nature of Large Language Model (LLM) watermarking enables the detection of watermarks inherited by student models when trained on the outputs of watermarked teacher models, making it a promising tool for preventing unauthorized knowledge distillation. However, the robustness of watermark radioactivity against adversarial actors remains largely unexplored. In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN). Extensive experiments across multiple model pairs, watermarking schemes and hyper-parameter settings demonstrate that both TP and WN thoroughly eliminate inherited watermarks, with WN achieving this while maintaining knowledge transfer efficiency and low computational overhead. Given the ongoing deployment of watermarking techniques in production LLMs, these findings emphasize the urgent need for more robust defense strategies. Our code is available at https://github.com/THU-BPM/Watermark-Radioactivity-Attack.
中文摘要:研究表明,通过针对性改写或推理时中和的方法可以有效消除大语言模型中水印的放射性,在保持知识传递的同时破坏水印检测,凸显了开发更强水印防御机制的迫切需求。
English Summary: The study reveals that watermark radioactivity in LLMs can be effectively removed through targeted paraphrasing or inference-time neutralization, compromising detection while preserving knowledge transfer, highlighting the need for more robust watermarking defenses.

Authors:Arnaud Bougaham, Benoît Frénay
Title: Towards a Trustworthy Anomaly Detection for Critical Applications through Approximated Partial AUC Loss
Abstract:
Anomaly Detection is a crucial step for critical applications such in the industrial, medical or cybersecurity domains. These sectors share the same requirement of handling differently the different types of classification errors. Indeed, even if false positives are acceptable, false negatives are not, because it would reflect a missed detection of a quality issue, a disease or a cyber threat. To fulfill this requirement, we propose a method that dynamically applies a trustworthy approximated partial AUC ROC loss (tapAUC). A binary classifier is trained to optimize the specific range of the AUC ROC curve that prevents the True Positive Rate (TPR) to reach 100% while minimizing the False Positive Rate (FPR). The optimal threshold that does not trigger any false negative is then kept and used at the test step. The results show a TPR of 92.52% at a 20.43% FPR for an average across 6 datasets, representing a TPR improvement of 4.3% for a FPR cost of 12.2% against other state-of-the-art methods. The code is available at https://github.com/ArnaudBougaham/tapAUC.
中文: 本文提出了一种可信赖的近似部分AUC ROC损失方法,通过优化分类器在关键应用中避免漏报,在六个数据集上实现了92.52%的真阳性率和20.43%的假阳性率。
English: This paper introduces a trustworthy approximated partial AUC ROC loss method that optimizes classifiers to prevent false negatives in critical applications, achieving a 92.52% true positive rate at a 20.43% false positive rate across six datasets.

Authors:Jaehyeong Jo, Sung Ju Hwang
Title: Continuous Diffusion Model for Language Modeling
Abstract:
Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. Yet diffusion models that directly work on discrete data space do not fully exploit the power of iterative refinement, as the signals are lost during the transition between discrete states. Existing continuous diffusion models for discrete data have limited performance compared to discrete approaches, and the unclear link between them restricts the development of diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between the discrete diffusion and continuous flow on the statistical manifold, and building on the analogy, we introduce a simple design for the diffusion process that generalizes previous discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry and a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. Codes available at \href{https://github.com/harryjo97/RDLM}{https://github.com/harryjo97/RDLM}.
中文摘要:本文提出了一种用于语言建模的连续扩散模型,该模型结合了分类分布的几何特性,通过在统计流形上建立离散扩散与连续流的联系,超越了现有离散扩散模型并接近自回归模型的性能。
English Summary: This paper introduces a continuous diffusion model for language modeling that leverages the geometry of categorical distributions, establishing a connection between discrete diffusion and continuous flow on statistical manifolds to outperform existing discrete diffusion models and approach autoregressive model performance.

Authors:Lior Cohen, Kaixin Wang, Bingyi Kang, Uri Gadot, Shie Mannor
Title: Uncovering Untapped Potential in Sample-Efficient World Model Agents
Abstract:
World model (WM) agents enable sample-efficient reinforcement learning by learning policies entirely from simulated experience. However, existing token-based world models (TBWMs) are limited to visual inputs and discrete actions, restricting their adoption and applicability. Moreover, although both intrinsic motivation and prioritized WM replay have shown promise in improving WM performance and generalization, they remain underexplored in this setting, particularly in combination. We introduce Simulus, a highly modular TBWM agent that integrates (1) a modular multi-modality tokenization framework, (2) intrinsic motivation, (3) prioritized WM replay, and (4) regression-as-classification for reward and return prediction. Simulus achieves state-of-the-art sample efficiency for planning-free WMs across three diverse benchmarks. Ablation studies reveal the individual contribution of each component while highlighting their synergy. Our code and model weights are publicly available at https://github.com/leor-c/Simulus.
Chinese: Simulus作为一种模块化基于令牌的世界模型智能体,整合了多模态令牌化、内在动机、优先回放和回归分类方法,在三个不同基准测试中实现了最先进的样本效率。
English: Simulus is a modular token-based world model agent that integrates multi-modality tokenization, intrinsic motivation, prioritized replay, and regression-as-classification, achieving state-of-the-art sample efficiency across three benchmarks.

Authors:Jiahong Liu, Zexuan Qiu, Zhongyang Li, Quanyu Dai, Wenhao Yu, Jieming Zhu, Minda Hu, Menglin Yang, Tat-Seng Chua, Irwin King
Title: A Survey of Personalized Large Language Models: Progress and Future Directions
Abstract:
Large Language Models (LLMs) excel in handling general knowledge tasks, yet they struggle with user-specific personalization, such as understanding individual emotions, writing styles, and preferences. Personalized Large Language Models (PLLMs) tackle these challenges by leveraging individual user data, such as user profiles, historical dialogues, content, and interactions, to deliver responses that are contextually relevant and tailored to each user's specific needs. This is a highly valuable research topic, as PLLMs can significantly enhance user satisfaction and have broad applications in conversational agents, recommendation systems, emotion recognition, medical assistants, and more. This survey reviews recent advancements in PLLMs from three technical perspectives: prompting for personalized context (input level), finetuning for personalized adapters (model level), and alignment for personalized preferences (objective level). To provide deeper insights, we also discuss current limitations and outline several promising directions for future research. Updated information about this survey can be found at the https://github.com/JiahongLiu21/Awesome-Personalized-Large-Language-Models.
中文: 个性化大语言模型(PLLMs)通过利用用户个人数据来弥补通用大语言模型在个性化服务上的不足,其应用覆盖对话系统和推荐系统等领域,当前研究主要从输入、模型和目标三个技术层面推进个性化实现。
English: Personalized Large Language Models (PLLMs) enhance user-specific interactions by leveraging individual data to address limitations in general LLMs, with applications spanning conversational agents and recommendation systems, while current research focuses on input, model, and objective-level personalization techniques.

Authors:Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang
Title: Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More
Abstract:
Vision tokens in multimodal large language models often dominate huge computational overhead due to their excessive length compared to linguistic modality. Abundant recent methods aim to solve this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that the importance is not an ideal indicator to decide whether a token should be pruned. Surprisingly, it usually results in inferior performance than random token pruning and leading to incompatibility to efficient attention computation operators.Instead, we propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on its duplication with other tokens, leading to significant and training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with low duplication to the pivots, ensuring minimal information loss during token pruning. Experiments demonstrate that DART can prune 88.9% vision tokens while maintaining comparable performance, leading to a 1.99$\times$ and 2.99$\times$ speed-up in total time and prefilling stage, respectively, with good compatibility to efficient attention operators. Our codes are available at https://github.com/ZichenWen1/DART.
中文: 本文提出DART方法,通过基于视觉令牌与关键令牌的重复性进行剪枝,无需训练即可实现高达88.9%的令牌削减和近3倍加速,同时保持模型性能。
English: This paper introduces DART, a training-free method that accelerates multimodal models by pruning vision tokens based on duplication with pivot tokens, achieving up to 88.9% token reduction and nearly 3x speed-up while maintaining performance.

Authors:Kung-Hsiang Huang, Can Qin, Haoyi Qiu, Philippe Laban, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu
Title: Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
Abstract:
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks, yet they often struggle with visual arithmetic, seemingly simple capabilities like object counting or length comparison, which are essential for relevant complex tasks like chart understanding and geometric reasoning. In this work, we first investigate the root causes of this deficiency through a suite of probing tasks focusing on basic visual arithmetic. Our analysis reveals that while pre-trained vision encoders typically capture sufficient information, the text decoder often fails to decode it correctly for arithmetic reasoning. To address this, we propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development. CogAlign trains VLMs to recognize invariant properties under visual transformations. We demonstrate that this approach significantly improves the performance of three diverse VLMs on our proposed probing tasks. Furthermore, CogAlign enhances performance by an average of 4.6% on CHOCOLATE and 2.9% on MATH-VISION, outperforming or matching supervised fine-tuning methods while requiring only 60% less training data. These results highlight the effectiveness and generalizability of CogAlign in improving fundamental visual arithmetic capabilities and their transfer to downstream tasks.
Chinese: 视觉语言模型常因文本解码器问题在视觉算术任务中表现不佳,但提出的CogAlign训练策略显著提升了其性能与泛化能力,同时减少了数据需求。
English: Vision Language Models often fail at visual arithmetic tasks due to text decoder limitations, but the proposed CogAlign training strategy significantly enhances their performance and generalizability with less data.

Authors:Haochen Li, Wanjin Feng, Xin Zhou, Zhiqi Shen
Title: GiFT: Gibbs Fine-Tuning for Code Generation
Abstract:
Training Large Language Models (LLMs) with synthetic data is a prevalent practice in code generation. A key approach is self-training, where LLMs are iteratively trained on self-generated correct code snippets. In this case, the self-generated codes are drawn from a conditional distribution, conditioned on a specific seed description. However, the seed description is not the only valid representation that aligns with its intended meaning. With all valid descriptions and codes forming a joint space, codes drawn from the conditional distribution would lead to an underrepresentation of the full description-code space. As such, we propose Gibbs Fine-Tuning (GiFT), a novel self-training method inspired by Gibbs sampling. GiFT allows self-generated data to be drawn from the marginal distribution of the joint space, thereby mitigating the biases inherent in conditional sampling. We provide a theoretical analysis demonstrating the potential benefits of fine-tuning LLMs with code derived from the marginal distribution. Furthermore, we propose a perplexity-based code selection method to mitigate the imbalanced long-tail distribution of the self-generated codes. Empirical evaluation of two LLMs across four datasets demonstrates that GiFT achieves superior performance, particularly on more challenging benchmarks. Source code is available at https://github.com/Alex-HaochenLi/GiFT.
中文摘要:本研究提出吉布斯微调(GiFT)方法,通过从描述-代码联合空间的边际分布中采样,解决了代码生成中的代表性不足问题,显著提升了大型语言模型在复杂基准测试中的表现。
English Summary: The study introduces Gibbs Fine-Tuning (GiFT), a self-training method that addresses the underrepresentation in code generation by sampling from the marginal distribution of the joint description-code space, enhancing LLM performance on challenging benchmarks.

Authors:Jiwoo Kim, Geunsik Bae, Changseung Kim, Jinwoo Lee, Woojae Shin, Hyondong Oh
Title: Doppler Correspondence: Non-Iterative Scan Matching With Doppler Velocity-Based Correspondence
Abstract:
Achieving successful scan matching is essential for LiDAR odometry. However, in challenging environments with adverse weather conditions or repetitive geometric patterns, LiDAR odometry performance is degraded due to incorrect scan matching. Recently, the emergence of frequency-modulated continuous wave 4D LiDAR and 4D radar technologies has provided the potential to address these unfavorable conditions. The term 4D refers to point cloud data characterized by range, azimuth, and elevation along with Doppler velocity. Although 4D data is available, most scan matching methods for 4D LiDAR and 4D radar still establish correspondence by repeatedly identifying the closest points between consecutive scans, overlooking the Doppler information. This paper introduces, for the first time, a simple Doppler velocity-based correspondence -- Doppler Correspondence -- that is invariant to translation and small rotation of the sensor, with its geometric and kinematic foundations. Extensive experiments demonstrate that the proposed method enables the direct matching of consecutive point clouds without an iterative process, making it computationally efficient. Additionally, it provides a more robust correspondence estimation in environments with repetitive geometric patterns.The implementation of our proposed method is publicly available at https://github.com/Tars0523/Doppler Correspondence.
中文摘要:本文首次提出了一种基于多普勒速度的对应关系方法,用于4D激光雷达和雷达的扫描匹配,无需迭代过程即可直接配准连续点云,在重复几何图案环境中展现出更高的计算效率和鲁棒性。
English Summary: This paper introduces a novel Doppler velocity-based correspondence method for 4D LiDAR and radar scan matching that enables direct point cloud alignment without iterative processes, demonstrating improved computational efficiency and robustness in challenging environments with repetitive patterns.

Authors:Ivo Gollini Navarrete, Nicolas Mauricio Cuadrado, Jose Renato Restom, Martin Takáč, Samuel Horváth
Title: Fishing For Cheap And Efficient Pruners At Initialization
Abstract:
Pruning offers a promising solution to mitigate the associated costs and environmental impact of deploying large deep neural networks (DNNs). Traditional approaches rely on computationally expensive trained models or time-consuming iterative prune-retrain cycles, undermining their utility in resource-constrained settings. To address this issue, we build upon the established principles of saliency (LeCun et al., 1989) and connection sensitivity (Lee et al., 2018) to tackle the challenging problem of one-shot pruning neural networks (NNs) before training (PBT) at initialization. We introduce Fisher-Taylor Sensitivity (FTS), a computationally cheap and efficient pruning criterion based on the empirical Fisher Information Matrix (FIM) diagonal, offering a viable alternative for integrating first- and second-order information to identify a model's structurally important parameters. Although the FIM-Hessian equivalency only holds for convergent models that maximize the likelihood, recent studies (Karakida et al., 2019) suggest that, even at initialization, the FIM captures essential geometric information of parameters in overparameterized NNs, providing the basis for our method. Finally, we demonstrate empirically that layer collapse, a critical limitation of data-dependent pruning methodologies, is easily overcome by pruning within a single training epoch after initialization. We perform experiments on ResNet18 and VGG19 with CIFAR-10 and CIFAR-100, widely used benchmarks in pruning research. Our method achieves competitive performance against state-of-the-art techniques for one-shot PBT, even under extreme sparsity conditions. Our code is made available to the public.
中文摘要:本文提出Fisher-Taylor敏感性(FTS)这一高效剪枝标准,通过利用经验Fisher信息矩阵识别结构重要参数,可在训练前实现一次性神经网络剪枝,避免了昂贵的迭代计算过程。
English Summary: This paper introduces Fisher-Taylor Sensitivity (FTS), an efficient pruning criterion that enables one-shot neural network pruning before training by leveraging the empirical Fisher Information Matrix to identify structurally important parameters without costly iterative cycles.

Authors:Hao Xu, Tengfei Xue, Jianan Fan, Dongnan Liu, Yuqian Chen, Fan Zhang, Carl-Fredrik Westin, Ron Kikinis, Lauren J. O'Donnell, Weidong Cai
Title: Medical Image Registration Meets Vision Foundation Model: Prototype Learning and Contour Awareness
Abstract:
Medical image registration is a fundamental task in medical image analysis, aiming to establish spatial correspondences between paired images. However, existing unsupervised deformable registration methods rely solely on intensity-based similarity metrics, lacking explicit anatomical knowledge, which limits their accuracy and robustness. Vision foundation models, such as the Segment Anything Model (SAM), can generate high-quality segmentation masks that provide explicit anatomical structure knowledge, addressing the limitations of traditional methods that depend only on intensity similarity. Based on this, we propose a novel SAM-assisted registration framework incorporating prototype learning and contour awareness. The framework includes: (1) Explicit anatomical information injection, where SAM-generated segmentation masks are used as auxiliary inputs throughout training and testing to ensure the consistency of anatomical information; (2) Prototype learning, which leverages segmentation masks to extract prototype features and aligns prototypes to optimize semantic correspondences between images; and (3) Contour-aware loss, a contour-aware loss is designed that leverages the edges of segmentation masks to improve the model's performance in fine-grained deformation fields. Extensive experiments demonstrate that the proposed framework significantly outperforms existing methods across multiple datasets, particularly in challenging scenarios with complex anatomical structures and ambiguous boundaries. Our code is available at https://github.com/HaoXu0507/IPMI25-SAM-Assisted-Registration.
中文摘要:本文提出了一种新颖的医学图像配准框架,通过结合分割一切模型(SAM)的原型学习和轮廓感知损失来注入解剖知识,在多个数据集上显著超越了现有方法的准确性和鲁棒性。
English Summary: This paper introduces a novel medical image registration framework that integrates the Segment Anything Model (SAM) to inject anatomical knowledge through prototype learning and contour-aware loss, significantly outperforming existing methods in accuracy and robustness.

Authors:Lulu Yu, Keping Bi, Jiafeng Guo, Shihao Liu, Dawei Yin, Xueqi Cheng
Title: Unbiased Learning to Rank with Query-Level Click Propensity Estimation: Beyond Pointwise Observation and Relevance
Abstract:
Most existing unbiased learning-to-rank (ULTR) approaches are based on the user examination hypothesis, which assumes that users will click a result only if it is both relevant and observed (typically modeled by position). However, in real-world scenarios, users often click only one or two results after examining multiple relevant options, due to limited patience or because their information needs have already been satisfied. Motivated by this, we propose a query-level click propensity model to capture the probability that users will click on different result lists, allowing for non-zero probabilities that users may not click on an observed relevant result. We hypothesize that this propensity increases when more potentially relevant results are present, and refer to this user behavior as relevance saturation bias. Our method introduces a Dual Inverse Propensity Weighting (DualIPW) mechanism -- combining query-level and position-level IPW -- to address both relevance saturation and position bias. Through theoretical derivation, we prove that DualIPW can learn an unbiased ranking model. Experiments on the real-world Baidu-ULTR dataset demonstrate that our approach significantly outperforms state-of-the-art ULTR baselines. The code and dataset information can be found at https://github.com/Trustworthy-Information-Access/DualIPW.
中文摘要:本文提出双重逆倾向加权(DualIPW)方法,通过结合查询级和位置级逆倾向加权,同时解决相关性饱和偏差与位置偏差问题,在真实数据集上的实验证明该方法显著优于现有最优学习排序模型。
English Summary: This paper introduces a Dual Inverse Propensity Weighting (DualIPW) method to address both relevance saturation bias and position bias in learning-to-rank systems, demonstrating superior performance over existing approaches through theoretical validation and experiments on real-world data.

Authors:Junru Lu, Jiazheng Li, Guodong Shen, Lin Gui, Siyu An, Yulan He, Di Yin, Xing Sun
Title: RoleMRC: A Fine-Grained Composite Benchmark for Role-Playing and Instruction-Following
Abstract:
Role-playing is important for Large Language Models (LLMs) to follow diverse instructions while maintaining role identity and the role's pre-defined ability limits. Existing role-playing datasets mostly contribute to controlling role style and knowledge boundaries, but overlook role-playing in instruction-following scenarios. We introduce a fine-grained role-playing and instruction-following composite benchmark, named RoleMRC, including: (1) Multi-turn dialogues between ideal roles and humans, including free chats or discussions upon given passages; (2) Role-playing machine reading comprehension, involving response, refusal, and attempts according to passage answerability and role ability; (3) More complex scenarios with nested, multi-turn and prioritized instructions. The final RoleMRC features a 10.2k role profile meta-pool, 37.9k well-synthesized role-playing instructions, and 1.4k testing samples. We develop a pipeline to quantitatively evaluate the fine-grained role-playing and instruction-following capabilities of several mainstream LLMs, as well as models that are fine-tuned on our data. Moreover, cross-evaluation on external role-playing datasets confirms that models fine-tuned on RoleMRC enhances instruction-following without compromising general role-playing and reasoning capabilities. We also probe the neural-level activation maps of different capabilities over post-tuned LLMs. Access to our RoleMRC, RoleMRC-mix and Codes: https://github.com/LuJunru/RoleMRC.
中文:RoleMRC是一个新颖的基准,通过多轮对话和复杂场景增强大语言模型的细粒度角色扮演和指令遵循能力,评估显示其在提升性能的同时不损害通用能力。
English: RoleMRC is a novel benchmark designed to enhance large language models' fine-grained role-playing and instruction-following abilities through multi-turn dialogues and complex scenarios, with evaluations showing improved performance without sacrificing general capabilities.

Authors:Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong, Guoqi Li
Title: Without Paired Labeled Data: End-to-End Self-Supervised Learning for Drone-view Geo-Localization
Abstract:
Drone-view Geo-Localization (DVGL) aims to achieve accurate localization of drones by retrieving the most relevant GPS-tagged satellite images. However, most existing methods heavily rely on strictly pre-paired drone-satellite images for supervised learning. When the target region shifts, new paired samples are typically required to adapt to the distribution changes. The high cost of annotation and the limited transferability of these methods significantly hinder the practical deployment of DVGL in open-world scenarios. To address these limitations, we propose a novel end-to-end self-supervised learning method with a shallow backbone network, called the dynamic memory-driven and neighborhood information learning (DMNIL) method. It employs a clustering algorithm to generate pseudo-labels and adopts a dual-path contrastive learning framework to learn discriminative intra-view representations. Furthermore, DMNIL incorporates two core modules, including the dynamic hierarchical memory learning (DHML) module and the information consistency evolution learning (ICEL) module. The DHML module combines short-term and long-term memory to enhance intra-view feature consistency and discriminability. Meanwhile, the ICEL module utilizes a neighborhood-driven dynamic constraint mechanism to systematically capture implicit cross-view semantic correlations, consequently improving cross-view feature alignment. To further stabilize and strengthen the self-supervised training process, a pseudo-label enhancement strategy is introduced to enhance the quality of pseudo supervision. Extensive experiments on three public benchmark datasets demonstrate that the proposed method consistently outperforms existing self-supervised methods and even surpasses several state-of-the-art supervised methods. Our code is available at https://github.com/ISChenawei/DMNIL.
中文: 本文提出DMNIL,一种自监督的无人机视角地理定位方法,无需配对图像,通过伪标签和对比学习结合记忆模块,有效提升了跨视角特征对齐能力。
English: This paper introduces DMNIL, a self-supervised method for drone-view geo-localization that eliminates the need for paired drone-satellite images by using pseudo-labels and contrastive learning with memory modules to enhance feature alignment across views.

Authors:Lei Li, Xiao Zhou
Title: Leave No One Behind: Enhancing Diversity While Maintaining Accuracy in Social Recommendation
Abstract:
Social recommendation, which incorporates social connections into recommender systems, has proven effective in improving recommendation accuracy. However, beyond accuracy, diversity is also crucial for enhancing user engagement. Despite its importance, the impact of social recommendation models on diversity remains largely unexplored. In this study, we systematically examine the dual performance of existing social recommendation algorithms in terms of both accuracy and diversity. Our empirical analysis reveals a concerning trend: while social recommendation models enhance accuracy, they often reduce diversity. To address this issue, we propose Diversified Social Recommendation (DivSR), a novel approach that employs relational knowledge distillation to transfer high-diversity structured knowledge from non-social recommendation models to social recommendation models. DivSR is a lightweight, model-agnostic framework that seamlessly integrates with existing social recommendation architectures. Experiments on three benchmark datasets demonstrate that DivSR significantly enhances diversity while maintaining competitive accuracy, achieving a superior accuracy-diversity trade-off. Our code and data are publicly available at: https://github.com/ll0ruc/DivSR.
中文: 社交推荐模型在提升准确性的同时常会降低多样性,为此提出的DivSR框架通过关系知识蒸馏,在保持准确性的前提下显著提升了多样性。
English: Social recommendation models often improve accuracy but reduce diversity, prompting the development of DivSR, a model-agnostic framework that enhances diversity while maintaining competitive accuracy through relational knowledge distillation.

Authors:Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman
Title: Sparse Autoencoder Features for Classifications and Transferability
Abstract:
Sparse Autoencoders (SAEs) provide potentials for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAE for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and BoW baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications. Full repo: https://github.com/shan23chen/MOSAIC.
中文摘要:稀疏自编码器能够从大语言模型中提取可解释特征,在安全关键任务中超越基线方法,并通过优化配置实现跨模型与跨任务的泛化能力。
English Summary: Sparse Autoencoders effectively extract interpretable features from Large Language Models, surpassing baseline methods in safety-critical tasks and enabling cross-model and cross-task generalization with optimized configurations.

Authors:Shaina Raza, Ashmal Vayani, Aditya Jain, Aravind Narayanan, Vahid Reza Khazaie, Syed Raza Bashir, Elham Dolatabadi, Gias Uddin, Christos Emmanouilidis, Rizwan Qureshi, Mubarak Shah
Title: VLDBench Evaluating Multimodal Disinformation with Regulatory Alignment
Abstract:
Detecting disinformation that blends manipulated text and images has become increasingly challenging, as AI tools make synthetic content easy to generate and disseminate. While most existing AI safety benchmarks focus on single modality misinformation (i.e., false content shared without intent to deceive), intentional multimodal disinformation, such as propaganda or conspiracy theories that imitate credible news, remains largely unaddressed. We introduce the Vision-Language Disinformation Detection Benchmark (VLDBench), the first large-scale resource supporting both unimodal (text-only) and multimodal (text + image) disinformation detection. VLDBench comprises approximately 62,000 labeled text-image pairs across 13 categories, curated from 58 news outlets. Using a semi-automated pipeline followed by expert review, 22 domain experts invested over 500 hours to produce high-quality annotations with substantial inter-annotator agreement. Evaluations of state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) on VLDBench show that incorporating visual cues improves detection accuracy by 5 to 35 percentage points over text-only models. VLDBench provides data and code for evaluation, fine-tuning, and robustness testing to support disinformation analysis. Developed in alignment with AI governance frameworks (e.g., the MIT AI Risk Repository), VLDBench offers a principled foundation for advancing trustworthy disinformation detection in multimodal media. Project: https://vectorinstitute.github.io/VLDBench/ Dataset: https://huggingface.co/datasets/vector-institute/VLDBench Code: https://github.com/VectorInstitute/VLDBench
中文: VLDBench是首个支持单模态和多模态虚假信息检测的大规模基准,研究表明结合视觉线索比纯文本模型将检测准确率提高了5-35%。
English: VLDBench is the first large-scale benchmark for detecting both unimodal and multimodal disinformation, showing that incorporating visual cues improves detection accuracy by 5-35% over text-only models.

Authors:Seunghyuk Cho, Zhenyue Qin, Yang Liu, Youngbin Choi, Seungbeom Lee, Dongwoo Kim
Title: GeoDANO: Geometric VLM with Domain Agnostic Vision Encoder
Abstract:
We introduce GeoDANO, a geometric vision-language model (VLM) with a domain-agnostic vision encoder, for solving plane geometry problems. Although VLMs have been employed for solving geometry problems, their ability to recognize geometric features remains insufficiently analyzed. To address this gap, we propose a benchmark that evaluates the recognition of visual geometric features, including primitives such as dots and lines, and relations such as orthogonality. Our preliminary study shows that vision encoders often used in general-purpose VLMs, e.g., OpenCLIP, fail to detect these features and struggle to generalize across domains. To overcome the limitation, we develop GeoCLIP, a CLIP-based model trained on synthetic geometric diagram--caption pairs. Benchmark results show that GeoCLIP outperforms existing vision encoders in recognizing geometric features. We then propose our VLM, GeoDANO, which augments GeoCLIP with a domain adaptation strategy for unseen diagram styles. GeoDANO outperforms specialized methods for plane geometry problems and GPT-4o on MathVerse. The implementation is available at https://github.com/ml-postech/GeoDANO.
Chinese: GeoDANO是一种具有领域无关视觉编码器的几何视觉语言模型,通过集成GeoCLIP增强几何特征识别和跨领域适应能力,在平面几何问题解决上超越了专门方法和GPT-4o的基准表现。
English: GeoDANO is a geometric vision-language model with a domain-agnostic vision encoder that excels at solving plane geometry problems by incorporating GeoCLIP for enhanced feature recognition and domain adaptation, outperforming specialized methods and GPT-4o on benchmarks.

Authors:Rongwu Xu, Xiaojian Li, Shuo Chen, Wei Xu
Title: Nuclear Deployed: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents
Abstract:
Large language models (LLMs) are evolving into autonomous decision-makers, raising concerns about catastrophic risks in high-stakes scenarios, particularly in Chemical, Biological, Radiological and Nuclear (CBRN) domains. Based on the insight that such risks can originate from trade-offs between the agent's Helpful, Harmlessness and Honest (HHH) goals, we build a novel three-stage evaluation framework, which is carefully constructed to effectively and naturally expose such risks. We conduct 14,400 agentic simulations across 12 advanced LLMs, with extensive experiments and analysis. Results reveal that LLM agents can autonomously engage in catastrophic behaviors and deception, without being deliberately induced. Furthermore, stronger reasoning abilities often increase, rather than mitigate, these risks. We also show that these agents can violate instructions and superior commands. On the whole, we empirically prove the existence of catastrophic risks in autonomous LLM agents. We release our code to foster further research.
中文摘要:大型语言模型在作为自主智能体时存在灾难性风险,实验表明更强的推理能力反而会加剧危险行为,包括欺骗和违反指令,而非降低风险。
English Summary: Large language models acting as autonomous agents pose catastrophic risks in high-stakes CBRN scenarios, with experiments revealing that stronger reasoning capabilities often amplify rather than reduce dangerous behaviors including deception and command violations.

Authors:Andrii Krutsylo
Title: Non-Uniform Memory Sampling in Experience Replay
Abstract:
Continual learning is the process of training machine learning models on a sequence of tasks where data distributions change over time. A well-known obstacle in this setting is catastrophic forgetting, a phenomenon in which a model drastically loses performance on previously learned tasks when learning new ones. A popular strategy to alleviate this problem is experience replay, in which a subset of old samples is stored in a memory buffer and replayed with new data. Despite continual learning advances focusing on which examples to store and how to incorporate them into the training loss, most approaches assume that sampling from this buffer is uniform by default. We challenge the assumption that uniform sampling is necessarily optimal. We conduct an experiment in which the memory buffer updates the same way in every trial, but the replay probability of each stored sample changes between trials based on different random weight distributions. Specifically, we generate 50 different non-uniform sampling probability weights for each trial and compare their final accuracy to the uniform sampling baseline. We find that there is always at least one distribution that significantly outperforms the baseline across multiple buffer sizes, models, and datasets. These results suggest that more principled adaptive replay policies could yield further gains. We discuss how exploiting this insight could inspire new research on non-uniform memory sampling in continual learning to better mitigate catastrophic forgetting. The code supporting this study is available at $\href{https://github.com/DentonJC/memory-sampling}{https://github.com/DentonJC/memory-sampling}$.
中文: 本研究挑战了持续学习中经验回放的均匀采样假设,证明非均匀采样策略在不同设置下始终优于基线,表明自适应回放策略能更有效地缓解灾难性遗忘问题。
English: This study challenges the uniform sampling assumption in continual learning's experience replay, demonstrating that non-uniform sampling strategies consistently outperform the baseline across various settings, suggesting adaptive replay policies could better mitigate catastrophic forgetting.

Authors:Yanran Wu, Inez Hua, Yi Ding
Title: Unveiling Environmental Impacts of Large Language Model Serving: A Functional Unit View
Abstract:
Large language models (LLMs) offer powerful capabilities but come with significant environmental impact, particularly in carbon emissions. Existing studies benchmark carbon emissions but lack a standardized basis for comparison across different model configurations. To address this, we introduce the concept of functional unit (FU) as a standardized basis and develop FUEL, the first FU-based framework for evaluating LLM serving's environmental impact. Through three case studies, we uncover key insights and trade-offs in reducing carbon emissions by optimizing model size, quantization strategy, and hardware choice, paving the way for more sustainable LLM serving. The code is available at https://github.com/jojacola/FUEL.
中文: 研究者提出基于功能单元的FUEL框架,为大型语言模型的环境影响评估建立统一标准,并通过案例研究揭示模型配置优化对降低碳排放的关键作用。
English: The authors propose FUEL, a functional unit-based framework to standardize environmental impact assessments of large language models, demonstrating through case studies how optimizing model configurations can reduce carbon emissions.

Authors:Sayantan Adak, Somnath Banerjee, Rajarshi Mandal, Avik Halder, Sayan Layek, Rima Hazra, Animesh Mukherjee
Title: MemeSense: An Adaptive In-Context Framework for Social Commonsense Driven Meme Moderation
Abstract:
Memes present unique moderation challenges due to their subtle, multimodal interplay of images, text, and social context. Standard systems relying predominantly on explicit textual cues often overlook harmful content camouflaged by irony, symbolism, or cultural references. To address this gap, we introduce MemeSense, an adaptive in-context learning framework that fuses social commonsense reasoning with visually and semantically related reference examples. By encoding crucial task information into a learnable cognitive shift vector, MemeSense effectively balances lexical, visual, and ethical considerations, enabling precise yet context-aware meme intervention. Extensive evaluations on a curated set of implicitly harmful memes demonstrate that MemeSense substantially outperforms strong baselines, paving the way for safer online communities. Code and data available at: https://github.com/sayantan11995/MemeSense
中文: MemeSense提出了一种自适应框架,融合社会常识推理与多模态参考,有效检测并干预标准系统难以识别的有害模因,显著提升了内容审核的效果。
English: MemeSense introduces an adaptive framework that combines social commonsense reasoning with multimodal references to effectively detect and intervene in harmful memes overlooked by standard systems, significantly improving moderation performance.

Authors:Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, Huajun Chen
Title: How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training
Abstract:
Despite exceptional capabilities in knowledge-intensive tasks, Large Language Models (LLMs) face a critical gap in understanding how they internalize new knowledge, particularly how to structurally embed acquired knowledge in their neural computations. We address this issue through the lens of knowledge circuit evolution, identifying computational subgraphs that facilitate knowledge storage and processing. Our systematic analysis of circuit evolution throughout continual pre-training reveals several key findings: (1) the acquisition of new knowledge is influenced by its relevance to pre-existing knowledge; (2) the evolution of knowledge circuits exhibits a distinct phase shift from formation to optimization; (3) the evolution of knowledge circuits follows a deep-to-shallow pattern. These insights not only advance our theoretical understanding of the mechanisms of new knowledge acquisition in LLMs, but also provide potential implications for improving continual pre-training strategies to enhance model performance. Code and data will be available at https://github.com/zjunlp/DynamicKnowledgeCircuits.
Chinese: 本研究通过知识回路演化的视角,揭示了大语言模型获取新知识受其与已有知识相关性影响,遵循从深层到浅层的模式,并经历形成到优化的阶段转变,为改进持续预训练策略提供了理论依据。
English: This study investigates how Large Language Models structurally embed new knowledge through knowledge circuit evolution, revealing that acquisition depends on relevance to existing knowledge, follows a deep-to-shallow pattern, and shifts from formation to optimization, offering insights to improve continual pre-training strategies.

Authors:Haoming Xu, Ningyuan Zhao, Liming Yang, Sendong Zhao, Shumin Deng, Mengru Wang, Bryan Hooi, Nay Oo, Huajun Chen, Ningyu Zhang
Title: ReLearn: Unlearning via Learning for Large Language Models
Abstract:
Current unlearning methods for large language models usually rely on reverse optimization to reduce target token probabilities. However, this paradigm disrupts the subsequent tokens prediction, degrading model performance and linguistic coherence. Moreover, existing evaluation metrics overemphasize contextual forgetting while inadequately assessing response fluency and relevance. To address these challenges, we propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning, along with a comprehensive evaluation framework. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation, and Linguistic Score (LS) to evaluate generation quality. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output. Through mechanistic analysis, we further demonstrate how reverse optimization disrupts coherent text generation, while ReLearn preserves this essential capability. Code is available at https://github.com/zjunlp/unlearn.
中文摘要:提出的ReLearn方法通过数据增强和微调有效实现大语言模型的定向遗忘,同时保持输出质量,优于会破坏语言连贯性的反向优化方法。
English Summary: The proposed ReLearn method effectively achieves targeted unlearning in large language models through data augmentation and fine-tuning while maintaining output quality, outperforming reverse optimization approaches that compromise linguistic coherence.

Authors:Ante Wang, Linfeng Song, Ye Tian, Dian Yu, Haitao Mi, Xiangyu Duan, Zhaopeng Tu, Jinsong Su, Dong Yu
Title: Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls
Abstract:
Recent advancements in tree search algorithms guided by verifiers have significantly enhanced the reasoning capabilities of large language models (LLMs), but at the cost of increased computational resources. In this work, we identify two key challenges contributing to this inefficiency: $\textit{over-exploration}$ due to redundant states with semantically equivalent content, and $\textit{under-exploration}$ caused by high variance in verifier scoring leading to frequent trajectory switching. To address these issues, we propose FETCH, an e$\textbf{f}$fici$\textbf{e}$nt $\textbf{t}$ree sear$\textbf{ch}$ framework, which is a flexible, plug-and-play system compatible with various tree search algorithms. Our framework mitigates over-exploration by merging semantically similar states using agglomerative clustering of text embeddings obtained from a fine-tuned SimCSE model. To tackle under-exploration, we enhance verifiers by incorporating temporal difference learning with adjusted $λ$-returns during training to reduce variance, and employing a verifier ensemble to aggregate scores during inference. Experiments on GSM8K, GSM-Plus, and MATH datasets demonstrate that our methods significantly improve reasoning accuracy and computational efficiency across four different tree search algorithms, paving the way for more practical applications of LLM-based reasoning. The code is available at https://github.com/Soistesimmer/Fetch.
中文: FETCH框架通过聚合语义相似状态减少过度探索,并利用时序差分学习增强验证器以解决探索不足,从而显著提升大语言模型的推理准确性和计算效率。
English: The FETCH framework enhances tree search efficiency in large language models by merging semantically similar states to reduce over-exploration and improving verifier reliability with temporal difference learning to address under-exploration, thereby boosting both reasoning accuracy and computational performance.

Authors:Shilong Yang, Qi Zang, Chulong Zhang, Lingfeng Huang, Yaoqin Xie
Title: RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer
Abstract:
Traditional Chinese acupuncture methods often face controversy in clinical practice due to their high subjectivity. Additionally, current intelligent-assisted acupuncture systems have two major limitations: slow acupoint localization speed and low accuracy. To address these limitations, a new method leverages the excellent inference efficiency of the state-space model Mamba, while retaining the advantages of the attention mechanism in the traditional DETR architecture, to achieve efficient global information integration and provide high-quality feature information for acupoint localization tasks. Furthermore, by employing the concept of residual likelihood estimation, it eliminates the need for complex upsampling processes, thereby accelerating the acupoint localization task. Our method achieved state-of-the-art (SOTA) accuracy on a private dataset of acupoints on the human back, with an average Euclidean distance pixel error (EPE) of 7.792 and an average time consumption of 10.05 milliseconds per localization task. Compared to the second-best algorithm, our method improved both accuracy and speed by approximately 14\%. This significant advancement not only enhances the efficacy of acupuncture treatment but also demonstrates the commercial potential of automated acupuncture robot systems. Access to our method is available at https://github.com/Sohyu1/RT-DEMT
中文:新型针灸定位方法融合Mamba推理效率与DETR注意力机制,通过残差似然估计避免复杂上采样,在背部穴位数据集上以7.792像素误差和10.05毫秒单次定位速度达到最优精度。
English: A novel acupuncture localization method combining Mamba's inference efficiency with DETR's attention mechanism achieves state-of-the-art accuracy (7.792px error) and speed (10.05ms per task) while eliminating complex upsampling through residual likelihood estimation.

Authors:Tianshi Zheng, Jiayang Cheng, Chunyang Li, Haochen Shi, Zihao Wang, Jiaxin Bai, Yangqiu Song, Ginny Y. Wong, Simon See
Title: LogiDynamics: Unraveling the Dynamics of Inductive, Abductive and Deductive Logical Inferences in LLM Reasoning
Abstract:
Modern large language models (LLMs) employ diverse logical inference mechanisms for reasoning, making the strategic optimization of these approaches critical for advancing their capabilities. This paper systematically investigate the comparative dynamics of inductive (System 1) versus abductive/deductive (System 2) inference in LLMs. We utilize a controlled analogical reasoning environment, varying modality (textual, visual, symbolic), difficulty, and task format (MCQ / free-text). Our analysis reveals System 2 pipelines generally excel, particularly in visual/symbolic modalities and harder tasks, while System 1 is competitive for textual and easier problems. Crucially, task format significantly influences their relative advantage, with System 1 sometimes outperforming System 2 in free-text rule-execution. These core findings generalize to broader in-context learning. Furthermore, we demonstrate that advanced System 2 strategies like hypothesis selection and iterative refinement can substantially scale LLM reasoning. This study offers foundational insights and actionable guidelines for strategically deploying logical inference to enhance LLM reasoning. Resources are available at https://github.com/HKUST-KnowComp/LogiDynamics.
中文摘要:本研究表明,在复杂视觉/符号任务中系统2推理通常优于系统1,而系统1在简单文本任务中仍具竞争力,且任务格式显著影响两者的相对表现优势。
English Summary: This study demonstrates that System 2 reasoning generally outperforms System 1 in complex visual/symbolic tasks, while System 1 remains competitive in simpler textual tasks, with task format significantly influencing their relative performance.

Authors:Jeonghyun Park, Hwanhee Lee
Title: Investigating Language Preference of Multilingual RAG Systems
Abstract:
Multilingual Retrieval-Augmented Generation (mRAG) systems enhance language models by integrating external multilingual information to produce context-aware responses. However, mRAG systems struggle with retrieving relevant information due to linguistic variations between queries and documents, generating inconsistent responses when multilingual sources conflict. In this work, we systematically investigate language preferences in both retrieval and generation of mRAG through a series of experiments. Our analysis indicates that retrievers tend to prefer high-resource and query languages, yet this preference does not consistently improve generation performance. Moreover, we observe that generators prefer the query language or Latin scripts, leading to inconsistent outputs. To overcome these issues, we propose Dual Knowledge Multilingual RAG (DKM-RAG), a simple yet effective framework that fuses translated multilingual passages with complementary model knowledge. Empirical results demonstrate that DKM-RAG mitigates language preference in generation and enhances performance across diverse linguistic settings. Code is available at https://github.com/jeonghyunpark2002/LanguagePreference.git
中文:多语言检索增强生成系统因查询与文档间的语言差异及多语言源冲突而难以检索相关信息并产生不一致响应,为此提出的双重知识多语言RAG框架融合翻译段落与模型知识,有效缓解语言偏好并提升跨语言性能。
English: Multilingual Retrieval-Augmented Generation (mRAG) systems face challenges in retrieving relevant information and generating consistent responses due to linguistic variations and conflicting sources, which are addressed by the proposed Dual Knowledge mRAG (DKM-RAG) framework that fuses translated passages with model knowledge to improve performance across languages.

Authors:Bohan Lyu, Siqiao Huang, Zichen Liang, Qi-An Sun, Jiaming Zhang
Title: SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
Abstract:
Neural surrogate models have emerged as powerful and efficient tools in data mining. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks. We investigate a novel application: using LLMs as surrogate models for code execution prediction. Given LLMs' unique ability to understand and process diverse programs, they present a promising direction for building general-purpose surrogate models. To systematically investigate this capability, we introduce SURGE, a comprehensive benchmark with $1160$ problems covering $8$ key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. Through extensive empirical analysis of $21$ open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy. Our findings reveal important insights about the feasibility of LLMs as efficient surrogates for computational processes, with implications for automated software testing, program analysis, and computational resource optimization in data mining applications. Code and dataset are released at https://github.com/Imbernoulli/SURGE.
中文摘要:本研究通过SURGE基准系统评估大语言模型在代码执行预测中作为神经代理模型的可行性,涵盖多语言编程、竞赛题目等八大维度,对21个模型的测试揭示了其在计算过程中替代作用的重要潜力。
English Summary: The study introduces SURGE, a benchmark evaluating whether large language models (LLMs) can effectively serve as neural surrogate models for code execution prediction across diverse programming scenarios, revealing key insights about their feasibility through comprehensive testing of 21 models.

Authors:Bohan Lyu, Siqiao Huang, Zichen Liang, Qi-An Sun, Jiaming Zhang
Title: SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
Abstract:
Neural surrogate models are powerful and efficient tools in data mining. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as generation and understanding. However, an equally important yet underexplored question is whether LLMs can serve as surrogate models for code execution prediction. To systematically investigate it, we introduce SURGE, a comprehensive benchmark with $1160$ problems covering $8$ key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. Through extensive analysis of $21$ open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy. Our findings reveal important insights about the feasibility of LLMs as efficient surrogates for computational processes. The benchmark and evaluation framework are available at https://github.com/Imbernoulli/SURGE.
中文摘要:本研究通过SURGE基准系统评估大语言模型在代码执行预测中作为神经代理模型的可行性,涵盖多语言编程、竞赛题目等八大维度,对21个模型的测试揭示了其在计算过程中替代作用的重要潜力。
English Summary: The study introduces SURGE, a benchmark evaluating whether large language models (LLMs) can effectively serve as neural surrogate models for code execution prediction across diverse programming scenarios, revealing key insights about their feasibility through comprehensive testing of 21 models.

Authors:Jingyuan Huang, Jen-tse Huang, Ziyi Liu, Xiaoyuan Liu, Wenxuan Wang, Jieyu Zhao
Title: AI Sees Your Location, But With A Bias Toward The Wealthy World
Abstract:
Visual-Language Models (VLMs) have shown remarkable performance across various tasks, particularly in recognizing geographic information from images. However, VLMs still show regional biases in this task. To systematically evaluate these issues, we introduce a benchmark consisting of 1,200 images paired with detailed geographic metadata. Evaluating four VLMs, we find that while these models demonstrate the ability to recognize geographic information from images, achieving up to 53.8% accuracy in city prediction, they exhibit significant biases. Specifically, performance is substantially higher for economically developed and densely populated regions compared to less developed (-12.5%) and sparsely populated (-17.0%) areas. Moreover, regional biases of frequently over-predicting certain locations remain. For instance, they consistently predict Sydney for images taken in Australia, shown by the low entropy scores for these countries. The strong performance of VLMs also raises privacy concerns, particularly for users who share images online without the intent of being identified. Our code and dataset are publicly available at https://github.com/uscnlp-lime/FairLocator.
中文: 视觉语言模型在识别图像地理信息方面表现出显著准确性,但存在明显区域偏见,偏向发达和人口稠密地区,同时引发了对在线图像分享隐私问题的担忧。
English: Visual-Language Models demonstrate notable accuracy in identifying geographic details from images but exhibit significant regional biases, favoring developed and densely populated areas while raising privacy concerns for online image sharing.

Authors:Yanwei Yue, Guibin Zhang, Boyang Liu, Guancheng Wan, Kun Wang, Dawei Cheng, Yiyan Qi
Title: MasRouter: Learning to Route LLMs for Multi-Agent Systems
Abstract:
Multi-agent systems (MAS) powered by Large Language Models (LLMs) have been demonstrated to push the boundaries of LLM capabilities, yet they often incur significant costs and face challenges in dynamic LLM selection. Current LLM routing methods effectively reduce overhead in single-agent scenarios by customizing LLM selection for each query, but they overlook the critical decisions regarding collaboration modes and agent roles in MAS. In response to this challenge, we first introduce the problem of Multi-Agent System Routing (MASR), which integrates all components of MAS into a unified routing framework. Toward this goal, we propose MasRouter, the first high-performing, cost-effective, and inductive MASR solution. MasRouter employs collaboration mode determination, role allocation, and LLM routing through a cascaded controller network, progressively constructing a MAS that balances effectiveness and efficiency. Extensive experiments demonstrate that MasRouter is (1) high-performing, achieving a $1.8\%\sim8.2\%$ improvement over the state-of-the-art method on MBPP; (2) economical, reducing overhead by up to $52.07\%$ compared to SOTA methods on HumanEval; and (3) plug-and-play, seamlessly integrating with mainstream MAS frameworks, reducing overhead by $17.21\%\sim28.17\%$ via customized routing. The code is available at https://github.com/yanweiyue/masrouter.
中文摘要:MasRouter提出了一种多智能体系统统一路由框架,通过协作模式决策、角色分配和大语言模型路由的级联控制,在提升系统性能的同时大幅降低了运行开销。
English Summary: MasRouter introduces a unified routing framework for multi-agent systems that optimizes collaboration modes, role allocation, and LLM selection to enhance performance while significantly reducing computational costs.

Authors:Yuqi Liu, Yan Zheng
Title: Improving Similar Case Retrieval Ranking Performance By Revisiting RankSVM
Abstract:
Given the rapid development of Legal AI, a lot of attention has been paid to one of the most important legal AI tasks--similar case retrieval, especially with language models to use. In our paper, however, we try to improve the ranking performance of current models from the perspective of learning to rank instead of language models. Specifically, we conduct experiments using a pairwise method--RankSVM as the classifier to substitute a fully connected layer, combined with commonly used language models on similar case retrieval datasets LeCaRDv1 and LeCaRDv2. We finally come to the conclusion that RankSVM could generally help improve the retrieval performance on the LeCaRDv1 and LeCaRDv2 datasets compared with original classifiers by optimizing the precise ranking. It could also help mitigate overfitting owing to class imbalance. Our code is available in https://github.com/liuyuqi123study/RankSVM_for_SLR
中文摘要:本文通过在学习排序框架中用RankSVM替代传统分类器,显著提升了相似案例检索在LeCaRD数据集上的排序精度并缓解了过拟合问题。
English Summary: This paper enhances similar case retrieval performance by replacing traditional classifiers with RankSVM in learning-to-rank frameworks, demonstrating improved ranking accuracy and reduced overfitting on LeCaRD datasets.

Authors:Shilong Wang, Guibin Zhang, Miao Yu, Guancheng Wan, Fanci Meng, Chongye Guo, Kun Wang, Yang Wang
Title: G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems
Abstract:
Large Language Model (LLM)-based Multi-agent Systems (MAS) have demonstrated remarkable capabilities in various complex tasks, ranging from collaborative problem-solving to autonomous decision-making. However, as these systems become increasingly integrated into critical applications, their vulnerability to adversarial attacks, misinformation propagation, and unintended behaviors have raised significant concerns. To address this challenge, we introduce G-Safeguard, a topology-guided security lens and treatment for robust LLM-MAS, which leverages graph neural networks to detect anomalies on the multi-agent utterance graph and employ topological intervention for attack remediation. Extensive experiments demonstrate that G-Safeguard: (I) exhibits significant effectiveness under various attack strategies, recovering over 40% of the performance for prompt injection; (II) is highly adaptable to diverse LLM backbones and large-scale MAS; (III) can seamlessly combine with mainstream MAS with security guarantees. The code is available at https://github.com/wslong20/G-safeguard.
中文: G-Safeguard是一种基于拓扑引导的安全框架,通过图神经网络检测异常并实施拓扑干预,能有效提升基于大语言模型的多智能体系统对抗各类攻击的鲁棒性,同时保持与不同系统架构的兼容性。
English: G-Safeguard is a topology-guided security framework that uses graph neural networks to detect anomalies and apply topological interventions, effectively enhancing the robustness of LLM-based multi-agent systems against various attacks while maintaining compatibility with diverse system architectures.

Authors:Zongyuan Li, Chang Lu, Xiaojie Xu, Runnan Qi, Yanan Ni, Lumin Jiang, Xiangbei Liu, Xuebo Zhang, Yongchun Fang, Kuihua Huang, Xian Guo
Title: Hierarchical Expert Prompt for Large-Language-Model: An Approach Defeat Elite AI in TextStarCraft II for the First Time
Abstract:
Since the emergence of the Large Language Model (LLM), LLM has been widely used in fields such as writing, translating, and searching. However, there is still great potential for LLM-based methods in handling complex tasks such as decision-making in the StarCraft II environment. To address problems such as lack of relevant knowledge and poor control over subtasks of varying importance, we propose a Hierarchical Expert Prompt (HEP) for LLM. Our method improves the understanding of game situations through expert-level tactical knowledge, improving the processing quality of tasks of varying importance through a hierarchical framework. Our approach defeated the highest level (Elite) standard built-in agent in TextStarCraft II for the first time and consistently outperformed the baseline method in other difficulties. Our experiments suggest that the proposed method is a practical solution for tackling complex decision-making challenges. The replay video can be viewed on https://www.bilibili.com/video/BV1uz42187EF and https://youtu.be/dO3PshWLV5M, and our codes have been open-sourced on https://github.com/luchang1113/HEP-LLM-play-StarCraftII.
Chinese: 提出的分层专家提示方法通过引入专家知识和分层框架,增强了大型语言模型在《星际争霸II》等复杂环境中的决策能力,实现了对精英级智能体的卓越表现。
English: The proposed Hierarchical Expert Prompt method enhances LLM's decision-making in complex environments like StarCraft II by incorporating expert knowledge and a hierarchical framework, achieving superior performance against elite-level agents.

Authors:Yu Cui, Hang Fu, Licheng Wang, Haibin Zhang
Title: Ramp Up NTT in Record Time using GPU-Accelerated Algorithms and LLM-based Code Generation
Abstract:
Homomorphic encryption (HE) is a core building block in privacy-preserving machine learning (PPML), but HE is also widely known as its efficiency bottleneck. Therefore, many GPU-accelerated cryptographic schemes have been proposed to improve the performance of HE. However, these methods often require complex modifications tailored to specific algorithms and are tightly coupled with specific GPU and operating systems. It is interesting to ask how to generally offer more practical GPU-accelerated cryptographic algorithm implementations. Given the powerful code generation capabilities of large language models (LLMs), we aim to explore their potential to automatically generate practical GPU-friendly algorithm code using CPU-friendly code. In this paper, we focus on number theoretic transform (NTT) -- the core mechanism of HE. We first develop and optimize a GPU-friendly NTT (GNTT) family that exploits PyTorch's fast matrix computation and precomputation, achieving an approximately 62x speedup -- a significant boost over existing ones. Then we explore GPU-friendly code generation using various LLMs, including DeepSeek-R1, OpenAI o1 and o3-mini. We discover many interesting findings throughout the process. For instance, somewhat surprisingly, our experiments demonstrate that DeepSeek-R1 significantly outperforms OpenAI o3-mini and o1, but still cannot beat our optimized protocol. The findings provide valuable insights for turbocharging PPML and enhancing code generation capabilities of LLMs. Codes are available at: https://github.com/LMPC-Lab/GenGPUCrypto.
中文: 同态加密作为隐私保护机器学习中的效率瓶颈,本文通过开发GPU优化的数论变换实现62倍加速,并探索大型语言模型自动生成GPU友好代码的潜力,发现DeepSeek-R1虽优于其他模型但仍未超越人工优化方案。
English: Homomorphic encryption's efficiency bottleneck in privacy-preserving machine learning is addressed by developing a GPU-optimized number theoretic transform that achieves 62x speedup and exploring LLMs' potential for automated GPU-friendly code generation, with DeepSeek-R1 outperforming other models though not surpassing manually optimized protocols.

Authors:Yijie Chen, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
Title: Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping
Abstract:
Knowledge Distillation (KD) has emerged as a prominent technique for model compression. However, conventional KD approaches primarily focus on homogeneous architectures with identical tokenizers, constraining their applicability in cross-architecture scenarios. As for the cross-tokenizer KD, the differences in the tokenizers give rise to two fundamental challenges: (1) sequence misalignment caused by divergent tokenization strategies, and (2) mismatched vocabulary size and composition. While existing probability-matching methods attempt to address these issues, their efficacy remains limited due to suboptimal alignment in both the sequence and vocabulary aspects. To overcome these limitations, we propose Contextual Dynamic Mapping (CDM), a novel cross-tokenizer distillation framework that employs contextual information to enhance sequence alignment precision and dynamically improves vocabulary mapping. We evaluated the effectiveness of our approach across five advanced and widely-used model families (i.e, LLama3, Phi3, Gemma2, OPT and Qwen2), which were configured into three distinct teacher-student pairs. Our method shows significant advantages over existing cross-tokenizer distillation baselines across diverse benchmarks, including instruction-following, code generation and math. Notably, our analysis reveals that combining conventional same-tokenizer distillation and cross-tokenizer distillation through CDM yields further performance improvements. The code is available at https://github.com/pppa2019/ContexualDynamicMapping
中文摘要:知识蒸馏在跨分词器场景下面临序列不对齐和词汇不匹配的挑战,而提出的上下文动态映射(CDM)框架通过增强对齐精度和动态词汇映射,在多种模型家族中实现了显著性能提升。
English Summary: Knowledge distillation faces challenges in cross-tokenizer scenarios due to sequence misalignment and vocabulary mismatch, which the proposed Contextual Dynamic Mapping (CDM) framework addresses by enhancing alignment precision and dynamic vocabulary mapping across multiple model families.

Authors:Hongliang Lu, Zhonglin Xie, Yaoyu Wu, Can Ren, Yuxuan Chen, Zaiwen Wen
Title: OptMATH: A Scalable Bidirectional Data Synthesis Framework for Optimization Modeling
Abstract:
Despite the rapid development of large language models (LLMs), a fundamental challenge persists: the lack of high-quality optimization modeling datasets hampers LLMs' robust modeling of practical optimization problems from natural language descriptions (NL). This data scarcity also contributes to the generalization difficulties experienced by learning-based methods. To address these challenges, we propose a scalable framework for synthesizing a high-quality dataset, named OptMATH. Starting from curated seed data with mathematical formulations (MF), this framework automatically generates problem data (PD) with controllable complexity. Then, a back-translation step is employed to obtain NL. To verify the correspondence between the NL and the PD, a forward modeling step followed by rejection sampling is used. The accepted pairs constitute the training part of OptMATH. Then a collection of rejected pairs is identified and further filtered. This collection serves as a new benchmark for optimization modeling, containing difficult instances whose lengths are much longer than these of NL4OPT and MAMO. Through extensive experiments, we demonstrate that models of various sizes (0.5B-32B parameters) trained on OptMATH achieve superior results on multiple modeling benchmarks, thereby validating the effectiveness and scalability of our approach. Our dataset is publicly available at https://github.com/AuroraLHL/OptMATH.
中文摘要:OptMATH框架通过自动生成可控复杂度的数据并验证其与自然语言的对应关系,解决了高质量优化建模数据集稀缺的问题,使大语言模型在多个基准测试中实现卓越性能。
English Summary: The OptMATH framework addresses the scarcity of high-quality optimization modeling datasets by automatically generating controllable complexity data with verified natural language correspondences, enabling LLMs to achieve superior performance across benchmarks.

Authors:Zhao Wang, Sota Moriyama, Wei-Yao Wang, Briti Gangopadhyay, Shingo Takamatsu
Title: Talk Structurally, Act Hierarchically: A Collaborative Framework for LLM Multi-Agent Systems
Abstract:
Recent advancements in LLM-based multi-agent (LLM-MA) systems have shown promise, yet significant challenges remain in managing communication and refinement when agents collaborate on complex tasks. In this paper, we propose \textit{Talk Structurally, Act Hierarchically (TalkHier)}, a novel framework that introduces a structured communication protocol for context-rich exchanges and a hierarchical refinement system to address issues such as incorrect outputs, falsehoods, and biases. \textit{TalkHier} surpasses various types of SoTA, including inference scaling model (OpenAI-o1), open-source multi-agent models (e.g., AgentVerse), and majority voting strategies on current LLM and single-agent baselines (e.g., ReAct, GPT4o), across diverse tasks, including open-domain question answering, domain-specific selective questioning, and practical advertisement text generation. These results highlight its potential to set a new standard for LLM-MA systems, paving the way for more effective, adaptable, and collaborative multi-agent frameworks. The code is available https://github.com/sony/talkhier.
中文:提出的TalkHier框架通过结构化通信和分层优化机制解决了多智能体系统协作中的关键问题,在多项任务中超越现有先进模型,为高效协作确立了新标准。
English: The proposed TalkHier framework introduces structured communication and hierarchical refinement to overcome collaboration challenges in LLM-based multi-agent systems, outperforming state-of-the-art models across diverse tasks and setting a new standard for effective multi-agent collaboration.

Authors:Yuting Huang, Chengyuan Liu, Yifeng Feng, Yiquan Wu, Chao Wu, Fei Wu, Kun Kuang
Title: Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction
Abstract:
As Large Language Models (LLMs) are widely applied in various domains, the safety of LLMs is increasingly attracting attention to avoid their powerful capabilities being misused. Existing jailbreak methods create a forced instruction-following scenario, or search adversarial prompts with prefix or suffix tokens to achieve a specific representation manually or automatically. However, they suffer from low efficiency and explicit jailbreak patterns, far from the real deployment of mass attacks to LLMs. In this paper, we point out that simply rewriting the original instruction can achieve a jailbreak, and we find that this rewriting approach is learnable and transferable. We propose the Rewrite to Jailbreak (R2J) approach, a transferable black-box jailbreak method to attack LLMs by iteratively exploring the weakness of the LLMs and automatically improving the attacking strategy. The jailbreak is more efficient and hard to identify since no additional features are introduced. Extensive experiments and analysis demonstrate the effectiveness of R2J, and we find that the jailbreak is also transferable to multiple datasets and various types of models with only a few queries. We hope our work motivates further investigation of LLM safety. The code can be found at https://github.com/ythuang02/R2J/.
The paper introduces Rewrite to Jailbreak (R2J), a transferable black-box method that efficiently attacks Large Language Models by iteratively rewriting instructions to exploit model weaknesses without introducing detectable patterns.
English Summary:

Authors:Haoyang Li, Xuejia Chen, Zhanchao XU, Darian Li, Nicole Hu, Fei Teng, Yiming Li, Luyu Qiu, Chen Jason Zhang, Qing Li, Lei Chen
Title: Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models
Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language processing tasks, such as text generation and semantic understanding. However, their performance on numerical reasoning tasks, such as basic arithmetic, numerical retrieval, and magnitude comparison, remains surprisingly poor. This gap arises from their reliance on surface-level statistical patterns rather than understanding numbers as continuous magnitudes. Existing benchmarks primarily focus on either linguistic competence or structured mathematical problem-solving, neglecting fundamental numerical reasoning required in real-world scenarios. To bridge this gap, we propose NumericBench, a comprehensive benchmark to evaluate six fundamental numerical capabilities: number recognition, arithmetic operations, contextual retrieval, comparison, summary, and logical reasoning. NumericBench includes datasets ranging from synthetic number lists to the crawled real-world data, addressing challenges like long contexts, noise, and multi-step reasoning. Extensive experiments on state-of-the-art LLMs, including GPT-4 and DeepSeek, reveal persistent weaknesses in numerical reasoning, highlighting the urgent need to improve numerically-aware language modeling. The benchmark is released in: https://github.com/TreeAI-Lab/NumericBench.
Chinese: 大语言模型在语言任务上表现出色,但在数值推理方面存在明显不足,因其依赖表层统计模式,为此我们提出NumericBench基准来评估六项核心数值能力,并揭示GPT-4等模型的持续缺陷。
English: Large Language Models excel in linguistic tasks but struggle with numerical reasoning due to their reliance on surface patterns, prompting the creation of NumericBench to evaluate six core numerical skills and reveal persistent weaknesses in models like GPT-4.

Authors:Zonghao Ying, Deyue Zhang, Zonglei Jing, Yisong Xiao, Quanchen Zou, Aishan Liu, Siyuan Liang, Xiangzheng Zhang, Xianglong Liu, Dacheng Tao
Title: Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models
Abstract:
Multi-turn jailbreak attacks simulate real-world human interactions by engaging large language models (LLMs) in iterative dialogues, exposing critical safety vulnerabilities. However, existing methods often struggle to balance semantic coherence with attack effectiveness, resulting in either benign semantic drift or ineffective detection evasion. To address this challenge, we propose Reasoning-Augmented Conversation, a novel multi-turn jailbreak framework that reformulates harmful queries into benign reasoning tasks and leverages LLMs' strong reasoning capabilities to compromise safety alignment. Specifically, we introduce an attack state machine framework to systematically model problem translation and iterative reasoning, ensuring coherent query generation across multiple turns. Building on this framework, we design gain-guided exploration, self-play, and rejection feedback modules to preserve attack semantics, enhance effectiveness, and sustain reasoning-driven attack progression. Extensive experiments on multiple LLMs demonstrate that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios, with attack success rates (ASRs) increasing by up to 96%. Notably, our approach achieves ASRs of 82% and 92% against leading commercial models, OpenAI o1 and DeepSeek R1, underscoring its potency. We release our code at https://github.com/NY1024/RACE to facilitate further research in this critical domain.
中文摘要:提出的推理增强对话(RACE)框架通过将有害查询重构为良性推理任务,在多轮越狱攻击中实现了最先进的攻击效果,对主流大模型的攻击成功率最高提升96%。
English Summary: The proposed Reasoning-Augmented Conversation (RACE) framework enhances multi-turn jailbreak attacks by transforming harmful queries into benign reasoning tasks, achieving state-of-the-art effectiveness with up to 96% higher success rates against leading LLMs.

Authors:Jiahao Huo, Yibo Yan, Xu Zheng, Yuanhuiyi Lyu, Xin Zou, Zhihua Wei, Xuming Hu
Title: MMUnlearner: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models
Abstract:
Recent progress in Machine Unlearning (MU) has introduced solutions for the selective removal of private or sensitive information encoded within deep neural networks. Nonetheless, MU for Multimodal Large Language Models (MLLMs) remains in its nascent phase. Therefore, we propose to reformulate the task of multimodal MU in the era of MLLMs, which aims to erase only the visual patterns associated with a given entity while preserving the corresponding textual knowledge encoded within the original parameters of the language model backbone. Furthermore, we develop a novel geometry-constrained gradient ascent method MMUnlearner. It updates the weights of MLLMs with a weight saliency map jointly restricted by the remaining concepts and textual knowledge during unlearning, thereby preserving parameters essential for non-target knowledge. Extensive experiments demonstrate that MMUnlearner surpasses baselines that finetuning MLLMs with VQA data directly through Gradient Ascent (GA) or Negative Preference Optimization (NPO), across all evaluation dimensions. Our code can be found in [this URL](https://github.com/Z1zs/MMUnlearner).
中文: 本研究提出MMUnlearner这一多模态机器遗忘新方法,能在多模态大语言模型中选择性消除特定实体的视觉模式同时保留文本知识,在所有评估维度上均优于现有技术。
English: This study introduces MMUnlearner, a novel method for multimodal machine unlearning that selectively erases visual patterns of specific entities in MLLMs while preserving textual knowledge, outperforming existing techniques across all evaluation metrics.

Authors:Mohammad Mehdi Hosseini, Ali Pourramezan Fard, Mohammad H. Mahoor
Title: Faces of Fairness: Examining Bias in Facial Expression Recognition Datasets and Models
Abstract:
Building AI systems, including Facial Expression Recognition (FER), involves two critical aspects: data and model design. Both components significantly influence bias and fairness in FER tasks. Issues related to bias and fairness in FER datasets and models remain underexplored. This study investigates bias sources in FER datasets and models. Four common FER datasets--AffectNet, ExpW, Fer2013, and RAF-DB--are analyzed. The findings demonstrate that AffectNet and ExpW exhibit high generalizability despite data imbalances. Additionally, this research evaluates the bias and fairness of six deep models, including three state-of-the-art convolutional neural network (CNN) models: MobileNet, ResNet, XceptionNet, as well as three transformer-based models: ViT, CLIP, and GPT-4o-mini. Experimental results reveal that while GPT-4o-mini and ViT achieve the highest accuracy scores, they also display the highest levels of bias. These findings underscore the urgent need for developing new methodologies to mitigate bias and ensure fairness in datasets and models, particularly in affective computing applications. See our implementation details at https://github.com/MMHosseini/bias_in_FER.
中文摘要:本研究分析了四种面部表情识别数据集和六种深度学习模型的偏见与公平性,发现GPT-4o-mini和ViT虽获得最高准确率,但表现出最强偏见,凸显了在情感计算领域减少偏见的迫切需求。
English Summary: This study analyzes bias and fairness in four FER datasets and six deep learning models, finding that while GPT-4o-mini and ViT achieve top accuracy, they exhibit the highest bias, highlighting the need for bias mitigation in affective computing.

Authors:Shijing Hu, Jingyang Li, Xingyu Xie, Zhihui Lu, Kim-Chuan Toh, Pan Zhou
Title: GRIFFIN: Effective Token Alignment for Faster Speculative Decoding
Abstract:
Speculative decoding accelerates inference in large language models (LLMs) by generating multiple draft tokens simultaneously. However, existing methods often struggle with token misalignment between the training and decoding phases, limiting their performance. To address this, we propose GRIFFIN, a novel framework that incorporates a token-alignable training strategy and a token-alignable draft model to mitigate misalignment. The training strategy employs a loss masking mechanism to exclude highly misaligned tokens during training, preventing them from negatively impacting the draft model's optimization. The token-alignable draft model introduces input tokens to correct inconsistencies in generated features. Experiments on LLaMA, Vicuna, Qwen and Mixtral models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 8% and a speedup ratio exceeding 7%, outperforming current speculative decoding state-of-the-art methods. Our code and GRIFFIN's draft models are released publicly in https://github.com/hsj576/GRIFFIN.
中文: GRIFFIN通过可对齐令牌的训练策略和草稿模型解决推测解码中的错位问题,在多个大语言模型上实现超过8%的接受长度提升和7%以上的加速效果。
English: GRIFFIN introduces a token-alignable training strategy and draft model to mitigate misalignment in speculative decoding, achieving over 8% improvement in acceptance length and exceeding 7% speedup across multiple LLMs.

Authors:Jiuwu Hao, Liguo Sun, Ti Xiang, Yuting Wan, Haolin Song, Pin Lv
Title: FeaKM: Robust Collaborative Perception under Noisy Pose Conditions
Abstract:
Collaborative perception is essential for networks of agents with limited sensing capabilities, enabling them to work together by exchanging information to achieve a robust and comprehensive understanding of their environment. However, localization inaccuracies often lead to significant spatial message displacement, which undermines the effectiveness of these collaborative efforts. To tackle this challenge, we introduce FeaKM, a novel method that employs Feature-level Keypoints Matching to effectively correct pose discrepancies among collaborating agents. Our approach begins by utilizing a confidence map to identify and extract salient points from intermediate feature representations, allowing for the computation of their descriptors. This step ensures that the system can focus on the most relevant information, enhancing the matching process. We then implement a target-matching strategy that generates an assignment matrix, correlating the keypoints identified by different agents. This is critical for establishing accurate correspondences, which are essential for effective collaboration. Finally, we employ a fine-grained transformation matrix to synchronize the features of all agents and ascertain their relative statuses, ensuring coherent communication among them. Our experimental results demonstrate that FeaKM significantly outperforms existing methods on the DAIR-V2X dataset, confirming its robustness even under severe noise conditions. The code and implementation details are available at https://github.com/uestchjw/FeaKM.
中文摘要:FeaKM是一种通过特征级关键点匹配来校正姿态差异的新型协同感知方法,在DAIR-V2X数据集上显著优于现有方法,即使在严重噪声条件下也表现出强大鲁棒性。
English Summary: FeaKM is a novel collaborative perception method that corrects pose discrepancies through feature-level keypoint matching, significantly outperforming existing approaches on the DAIR-V2X dataset even under severe noise conditions.

Authors:Pengcheng Jiang, Lang Cao, Ruike Zhu, Minhao Jiang, Yunyi Zhang, Jimeng Sun, Jiawei Han
Title: RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation
Abstract:
Large language models (LLMs) have achieved impressive performance on knowledge-intensive tasks, yet they often struggle with multi-step reasoning due to the unstructured nature of retrieved context. While retrieval-augmented generation (RAG) methods provide external information, the lack of explicit organization among retrieved passages limits their effectiveness, leading to brittle reasoning pathways. Recent interpretability studies highlighting the importance of structured intermediate reasoning further align with this perspective. We propose Retrieval-And-Structuring (RAS), a framework that dynamically constructs query-specific knowledge graphs through iterative retrieval and structured knowledge building. RAS interleaves targeted retrieval planning with incremental graph construction, enabling models to assemble and reason over evolving knowledge structures tailored to each query. On seven knowledge-intensive benchmarks, RAS consistently outperforms strong baselines, achieving up to 6.4% and 7.0% gains with open-source and proprietary LLMs, respectively. Our results demonstrate that dynamic, query-specific knowledge structuring offers a robust path to improving reasoning accuracy and robustness in language model generation. Our data and code can be found at https://github.com/pat-jj/RAS.
Chinese: 提出的检索与结构化(RAS)框架通过迭代检索和结构化动态构建特定查询的知识图谱,在多个基准测试中显著提升了语言模型的推理准确性和鲁棒性。
English: The proposed Retrieval-And-Structuring (RAS) framework dynamically builds query-specific knowledge graphs through iterative retrieval and structuring, significantly enhancing reasoning accuracy and robustness in language models across multiple benchmarks.

Authors:Yixuan Tang, Yi Yang
Title: FinMTEB: Finance Massive Text Embedding Benchmark
Abstract:
Embedding models play a crucial role in representing and retrieving information across various NLP applications. Recent advances in large language models (LLMs) have further enhanced the performance of embedding models. While these models are often benchmarked on general-purpose datasets, real-world applications demand domain-specific evaluation. In this work, we introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a specialized counterpart to MTEB designed for the financial domain. FinMTEB comprises 64 financial domain-specific embedding datasets across 7 tasks that cover diverse textual types in both Chinese and English, such as financial news articles, corporate annual reports, ESG reports, regulatory filings, and earnings call transcripts. We also develop a finance-adapted model, Fin-E5, using a persona-based data synthetic method to cover diverse financial embedding tasks for training. Through extensive evaluation of 15 embedding models, including Fin-E5, we show three key findings: (1) performance on general-purpose benchmarks shows limited correlation with financial domain tasks; (2) domain-adapted models consistently outperform their general-purpose counterparts; and (3) surprisingly, a simple Bag-of-Words (BoW) approach outperforms sophisticated dense embeddings in financial Semantic Textual Similarity (STS) tasks, underscoring current limitations in dense embedding techniques. Our work establishes a robust evaluation framework for financial NLP applications and provides crucial insights for developing domain-specific embedding models.
中文摘要:本文提出金融领域专用基准FinMTEB,通过评估发现领域适配模型优于通用模型,并揭示稠密嵌入在金融语义相似性任务中的现有局限性。
English Summary: This paper introduces FinMTEB, a specialized financial benchmark for evaluating embedding models, and demonstrates that domain-adapted models outperform general ones while revealing surprising limitations of dense embeddings in financial tasks.

Authors:Arjun Vijaywargiya, Shane A. McQuarrie, Anthony Gruber
Title: Tensor parametric Hamiltonian operator inference
Abstract:
This work presents a tensorial approach to constructing data-driven reduced-order models corresponding to semi-discrete partial differential equations with canonical Hamiltonian structure. By expressing parameter-varying operators with affine dependence as contractions of a generalized parameter vector against a constant tensor, this method leverages the operator inference framework to capture parametric dependence in the learned reduced-order model via the solution to a convex, least-squares optimization problem. This leads to a concise and straightforward implementation which compactifies previous parametric operator inference approaches and directly extends to learning parametric operators with symmetry constraints, a key feature required for constructing structure-preserving surrogates of Hamiltonian systems. The proposed approach is demonstrated on both a (non-Hamiltonian) heat equation with variable diffusion coefficient as well as a Hamiltonian wave equation with variable wave speed.
中文: 本研究提出了一种基于张量的方法,用于构建哈密顿偏微分方程的数据驱动降阶模型,通过凸优化简化参数化算子推断,并实现结构保持的代理建模。
English: This study introduces a tensor-based method for creating data-driven reduced-order models of Hamiltonian partial differential equations, simplifying parametric operator inference through convex optimization and enabling structure-preserving surrogate modeling.

Authors:Zongqian Wu, Tianyu Li, Baoduo Xu, Jiaying Yang, Mengmeng Zhan, Xiaofeng Zhu, Lei Feng
Title: Is Depth All You Need? An Exploration of Iterative Reasoning in LLMs
Abstract:
Deep iterative chain-of-thought (CoT) reasoning enables LLMs to tackle complex tasks by progressively activating relevant pre-trained knowledge. However, it faces challenges in ensuring continual improvement and determining a stopping criterion. In this paper, we investigate whether the relevant knowledge that contributes directly to solving the given question can be activated from the initial reasoning path, thus circumventing the need for iterative refinement. Our experiments reveal that increasing the diversity of initial reasoning paths can achieve comparable or superior performance, a concept we term \textit{breadth reasoning}. However, existing breadth reasoning approaches, such as self-consistency, offer limited diversity. To address this limitation, we propose a simple yet effective method that enhances reasoning breadth by integrating contextual exploration with reduced sampling randomness. Extensive experiments demonstrate that our approach significantly outperforms deep iterative reasoning. Our code is provided in https://github.com/zongqianwu/breadth.
Chinese: 深度迭代思维链推理存在持续改进和停止标准的难题,本文提出广度推理方法,通过结合上下文探索与减少采样随机性来多样化初始推理路径,从而显著提升性能。
English: Deep iterative CoT reasoning struggles with continuous improvement and stopping criteria, but this paper introduces breadth reasoning, which enhances performance by diversifying initial reasoning paths through contextual exploration and reduced sampling randomness.

Authors:Shaoxuan Xu, Menglu Cui, Chengxiang Huang, Hongfa Wang, Di Hu
Title: BalanceBenchmark: A Survey for Multimodal Imbalance Learning
Abstract:
Multimodal learning has gained attention for its capacity to integrate information from different modalities. However, it is often hindered by the multimodal imbalance problem, where certain modality dominates while others remain underutilized. Although recent studies have proposed various methods to alleviate this problem, they lack comprehensive and fair comparisons. In this paper, we systematically categorize various mainstream multimodal imbalance algorithms into four groups based on the strategies they employ to mitigate imbalance. To facilitate a comprehensive evaluation of these methods, we introduce BalanceBenchmark, a benchmark including multiple widely used multidimensional datasets and evaluation metrics from three perspectives: performance, imbalance degree, and complexity. To ensure fair comparisons, we have developed a modular and extensible toolkit that standardizes the experimental workflow across different methods. Based on the experiments using BalanceBenchmark, we have identified several key insights into the characteristics and advantages of different method groups in terms of performance, balance degree and computational complexity. We expect such analysis could inspire more efficient approaches to address the imbalance problem in the future, as well as foundation models. The code of the toolkit is available at https://github.com/GeWu-Lab/BalanceBenchmark.
中文: 本文提出BalanceBenchmark,一个用于系统评估多模态不平衡缓解方法的工具包和基准测试,揭示了不同方法在性能与效率方面的关键特征。
English: This paper introduces BalanceBenchmark, a comprehensive toolkit and benchmark for systematically evaluating multimodal imbalance mitigation methods, revealing key insights into their performance and efficiency.

Authors:Zhigang Fang, Renzhi Chen, Zhijie Yang, Yang Guo, Huadong Dai, Lei Wang
Title: LintLLM: An Open-Source Verilog Linting Framework Based on Large Language Models
Abstract:
Code Linting tools are vital for detecting potential defects in Verilog code. However, the limitations of traditional Linting tools are evident in frequent false positives and redundant defect reports. Recent advancements in large language models (LLM) have introduced new possibilities in this area. In this paper, we propose LintLLM, an open-source Linting framework that utilizes LLMs to detect defects in Verilog code via Prompt of Logic-Tree and Defect Tracker. Furthermore, we create an open-source benchmark using the mutation-based defect injection technique to evaluate LLM's ability in detecting Verilog defects. Experimental results show that o1-mini improves the correct rate by 18.89\% and reduces the false-positive rate by 15.56\% compared with the best-performing EDA tool. Simultaneously, LintLLM operates at less than one-tenth of the cost of commercial EDA tools. This study demonstrates the potential of LLM as an efficient and cost-effective Linting tool for hardware design. The benchmark and experimental results are open-source at URL: https://github.com/fangzhigang32/Static-Verilog-Analysis
中文: 本文提出LintLLM,一种利用大语言模型检测Verilog代码缺陷的开源框架,显著提高了检测准确率并降低了误报率,同时运行成本仅为商用EDA工具的十分之一。
English: This paper introduces LintLLM, an open-source framework that uses large language models to detect defects in Verilog code, significantly improving accuracy and reducing false positives while operating at a fraction of the cost of commercial EDA tools.

Authors:Lei Sheng, Shuai-Shuai Xu, Wei Xie
Title: BASE-SQL: A powerful open source Text-To-SQL baseline approach
Abstract:
The conversion of natural language into SQL language for querying databases (Text-to-SQL) has broad application prospects and has attracted widespread attention. At present, the mainstream Text-to-SQL methods are mainly divided into in-context learning (ICL) based methods and supervised fine-tuning (SFT) based methods. ICL-based methods can achieve relatively good results thanks to the use of the most advanced closed-source models. However, in real-world application scenarios, factors such as data privacy, SQL generation efficiency and cost need to be considered. SFT-based methods have certain advantages. At present, methods based on fine-tuning of open source models lack easy-to-implement and effective (cost-effective) baseline methods. We propose a pipeline-based method using open source model fine-tuning, referred to as BASE-SQL, which includes four components: Schema Linking, Candidate SQL Generate, SQL Revision and SQL Merge Revision. Experimental results show that BASE-SQL uses the open source model Qwen2.5-Coder-32B-Instruct, and achieves an accuracy of 67.47% on the BIRD development set and 88.9% on the Spider test set, which is significantly better than other methods using open source models, and even exceeds several methods using the GPT-4o closed-source model. At the same time, BASE-SQL is easy to implement and highly efficient (on average, only five calls to the large language model are required to generate SQL once). The code will be open sourced at https://github.com/CycloneBoy/base_sql.
中文: BASE-SQL是一种基于开源模型微调的管道式Text-to-SQL方法,在基准测试中表现出优越的准确率,同时具备高效易实现的优势。
English: BASE-SQL is a pipeline-based Text-to-SQL method using open-source model fine-tuning that achieves superior accuracy on benchmark datasets while being efficient and easy to implement.

Authors:Ming Meng, Ke Mu, Yonggui Zhu, Zhe Zhu, Haoyu Sun, Heyang Yan, Zhaoxin Fan
Title: VarGes: Improving Variation in Co-Speech 3D Gesture Generation via StyleCLIPS
Abstract:
Generating expressive and diverse human gestures from audio is crucial in fields like human-computer interaction, virtual reality, and animation. Though existing methods have achieved remarkable performance, they often exhibit limitations due to constrained dataset diversity and the restricted amount of information derived from audio inputs. To address these challenges, we present VarGes, a novel variation-driven framework designed to enhance co-speech gesture generation by integrating visual stylistic cues while maintaining naturalness. Our approach begins with the Variation-Enhanced Feature Extraction (VEFE) module, which seamlessly incorporates \textcolor{blue}{style-reference} video data into a 3D human pose estimation network to extract StyleCLIPS, thereby enriching the input with stylistic information. Subsequently, we employ the Variation-Compensation Style Encoder (VCSE), a transformer-style encoder equipped with an additive attention mechanism pooling layer, to robustly encode diverse StyleCLIPS representations and effectively manage stylistic variations. Finally, the Variation-Driven Gesture Predictor (VDGP) module fuses MFCC audio features with StyleCLIPS encodings via cross-attention, injecting this fused data into a cross-conditional autoregressive model to modulate 3D human gesture generation based on audio input and stylistic clues. The efficacy of our approach is validated on benchmark datasets, where it outperforms existing methods in terms of gesture diversity and naturalness. The code and video results will be made publicly available upon acceptance:https://github.com/mookerr/VarGES/ .
中文摘要:VarGes是一种新颖的框架,通过将视觉风格线索与音频输入相结合来增强伴随语音的手势生成,在3D人体手势的多样性和自然度方面优于现有方法。
English Summary: VarGes is a novel framework that enhances co-speech gesture generation by integrating visual style cues with audio inputs, achieving superior diversity and naturalness in 3D human gestures compared to existing methods.

Authors:Xiliang Yang, Shenyang Deng, Shicong Liu, Yuanchi Suo, Wing. W. Y NG, Jianjun Zhang
Title: A Mathematics Framework of Artificial Shifted Population Risk and Its Further Understanding Related to Consistency Regularization
Abstract:
Data augmentation is an important technique in training deep neural networks as it enhances their ability to generalize and remain robust. While data augmentation is commonly used to expand the sample size and act as a consistency regularization term, there is a lack of research on the relationship between them. To address this gap, this paper introduces a more comprehensive mathematical framework for data augmentation. Through this framework, we establish that the expected risk of the shifted population is the sum of the original population risk and a gap term, which can be interpreted as a consistency regularization term. The paper also provides a theoretical understanding of this gap, highlighting its negative effects on the early stages of training. We also propose a method to mitigate these effects. To validate our approach, we conducted experiments using same data augmentation techniques and computing resources under several scenarios, including standard training, out-of-distribution, and imbalanced classification. The results demonstrate that our methods surpass compared methods under all scenarios in terms of generalization ability and convergence stability. We provide our code implementation at the following link: https://github.com/ydlsfhll/ASPR.
中文摘要:本文提出了一个数学框架,揭示数据增强在扩大样本量和作为一致性正则化项方面的双重作用,并提出了一种在多种场景下超越现有方法的新方法。
English Summary: This paper introduces a mathematical framework revealing data augmentation's dual role in expanding sample size and serving as a consistency regularization term, proposing a method that outperforms existing approaches across multiple scenarios.

Authors:Xiangfei Qiu, Hanyin Cheng, Xingjian Wu, Jilin Hu, Chenjuan Guo, Bin Yang
Title: A Comprehensive Survey of Deep Learning for Multivariate Time Series Forecasting: A Channel Strategy Perspective
Abstract:
Multivariate Time Series Forecasting (MTSF) plays a crucial role across diverse fields, ranging from economic, energy, to traffic. In recent years, deep learning has demonstrated outstanding performance in MTSF tasks. In MTSF, modeling the correlations among different channels is critical, as leveraging information from other related channels can significantly improve the prediction accuracy of a specific channel. This study systematically reviews the channel modeling strategies for time series and proposes a taxonomy organized into three hierarchical levels: the strategy perspective, the mechanism perspective, and the characteristic perspective. On this basis, we provide a structured analysis of these methods and conduct an in-depth examination of the advantages and limitations of different channel strategies. Finally, we summarize and discuss some future research directions to provide useful research guidance. Moreover, we maintain an up-to-date Github repository (https://github.com/decisionintelligence/CS4TS) which includes all the papers discussed in the survey.
中文: 本研究系统回顾了多元时间序列预测中的通道建模策略,提出三层分类框架并分析不同方法的优劣,同时总结了未来研究方向并维护了相关GitHub资源库。
English: This study systematically reviews and categorizes channel modeling strategies in multivariate time series forecasting into three hierarchical perspectives, analyzing their advantages and limitations while outlining future research directions and maintaining a relevant GitHub repository.

Authors:Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang
Title: An Empirical Analysis of Uncertainty in Large Language Model Evaluations
Abstract:
As LLM-as-a-Judge emerges as a new paradigm for assessing large language models (LLMs), concerns have been raised regarding the alignment, bias, and stability of LLM evaluators. While substantial work has focused on alignment and bias, little research has concentrated on the stability of LLM evaluators. In this paper, we conduct extensive experiments involving 9 widely used LLM evaluators across 2 different evaluation settings to investigate the uncertainty in model-based LLM evaluations. We pinpoint that LLM evaluators exhibit varying uncertainty based on model families and sizes. With careful comparative analyses, we find that employing special prompting strategies, whether during inference or post-training, can alleviate evaluation uncertainty to some extent. By utilizing uncertainty to enhance LLM's reliability and detection capability in Out-Of-Distribution (OOD) data, we further fine-tune an uncertainty-aware LLM evaluator named ConfiLM using a human-annotated fine-tuning set and assess ConfiLM's OOD evaluation ability on a manually designed test set sourced from the 2024 Olympics. Experimental results demonstrate that incorporating uncertainty as additional information during the fine-tuning phase can largely improve the model's evaluation performance in OOD scenarios. The code and data are released at: https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty.
中文: 本研究探讨了大型语言模型评估者的不确定性,发现模型系列和规模影响稳定性,并提出ConfiLM,一种通过不确定性信息微调的感知不确定性的评估器,以提升在分布外场景下的评估性能。
English: This study investigates the uncertainty in LLM evaluators, finding that model families and sizes affect stability, and proposes ConfiLM, an uncertainty-aware evaluator fine-tuned with uncertainty information to improve performance in out-of-distribution scenarios.

Authors:Zirui Song, Bin Yan, Yuhan Liu, Miao Fang, Mingzhe Li, Rui Yan, Xiuying Chen
Title: Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey
Abstract:
Large Language Models (LLMs) have demonstrated remarkable success in various tasks such as natural language understanding, text summarization, and machine translation. However, their general-purpose nature often limits their effectiveness in domain-specific applications that require specialized knowledge, such as healthcare, chemistry, or legal analysis. To address this, researchers have explored diverse methods to enhance LLMs by integrating domain-specific knowledge. In this survey, we provide a comprehensive overview of these methods, which we categorize into four key approaches: dynamic knowledge injection, static knowledge embedding, modular adapters, and prompt optimization. Each approach offers unique mechanisms to equip LLMs with domain expertise, balancing trade-offs between flexibility, scalability, and efficiency. We discuss how these methods enable LLMs to tackle specialized tasks, compare their advantages and disadvantages, evaluate domain-specific LLMs against general LLMs, and highlight the challenges and opportunities in this emerging field. For those interested in delving deeper into this area, we also summarize the commonly used datasets and benchmarks. To keep researchers updated on the latest studies, we maintain an open-source at: https://github.com/abilliyb/Knowledge_Injection_Survey_Papers, dedicated to documenting research in the field of specialized LLM.
中文摘要:大语言模型在通用任务中表现出色,但在专业领域应用中受限,因此研究者探索了动态知识注入、静态知识嵌入、模块适配器和提示优化四种核心方法,以增强其领域专业知识,同时权衡灵活性、可扩展性和效率。
English Summary: Large Language Models excel in general tasks but struggle with domain-specific applications, prompting researchers to develop four key methods—dynamic knowledge injection, static knowledge embedding, modular adapters, and prompt optimization—to enhance their specialized knowledge while balancing flexibility, scalability, and efficiency.

Authors:Jiarui Jin, Haoyu Wang, Hongyan Li, Jun Li, Jiahui Pan, Shenda Hong
Title: Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model
Abstract:
Electrocardiogram (ECG) is essential for the clinical diagnosis of arrhythmias and other heart diseases, but deep learning methods based on ECG often face limitations due to the need for high-quality annotations. Although previous ECG self-supervised learning (eSSL) methods have made significant progress in representation learning from unannotated ECG data, they typically treat ECG signals as ordinary time-series data, segmenting the signals using fixed-size and fixed-step time windows, which often ignore the form and rhythm characteristics and latent semantic relationships in ECG signals. In this work, we introduce a novel perspective on ECG signals, treating heartbeats as words and rhythms as sentences. Based on this perspective, we first designed the QRS-Tokenizer, which generates semantically meaningful ECG sentences from the raw ECG signals. Building on these, we then propose HeartLang, a novel self-supervised learning framework for ECG language processing, learning general representations at form and rhythm levels. Additionally, we construct the largest heartbeat-based ECG vocabulary to date, which will further advance the development of ECG language processing. We evaluated HeartLang across six public ECG datasets, where it demonstrated robust competitiveness against other eSSL methods. Our data and code are publicly available at https://github.com/PKUDigitalHealth/HeartLang.
中文: 本研究提出HeartLang框架,将心搏视为单词、节律视为句子进行心电语言自监督学习,在六个数据集上表现优异,解决了以往方法忽视心电信号形态节律特征的问题。
English: This study introduces HeartLang, a self-supervised learning framework that treats heartbeats as words and rhythms as sentences to learn ECG representations, demonstrating strong performance across six datasets while addressing limitations of prior methods that overlooked ECG-specific characteristics.

Authors:Mingyang Zhao, Gaofeng Meng, Dong-Ming Yan
Title: Occlusion-aware Non-Rigid Point Cloud Registration via Unsupervised Neural Deformation Correntropy
Abstract:
Non-rigid alignment of point clouds is crucial for scene understanding, reconstruction, and various computer vision and robotics tasks. Recent advancements in implicit deformation networks for non-rigid registration have significantly reduced the reliance on large amounts of annotated training data. However, existing state-of-the-art methods still face challenges in handling occlusion scenarios. To address this issue, this paper introduces an innovative unsupervised method called Occlusion-Aware Registration (OAR) for non-rigidly aligning point clouds. The key innovation of our method lies in the utilization of the adaptive correntropy function as a localized similarity measure, enabling us to treat individual points distinctly. In contrast to previous approaches that solely minimize overall deviations between two shapes, we combine unsupervised implicit neural representations with the maximum correntropy criterion to optimize the deformation of unoccluded regions. This effectively avoids collapsed, tearing, and other physically implausible results. Moreover, we present a theoretical analysis and establish the relationship between the maximum correntropy criterion and the commonly used Chamfer distance, highlighting that the correntropy-induced metric can be served as a more universal measure for point cloud analysis. Additionally, we introduce locally linear reconstruction to ensure that regions lacking correspondences between shapes still undergo physically natural deformations. Our method achieves superior or competitive performance compared to existing approaches, particularly when dealing with occluded geometries. We also demonstrate the versatility of our method in challenging tasks such as large deformations, shape interpolation, and shape completion under occlusion disturbances.
中文: 本文提出了一种无监督的遮挡感知配准方法,通过自适应熵和隐式神经表示有效处理非刚性点云对齐中的遮挡问题,在复杂场景下实现了优越性能。
English: This paper introduces an unsupervised Occlusion-Aware Registration (OAR) method that employs adaptive correntropy and implicit neural representations to effectively handle occlusions in non-rigid point cloud alignment, achieving superior performance in challenging scenarios.

Authors:Haiquan Qiu, You Wu, Dong Li, Jianmin Guo, Quanming Yao
Title: Superpose Task-specific Features for Model Merging
Abstract:
Model merging enables powerful capabilities in neural networks without requiring additional training. In this paper, we introduce a novel perspective on model merging by leveraging the fundamental mechanisms of neural network representation. Our approach is motivated by the linear representation hypothesis, which states that neural networks encode information through linear combinations of feature vectors. We propose a method that superposes task-specific features from individual models into a merged model. Our approach specifically targets linear transformation matrices, which are crucial for feature activation and extraction in deep networks. By formulating the merging process as a linear system, we can preserve task-specific features from individual models and create merged models that effectively maintain multi-task capabilities compared to existing methods. Extensive experiments across diverse benchmarks and models demonstrate that our method outperforms existing techniques. Code is available at https://github.com/LARS-research/STF.
中文: 本文提出一种新颖的模型融合方法,基于线性表示假说,通过针对变换矩阵叠加任务特定特征,在多个基准测试中相比现有技术实现了更优的多任务性能。
English: This paper presents a novel model merging method that leverages the linear representation hypothesis to superpose task-specific features by targeting transformation matrices, achieving superior multi-task performance across benchmarks compared to existing techniques.

Authors:Ahmad Chaddad, Yihang Wu, Yuchen Jiang, Ahmed Bouridane, Christian Desrosiers
Title: Simulations of Common Unsupervised Domain Adaptation Algorithms for Image Classification
Abstract:
Traditional machine learning assumes that training and test sets are derived from the same distribution; however, this assumption does not always hold in practical applications. This distribution disparity can lead to severe performance drops when the trained model is used in new data sets. Domain adaptation (DA) is a machine learning technique that aims to address this problem by reducing the differences between domains. This paper presents simulation-based algorithms of recent DA techniques, mainly related to unsupervised domain adaptation (UDA), where labels are available only in the source domain. Our study compares these techniques with public data sets and diverse characteristics, highlighting their respective strengths and drawbacks. For example, Safe Self-Refinement for Transformer-based DA (SSRT) achieved the highest accuracy (91.6\%) in the office-31 data set during our simulations, however, the accuracy dropped to 72.4\% in the Office-Home data set when using limited batch sizes. In addition to improving the reader's comprehension of recent techniques in DA, our study also highlights challenges and upcoming directions for research in this domain. The codes are available at https://github.com/AIPMLab/Domain_Adaptation.
中文摘要:本文通过仿真比较了最新的无监督领域自适应技术,揭示了它们在不同数据集上的性能差异,并指出了未来研究面临的挑战。
English Summary: This paper compares recent unsupervised domain adaptation techniques through simulations, highlighting their varying performance across datasets and identifying challenges for future research.

Authors:Muhammad Ashad Kabir, Nidita Roy, Md. Ekramul Hossain, Jill Featherston, Sayed Ahmed
Title: Deep Learning for Wound Tissue Segmentation: A Comprehensive Evaluation using A Novel Dataset
Abstract:
Deep learning (DL) techniques have emerged as promising solutions for medical wound tissue segmentation. However, a notable limitation in this field is the lack of publicly available labelled datasets and a standardised performance evaluation of state-of-the-art DL models on such datasets. This study addresses this gap by comprehensively evaluating various DL models for wound tissue segmentation using a novel dataset. We have curated a dataset comprising 147 wound images exhibiting six tissue types: slough, granulation, maceration, necrosis, bone, and tendon. The dataset was meticulously labelled for semantic segmentation employing supervised machine learning techniques. Three distinct labelling formats were developed -- full image, patch, and superpixel. Our investigation encompassed a wide array of DL segmentation and classification methodologies, ranging from conventional approaches like UNet, to generative adversarial networks such as cGAN, and modified techniques like FPN+VGG16. Also, we explored DL-based classification methods (e.g., ResNet50) and machine learning-based classification leveraging DL features (e.g., AlexNet+RF). In total, 82 wound tissue segmentation models were derived across the three labelling formats. Our analysis yielded several notable findings, including identifying optimal DL models for each labelling format based on weighted average Dice or F1 scores. Notably, FPN+VGG16 emerged as the top-performing DL model for wound tissue segmentation, achieving a dice score of 82.25%. This study provides a valuable benchmark for evaluating wound image segmentation and classification models, offering insights to inform future research and clinical practice in wound care. The labelled dataset created in this study is available at https://github.com/akabircs/WoundTissue.
中文摘要:本研究利用包含六种组织类型的147张伤口图像新数据集,评估了多种深度学习模型在伤口组织分割中的表现,确定FPN+VGG16为最佳模型(Dice得分82.25%),为未来伤口护理研究提供了重要基准。
English Summary: This study evaluates various deep learning models for wound tissue segmentation using a novel dataset of 147 images with six tissue types, identifying FPN+VGG16 as the top-performing model with an 82.25% dice score and providing a benchmark for future wound care research.

Authors:Kaiwen Shi, Yifei Li, Binh Ho, Jovian Wang, Kobe Guo
Title: Universal Lesion Segmentation Challenge 2023: A Comparative Research of Different Algorithms
Abstract:
In recent years, machine learning algorithms have achieved much success in segmenting lesions across various tissues. There is, however, not one satisfying model that works well on all tissue types universally. In response to this need, we attempt to train a model that 1) works well on all tissue types, and 2) is capable of still performing fast inferences. To this end, we design our architectures, test multiple existing architectures, compare their results, and settle upon SwinUnet. We document our rationales, successes, and failures. Finally, we propose some further directions that we think are worth exploring. codes: https://github.com/KWFredShi/ULS2023NGKD.git
中文: 研究人员利用SwinUnet架构开发了一种通用病灶分割模型,该模型能在所有组织类型上有效工作并保持快速推理能力,同时公布了研究结果和代码以供进一步探索。
English: Researchers have developed a universal lesion segmentation model using SwinUnet architecture that performs effectively across all tissue types while maintaining fast inference speeds, with findings and code shared for further exploration.

Authors:Aditya Dey, Jonas Kusch, Fadi Al Machot
Title: HADL Framework for Noise Resilient Long-Term Time Series Forecasting
Abstract:
Long-term time series forecasting is critical in domains such as finance, economics, and energy, where accurate and reliable predictions over extended horizons drive strategic decision-making. Despite the progress in machine learning-based models, the impact of temporal noise in extended lookback windows remains underexplored, often degrading model performance and computational efficiency. In this paper, we propose a novel framework that addresses these challenges by integrating the Discrete Wavelet Transform (DWT) and Discrete Cosine Transform (DCT) to perform noise reduction and extract robust long-term features. These transformations enable the separation of meaningful temporal patterns from noise in both the time and frequency domains. To complement this, we introduce a lightweight low-rank linear prediction layer that not only reduces the influence of residual noise but also improves memory efficiency. Our approach demonstrates competitive robustness to noisy input, significantly reduces computational complexity, and achieves competitive or state-of-the-art forecasting performance across diverse benchmark datasets. Extensive experiments reveal that the proposed framework is particularly effective in scenarios with high noise levels or irregular patterns, making it well suited for real-world forecasting tasks. The code is available in https://github.com/forgee-master/HADL.
中文: 本文提出了一种结合离散小波变换和离散余弦变换的新框架,用于长期时间序列预测中的噪声消除和特征提取,该框架增强了对噪声输入的鲁棒性和计算效率,并在多个基准数据集上实现了具有竞争力的预测性能。
English: This paper introduces a novel framework combining Discrete Wavelet and Cosine Transforms for noise reduction and feature extraction in long-term time series forecasting, which enhances robustness to noisy inputs and computational efficiency while achieving competitive performance across benchmarks.

Authors:Kevin Garcia, Juan Manuel Perez, Yifeng Gao
Title: Efficient Hierarchical Contrastive Self-supervising Learning for Time Series Classification via Importance-aware Resolution Selection
Abstract:
Recently, there has been a significant advancement in designing Self-Supervised Learning (SSL) frameworks for time series data to reduce the dependency on data labels. Among these works, hierarchical contrastive learning-based SSL frameworks, which learn representations by contrasting data embeddings at multiple resolutions, have gained considerable attention. Due to their ability to gather more information, they exhibit better generalization in various downstream tasks. However, when the time series data length is significant long, the computational cost is often significantly higher than that of other SSL frameworks. In this paper, to address this challenge, we propose an efficient way to train hierarchical contrastive learning models. Inspired by the fact that each resolution's data embedding is highly dependent, we introduce importance-aware resolution selection based training framework to reduce the computational cost. In the experiment, we demonstrate that the proposed method significantly improves training time while preserving the original model's integrity in extensive time series classification performance evaluations. Our code could be found here, https://github.com/KEEBVIN/IARS
中文: 本文提出了一种高效的层次对比学习训练方法,通过重要性感知的分辨率选择来降低计算成本,同时保持时间序列分类性能。
English: This paper introduces an efficient training method for hierarchical contrastive learning in self-supervised time series analysis, using importance-aware resolution selection to reduce computational costs while maintaining model performance.

Authors:Sifan Tu, Xin Zhou, Dingkang Liang, Xingyu Jiang, Yumeng Zhang, Xiaofan Li, Xiang Bai
Title: The Role of World Models in Shaping Autonomous Driving: A Comprehensive Survey
Abstract:
Driving World Model (DWM), which focuses on predicting scene evolution during the driving process, has emerged as a promising paradigm in pursuing autonomous driving. These methods enable autonomous driving systems to better perceive, understand, and interact with dynamic driving environments. In this survey, we provide a comprehensive overview of the latest progress in DWM. We categorize existing approaches based on the modalities of the predicted scenes and summarize their specific contributions to autonomous driving. In addition, high-impact datasets and various metrics tailored to different tasks within the scope of DWM research are reviewed. Finally, we discuss the potential limitations of current research and propose future directions. This survey provides valuable insights into the development and application of DWM, fostering its broader adoption in autonomous driving. The relevant papers are collected at https://github.com/LMD0311/Awesome-World-Model.
中文: 驾驶世界模型(DWM)作为自动驾驶领域的新范式,通过预测场景演变来提升系统对动态环境的感知与交互能力,本综述系统梳理了其研究进展、数据集、评估指标及未来发展方向。
English: The Driving World Model (DWM) is a promising autonomous driving paradigm that predicts scene evolution, enabling systems to better perceive and interact with dynamic environments, with this survey comprehensively reviewing its progress, datasets, metrics, and future directions.

Authors:Minyang Chen, Chenchen Feng, and Ran Cheng
Title: MetaDE: Evolving Differential Evolution by Differential Evolution
Abstract:
As a cornerstone in the Evolutionary Computation (EC) domain, Differential Evolution (DE) is known for its simplicity and effectiveness in handling challenging black-box optimization problems. While the advantages of DE are well-recognized, achieving peak performance heavily depends on its hyperparameters such as the mutation factor, crossover probability, and the selection of specific DE strategies. Traditional approaches to this hyperparameter dilemma have leaned towards parameter tuning or adaptive mechanisms. However, identifying the optimal settings tailored for specific problems remains a persistent challenge. In response, we introduce MetaDE, an approach that evolves DE's intrinsic hyperparameters and strategies using DE itself at a meta-level. A pivotal aspect of MetaDE is a specialized parameterization technique, which endows it with the capability to dynamically modify DE's parameters and strategies throughout the evolutionary process. To augment computational efficiency, MetaDE incorporates a design that leverages parallel processing through a GPU-accelerated computing framework. Within such a framework, DE is not just a solver but also an optimizer for its own configurations, thus streamlining the process of hyperparameter optimization and problem-solving into a cohesive and automated workflow. Extensive evaluations on the CEC2022 benchmark suite demonstrate MetaDE's promising performance. Moreover, when applied to robot control via evolutionary reinforcement learning, MetaDE also demonstrates promising performance. The source code of MetaDE is publicly accessible at: https://github.com/EMI-Group/metade.
中文摘要:MetaDE 是一种元层面的方法,利用差分进化算法动态优化其自身的超参数和策略,通过GPU加速并行处理提升性能与效率,在CEC2022基准测试和机器人控制应用中均表现出优异效果。
English Summary: MetaDE is a meta-level approach that uses Differential Evolution to evolve its own hyperparameters and strategies dynamically, enhancing performance and efficiency through GPU-accelerated parallel processing, as validated on the CEC2022 benchmark and in robot control applications.

Authors:Zheng Fang, Lichuan Xiang, Xu Cai, Kaicheng Zhou, Hongkai Wen
Title: FlexControl: Computation-Aware ControlNet with Differentiable Router for Text-to-Image Generation
Abstract:
ControlNet offers a powerful way to guide diffusion-based generative models, yet most implementations rely on ad-hoc heuristics to choose which network blocks to control-an approach that varies unpredictably with different tasks. To address this gap, we propose FlexControl, a novel framework that copies all diffusion blocks during training and employs a trainable gating mechanism to dynamically select which blocks to activate at each denoising step. With introducing a computation-aware loss, we can encourage control blocks only to activate when it benefit the generation quality. By eliminating manual block selection, FlexControl enhances adaptability across diverse tasks and streamlines the design pipeline, with computation-aware training loss in an end-to-end training manner. Through comprehensive experiments on both UNet (e.g., SD1.5) and DiT (e.g., SD3.0), we show that our method outperforms existing ControlNet variants in certain key aspects of interest. As evidenced by both quantitative and qualitative evaluations, FlexControl preserves or enhances image fidelity while also reducing computational overhead by selectively activating the most relevant blocks. These results underscore the potential of a flexible, data-driven approach for controlled diffusion and open new avenues for efficient generative model design. The code will soon be available at https://github.com/Anonymousuuser/FlexControl.
中文:FlexControl 通过可训练门控机制动态选择去噪过程中的扩散模块,无需人工干预,在提升任务适应性的同时优化了计算效率。
English: FlexControl introduces a trainable gating mechanism to dynamically select diffusion blocks during denoising, eliminating manual selection and improving both adaptability and computational efficiency across tasks.

Authors:Libo Wang
Title: Dynamic Chain-of-Thought: Towards Adaptive Deep Reasoning
Abstract:
To reduce the cost and consumption of computing resources caused by computational redundancy and delayed reward assignment in long CoT, this research proposes the dynamic chain-of-thought (D-CoT) with adaptive reasoning time and steps. The researcher used simulation experiment to simulate the integration of D-CoT through Python 3.13 IDLE combined with a Python simulator based on GPTs. At the same time, the researcher used DeepSeek R1 as a control group to test and compare the performance of the D-CoT simulator in processing MIT OpenCourseWare's linear algebra exam questions. Experimental results show that D-CoT is better than DeepSeek R1 based on long CoT in three indicators: reasoning time, CoT length (reasoning steps) and token count, which achieves a significant reduction in computing resource consumption. In addition, this research has potential value in deep reasoning optimization that is used as a reference for future dynamic deep reasoning frameworks.
中文: 本研究提出动态思维链(D-CoT),通过自适应推理步骤和时间降低计算成本,在线性代数问题处理中相比DeepSeek R1展现出更优的推理效率与资源节约潜力。
English: This research introduces Dynamic Chain-of-Thought (D-CoT) to reduce computational costs by optimizing reasoning steps and time, demonstrating superior efficiency over DeepSeek R1 in processing linear algebra questions with significant resource savings.

Authors:Wenxuan Guo, Xiuwei Xu, Ziwei Wang, Jianjiang Feng, Jie Zhou, Jiwen Lu
Title: TSP3D: Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding
Abstract:
In this paper, we propose an efficient multi-level convolution architecture for 3D visual grounding. Conventional methods are difficult to meet the requirements of real-time inference due to the two-stage or point-based architecture. Inspired by the success of multi-level fully sparse convolutional architecture in 3D object detection, we aim to build a new 3D visual grounding framework following this technical route. However, as in 3D visual grounding task the 3D scene representation should be deeply interacted with text features, sparse convolution-based architecture is inefficient for this interaction due to the large amount of voxel features. To this end, we propose text-guided pruning (TGP) and completion-based addition (CBA) to deeply fuse 3D scene representation and text features in an efficient way by gradual region pruning and target completion. Specifically, TGP iteratively sparsifies the 3D scene representation and thus efficiently interacts the voxel features with text features by cross-attention. To mitigate the affect of pruning on delicate geometric information, CBA adaptively fixes the over-pruned region by voxel completion with negligible computational overhead. Compared with previous single-stage methods, our method achieves top inference speed and surpasses previous fastest method by 100\% FPS. Our method also achieves state-of-the-art accuracy even compared with two-stage methods, with $+1.13$ lead of Acc@0.5 on ScanRefer, and $+2.6$ and $+3.2$ leads on NR3D and SR3D respectively. The code is available at \href{https://github.com/GWxuan/TSP3D}{https://github.com/GWxuan/TSP3D}.
中文: 本文提出了一种高效的多级卷积架构用于三维视觉定位,通过文本引导剪枝和基于补全的添加方法,实现了最优的推理速度和精度。
English: This paper introduces an efficient multi-level convolution architecture for 3D visual grounding, utilizing text-guided pruning and completion-based addition to achieve state-of-the-art speed and accuracy.

Authors:R. Patrick Xian, Noah R. Baker, Tom David, Qiming Cui, A. Jay Holmgren, Stefan Bauer, Madhumita Sushil, Reza Abbasi-Asl
Title: Robustness tests for biomedical foundation models should tailor to specifications
Abstract:
The rise of biomedical foundation models creates new hurdles in model testing and authorization, given their broad capabilities and susceptibility to complex distribution shifts. We suggest tailoring robustness tests according to task-dependent priorities and propose to integrate granular notions of robustness in a predefined specification to guide implementation. Our approach facilitates the standardization of robustness assessments in the model lifecycle and connects abstract AI regulatory frameworks with concrete testing procedures.
中文: 随着生物医学基础模型的广泛应用,我们建议依据任务优先级定制稳健性测试,并将细化的稳健性概念纳入预定义规范,以连接抽象的人工智能监管框架与具体测试流程,实现模型生命周期中评估的标准化。
English: The increasing use of biomedical foundation models necessitates customized robustness tests aligned with task-specific priorities and detailed specifications to bridge regulatory frameworks with practical testing, standardizing assessments throughout the model lifecycle.

Authors:R. Patrick Xian, Noah R. Baker, Tom David, Qiming Cui, A. Jay Holmgren, Stefan Bauer, Madhumita Sushil, Reza Abbasi-Asl
Title: Robustness tests for biomedical foundation models should tailor to specifications
Abstract:
The rise of biomedical foundation models creates new hurdles in model testing and authorization, given their broad capabilities and susceptibility to complex distribution shifts. We suggest tailoring robustness tests according to task-dependent priorities and propose to integrate granular notions of robustness in a predefined specification to guide implementation. Our approach facilitates the standardization of robustness assessments in the model lifecycle and connects abstract AI regulatory frameworks with concrete testing procedures.
中文: 随着生物医学基础模型的广泛应用,我们建议依据任务优先级定制稳健性测试,并将细化的稳健性概念纳入预定义规范,以连接抽象的人工智能监管框架与具体测试流程,实现模型生命周期中评估的标准化。
English: The increasing use of biomedical foundation models necessitates customized robustness tests aligned with task-specific priorities and detailed specifications to bridge regulatory frameworks with practical testing, standardizing assessments throughout the model lifecycle.

Authors:Yu-Ang Lee, Ching-Yun Ko, Tejaswini Pedapati, I-Hsin Chung, Mi-Yen Yeh, Pin-Yu Chen
Title: STAR: Spectral Truncation and Rescale for Model Merging
Abstract:
Model merging is an efficient way of obtaining a multi-task model from several pretrained models without further fine-tuning, and it has gained attention in various domains, including natural language processing (NLP). Despite the efficiency, a key challenge in model merging is the seemingly inevitable decrease in task performance as the number of models increases. In this paper, we propose $\mathbf{S}$pectral $\mathbf{T}$runcation $\mathbf{A}$nd $\mathbf{R}$escale (STAR) that aims at mitigating ``merging conflicts'' by truncating small components in the respective spectral spaces, which is followed by an automatic parameter rescaling scheme to retain the nuclear norm of the original matrix. STAR requires no additional inference on original training data and is robust to hyperparamater choice. We demonstrate the effectiveness of STAR through extensive model merging cases on diverse NLP tasks. Specifically, STAR works robustly across varying model sizes, and can outperform baselines by 4.2$\%$ when merging 12 models on Flan-T5. Our code is publicly available at https://github.com/IBM/STAR.
中文: 本文提出的STAR方法通过截断谱空间分量和自动参数重缩放来缓解模型合并中的性能下降问题,在多种自然语言处理任务中无需额外数据即实现稳定性能提升。
English: The paper introduces STAR, a method that reduces performance loss in model merging by truncating spectral components and rescaling parameters, showing robust improvements across NLP tasks without needing extra data or fine-tuning.

Authors:Sanjiban Choudhury
Title: Process Reward Models for LLM Agents: Practical Framework and Directions
Abstract:
We introduce Agent Process Reward Models (AgentPRM), a simple and scalable framework for training LLM agents to continually improve through interactions. AgentPRM follows a lightweight actor-critic paradigm, using Monte Carlo rollouts to compute reward targets and optimize policies. It requires minimal modifications to existing RLHF pipelines, making it easy to integrate at scale. Beyond AgentPRM, we propose InversePRM, which learns process rewards directly from demonstrations without explicit outcome supervision. We also explore key challenges and opportunities, including exploration, process reward shaping, and model-predictive reasoning. We evaluate on ALFWorld benchmark, show that small 3B models trained with AgentPRM and InversePRM outperform strong GPT-4o baselines, and analyze test-time scaling, reward hacking, and more. Our code is available at: https://github.com/sanjibanc/agent_prm.
中文: 我们提出了AgentPRM框架,通过轻量级演员-评论家方法和蒙特卡洛推演来训练LLM智能体持续优化,其变体InversePRM可从演示中学习过程奖励,两者在ALFWorld基准测试中均超越了GPT-4o基线模型。
English: We propose AgentPRM, a scalable framework for training LLM agents to improve continuously via actor-critic methods and Monte Carlo rollouts, requiring minimal RLHF pipeline changes, with its variant InversePRM learning rewards from demonstrations and both outperforming GPT-4o on ALFWorld benchmarks.

Authors:Lauri Seppäläinen, Mudong Guo, Kai Puolamäki
Title: ExplainReduce: Summarising local explanations via proxies
Abstract:
Most commonly used non-linear machine learning methods are closed-box models, uninterpretable to humans. The field of explainable artificial intelligence (XAI) aims to develop tools to examine the inner workings of these closed boxes. An often-used model-agnostic approach to XAI involves using simple models as local approximations to produce so-called local explanations; examples of this approach include LIME, SHAP, and SLISEMAP. This paper shows how a large set of local explanations can be reduced to a small "proxy set" of simple models, which can act as a generative global explanation. This reduction procedure, ExplainReduce, can be formulated as an optimisation problem and approximated efficiently using greedy heuristics.
中文:ExplainReduce方法通过优化算法将大量局部解释简化为一个精简的代理简单模型集合,从而为复杂机器学习系统提供可生成的全局解释。
English: ExplainReduce is a method that compiles numerous local explanations into a concise proxy set of simple models, offering a global understanding of complex machine learning systems through efficient optimization.

Authors:Thien B. Nguyen-Tat, Hoang-An Vo, Phuoc-Sang Dang
Title: QMaxViT-Unet+: A Query-Based MaxViT-Unet with Edge Enhancement for Scribble-Supervised Segmentation of Medical Images
Abstract:
The deployment of advanced deep learning models for medical image segmentation is often constrained by the requirement for extensively annotated datasets. Weakly-supervised learning, which allows less precise labels, has become a promising solution to this challenge. Building on this approach, we propose QMaxViT-Unet+, a novel framework for scribble-supervised medical image segmentation. This framework is built on the U-Net architecture, with the encoder and decoder replaced by Multi-Axis Vision Transformer (MaxViT) blocks. These blocks enhance the model's ability to learn local and global features efficiently. Additionally, our approach integrates a query-based Transformer decoder to refine features and an edge enhancement module to compensate for the limited boundary information in the scribble label. We evaluate the proposed QMaxViT-Unet+ on four public datasets focused on cardiac structures, colorectal polyps, and breast cancer: ACDC, MS-CMRSeg, SUN-SEG, and BUSI. Evaluation metrics include the Dice similarity coefficient (DSC) and the 95th percentile of Hausdorff distance (HD95). Experimental results show that QMaxViT-Unet+ achieves 89.1\% DSC and 1.316mm HD95 on ACDC, 88.4\% DSC and 2.226mm HD95 on MS-CMRSeg, 71.4\% DSC and 4.996mm HD95 on SUN-SEG, and 69.4\% DSC and 50.122mm HD95 on BUSI. These results demonstrate that our method outperforms existing approaches in terms of accuracy, robustness, and efficiency while remaining competitive with fully-supervised learning approaches. This makes it ideal for medical image analysis, where high-quality annotations are often scarce and require significant effort and expense. The code is available at: https://github.com/anpc849/QMaxViT-Unet
中文: QMaxViT-Unet+框架提出了一种基于涂鸦标注的弱监督医学图像分割新方法,通过多轴视觉Transformer模块与查询式解码器的结合,在四个公共数据集上实现了优于现有方法的精确度和鲁棒性。
English: The QMaxViT-Unet+ framework introduces a novel weakly-supervised approach for medical image segmentation, combining Multi-Axis Vision Transformer blocks with query-based decoding and edge enhancement to achieve superior accuracy and efficiency across multiple clinical datasets.

Authors:Aivin V. Solatorio, Rafael Macalaba, James Liounis
Title: Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers
Abstract:
Tracking how data is mentioned and used in research papers provides critical insights for improving data discoverability, quality, and production. However, manually identifying and classifying dataset mentions across vast academic literature is resource-intensive and not scalable. This paper presents a machine learning framework that automates dataset mention detection across research domains by leveraging large language models (LLMs), synthetic data, and a two-stage fine-tuning process. We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset. The Phi-3.5-mini instruct model is pre-fine-tuned on this dataset, followed by fine-tuning on a manually annotated subset. At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall. Evaluated on a held-out manually annotated sample, our fine-tuned model outperforms NuExtract-v1.5 and GLiNER-large-v2.1 in dataset extraction accuracy. Our results highlight how LLM-generated synthetic data can effectively address training data scarcity, improving generalization in low-resource settings. This framework offers a pathway toward scalable monitoring of dataset usage, enhancing transparency, and supporting researchers, funders, and policymakers in identifying data gaps and strengthening data accessibility for informed decision-making.
中文: 本文提出一种机器学习框架,利用大语言模型和合成数据自动识别研究论文中的数据集引用,其性能优于现有方法,可提升数据可发现性以支持科学决策。
English: This paper introduces a machine learning framework that automates dataset mention detection in research papers using large language models and synthetic data, outperforming existing methods and enhancing data discoverability for better decision-making.

Authors:Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Heng Wang, Hongcheng Guo, Huilin Xiong, Huixin Xiong, Jiahao Gong, Jianchang Wu, Jiaoren Wu, Jie Wu, Jie Yang, Jiashuai Liu, Jiashuo Li, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Lei Liu, Lei Xia, Liang Zhao, Liguo Tan, Liwen Huang, Liying Shi, Ming Li, Mingliang Li, Muhua Cheng, Na Wang, Qiaohui Chen, Qinglin He, Qiuyan Liang, Quan Sun, Ran Sun, Rui Wang, Shaoliang Pang, Shiliang Yang, Sitong Liu, Siqi Liu, Shuli Gao, Tiancheng Cao, Tianyu Wang, Weipeng Ming, Wenqing He, Xu Zhao, Xuelin Zhang, Xianfang Zeng, Xiaojia Liu, Xuan Yang, Yaqi Dai, Yanbo Yu, Yang Li, Yineng Deng, Yingming Wang, Yilei Wang, Yuanwei Lu, Yu Chen, Yu Luo, Yuchu Luo, Yuhe Yin, Yuheng Feng, Yuxiang Yang, Zecheng Tang, Zekai Zhang, Zidong Yang, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang
Title: Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Abstract:
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
中文: Step-Video-T2V是一个拥有300亿参数的最先进文本到视频模型,通过深度压缩变分自编码器和三维全注意力扩散变换器,能生成长达204帧的双语高质量视频,在性能评估中展现出业界领先水平。
English: Step-Video-T2V is a 30B-parameter text-to-video model that generates high-quality 204-frame videos using advanced compression and denoising techniques, achieving state-of-the-art performance in bilingual video synthesis.

Authors:Abdelhakim Benechehab, Vasilii Feofanov, Giuseppe Paolo, Albert Thomas, Maurizio Filippone, Balázs Kégl
Title: AdaPTS: Adapting Univariate Foundation Models to Probabilistic Multivariate Time Series Forecasting
Abstract:
Pre-trained foundation models (FMs) have shown exceptional performance in univariate time series forecasting tasks. However, several practical challenges persist, including managing intricate dependencies among features and quantifying uncertainty in predictions. This study aims to tackle these critical limitations by introducing adapters; feature-space transformations that facilitate the effective use of pre-trained univariate time series FMs for multivariate tasks. Adapters operate by projecting multivariate inputs into a suitable latent space and applying the FM independently to each dimension. Inspired by the literature on representation learning and partially stochastic Bayesian neural networks, we present a range of adapters and optimization/inference strategies. Experiments conducted on both synthetic and real-world datasets confirm the efficacy of adapters, demonstrating substantial enhancements in forecasting accuracy and uncertainty quantification compared to baseline methods. Our framework, AdaPTS, positions adapters as a modular, scalable, and effective solution for leveraging time series FMs in multivariate contexts, thereby promoting their wider adoption in real-world applications. We release the code at https://github.com/abenechehab/AdaPTS.
中文摘要:本研究引入适配器作为特征空间转换方法,使预训练的单变量时间序列基础模型能够有效处理多变量预测任务,显著提升了预测精度和不确定性量化能力。
English Summary: This study introduces adapters as feature-space transformations to enable pre-trained univariate time series foundation models to handle multivariate forecasting tasks effectively, improving both accuracy and uncertainty quantification.

Authors:Laurin Luttmann, Lin Xie
Title: Learning to Solve the Min-Max Mixed-Shelves Picker-Routing Problem via Hierarchical and Parallel Decoding
Abstract:
The Mixed-Shelves Picker Routing Problem (MSPRP) is a fundamental challenge in warehouse logistics, where pickers must navigate a mixed-shelves environment to retrieve SKUs efficiently. Traditional heuristics and optimization-based approaches struggle with scalability, while recent machine learning methods often rely on sequential decision-making, leading to high solution latency and suboptimal agent coordination. In this work, we propose a novel hierarchical and parallel decoding approach for solving the min-max variant of the MSPRP via multi-agent reinforcement learning. While our approach generates a joint distribution over agent actions, allowing for fast decoding and effective picker coordination, our method introduces a sequential action selection to avoid conflicts in the multi-dimensional action space. Experiments show state-of-the-art performance in both solution quality and inference speed, particularly for large-scale and out-of-distribution instances. Our code is publicly available at http://github.com/LTluttmann/marl4msprp.
中文摘要:本文提出一种基于多智能体强化学习的层次并行解码方法,通过顺序动作选择避免冲突,在解决最小最大混合货架拣选路径问题时实现了最优解质量和推理速度的突破。
English Summary: This paper introduces a hierarchical parallel decoding method using multi-agent reinforcement learning to efficiently solve the min-max Mixed-Shelves Picker Routing Problem, achieving superior solution quality and inference speed through conflict-free sequential action selection.

Authors:Ruslan Agishev, Karel Zimmermann
Title: FusionForce: End-to-end Differentiable Neural-Symbolic Layer for Trajectory Prediction
Abstract:
We propose end-to-end differentiable model that predicts robot trajectories on rough offroad terrain from camera images and/or lidar point clouds. The model integrates a learnable component that predicts robot-terrain interaction forces with a neural-symbolic layer that enforces the laws of classical mechanics and consequently improves generalization on out-of-distribution data. The neural-symbolic layer includes a differentiable physics engine that computes the robot's trajectory by querying these forces at the points of contact with the terrain. As the proposed architecture comprises substantial geometrical and physics priors, the resulting model can also be seen as a learnable physics engine conditioned on real sensor data that delivers $10^4$ trajectories per second. We argue and empirically demonstrate that this architecture reduces the sim-to-real gap and mitigates out-of-distribution sensitivity. The differentiability, in conjunction with the rapid simulation speed, makes the model well-suited for various applications including model predictive control, trajectory shooting, supervised and reinforcement learning, or SLAM.
中文: 本文提出了一种端到端的可微分模型,通过融合可学习的力预测与神经符号层来执行经典力学定律,利用传感器数据预测机器人在崎岖地形上的轨迹,提升泛化能力并实现高速仿真,适用于多种应用场景。
English: This paper introduces an end-to-end differentiable model that predicts robot trajectories on rough terrain using sensor data, integrating learnable force predictions with a neural-symbolic layer to enforce physics laws and enhance generalization, while enabling rapid simulation for diverse applications.

Authors:Saad Ahmed Jamal
Title: Statistical data analysis for Tourism in Poland in R Programming Environment
Abstract:
This study utilises the R programming language for statistical data analysis to understand Tourism dynamics in Poland. It focuses on methods for data visualisation, multivariate statistics, and hypothesis testing. To investigate the expenditure behavior of tourist, spending patterns, correlations, and associations among variables were analysed in the dataset. The results revealed a significant relationship between accommodation type and the purpose of trip, showing that the purpose of a trip impacts the selection of accommodation. A strong correlation was observed between organizer expenditure and private expenditure, indicating that individual spending are more when the spending on organizing the trip are higher. However, no significant difference was observed in total expenditure across different accommodation types and purpose of the trip revealing that travelers tend to spend similar amounts regardless of their reason for travel or choice of accommodation. Although significant relationships were observed among certain variables, ANOVA could not be applied because the dataset was not able to hold on the normality assumption. In future, the dataset can be explored further to find more meaningful insights. The developed code is available on GitHub: https://github.com/SaadAhmedJamal/DataAnalysis RProgEnv.
中文: 本研究利用R语言分析波兰旅游业,发现旅行目的影响住宿选择、个人支出与组织费用相关,但不同旅行类型总支出保持稳定。
English: This study analyzes Polish tourism using R to reveal that trip purpose influences accommodation choice and personal spending correlates with organizer costs, though total expenditures remain consistent across travel types.

Authors:Trevor E. Pogue, Nicola Nicolici
Title: Strassen Multisystolic Array Hardware Architectures
Abstract:
While Strassen's matrix multiplication algorithm reduces the complexity of naive matrix multiplication, general-purpose hardware is not suitable for achieving the algorithm's promised theoretical speedups. This leaves the question of if it could be better exploited in custom hardware architectures designed specifically for executing the algorithm. However, there is limited prior work on this and it is not immediately clear how to derive such architectures or if they can ultimately lead to real improvements. We bridge this gap, presenting and evaluating new systolic array architectures that efficiently translate the theoretical complexity reductions of Strassen's algorithm directly into hardware resource savings. Furthermore, the architectures are multisystolic array designs that can multiply smaller matrices with higher utilization than single-systolic array designs. The proposed designs implemented on FPGA reduce DSP requirements by a factor of $1.14^r$ for $r$ implemented Strassen recursion levels, and otherwise require overall similar soft logic resources when instantiated to support matrix sizes down to 32x32 and 24x24 at 1-2 levels of Strassen recursion, respectively. We evaluate the proposed designs both in isolation and in an end-to-end machine learning accelerator compared to baseline designs and prior works, achieving state-of-the-art performance.
中文: 本研究通过提出多脉动阵列架构,将Strassen算法的理论复杂度降低转化为硬件资源节约,填补了定制硬件领域的空白,并在FPGA实现和机器学习加速器中达到了领先性能。
English: This work bridges the gap in custom hardware for Strassen's algorithm by introducing multisystolic array architectures that translate its theoretical complexity reductions into hardware resource savings, achieving state-of-the-art performance in FPGA implementations and machine learning accelerators.

Authors:Luca Parolari, Andrea Cherubini, Lamberto Ballan, Carlo Biffi
Title: Towards Polyp Counting In Full-Procedure Colonoscopy Videos
Abstract:
Automated colonoscopy reporting holds great potential for enhancing quality control and improving cost-effectiveness of colonoscopy procedures. A major challenge lies in the automated identification, tracking, and re-association (ReID) of polyps tracklets across full-procedure colonoscopy videos. This is essential for precise polyp counting and enables automated computation of key quality metrics, such as Adenoma Detection Rate (ADR) and Polyps Per Colonoscopy (PPC). However, polyp ReID is challenging due to variations in polyp appearance, frequent disappearance from the field of view, and occlusions. In this work, we leverage the REAL-Colon dataset, the first open-access dataset providing full-procedure videos, to define tasks, data splits and metrics for the problem of automatically count polyps in full-procedure videos, establishing an open-access framework. We re-implement previously proposed SimCLR-based methods for learning representations of polyp tracklets, both single-frame and multi-view, and adapt them to the polyp counting task. We then propose an Affinity Propagation-based clustering method to further improve ReID based on these learned representations, ultimately enhancing polyp counting. Our approach achieves state-of-the-art performance, with a polyp fragmentation rate of 6.30 and a false positive rate (FPR) below 5% on the REAL-Colon dataset. We release code at https://github.com/lparolari/towards-polyp-counting.
中文摘要:自动化结肠镜报告通过先进的息肉追踪与重识别方法,在REAL-Colon数据集上实现了低碎片率和误报率的最优性能,从而提升质量控制与成本效益。
English Summary: Automated colonoscopy reporting can enhance quality control and cost-effectiveness by accurately counting polyps through advanced tracking and re-identification methods, achieving state-of-the-art performance with low fragmentation and false positive rates on the REAL-Colon dataset.

Authors:Riccardo Bravin, Massimo Pavan, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Manuel Roveri
Title: EmbBERT-Q: Breaking Memory Barriers in Embedded NLP
Abstract:
Large Language Models (LLMs) have revolutionized natural language processing, setting new standards across a wide range of applications. However, their relevant memory and computational demands make them impractical for deployment on technologically-constrained tiny devices such as wearable devices and Internet-of-Things units. To address this limitation, we introduce EmbBERT-Q, a novel tiny language model specifically designed for tiny devices with stringent memory constraints. EmbBERT-Q achieves state-of-the-art (SotA) accuracy in Natural Language Processing tasks in this scenario, with a total memory footprint (weights and activations) of just 781 kB, representing a 25x reduction in size with respect to SotA models. By combining architectural innovations with hardware-compatible 8-bit quantization, EmbBERT-Q consistently outperforms several baseline models scaled down to a 2 MB memory budget (i.e., the maximum memory typically available in tiny devices), including heavily compressed versions of BERT and MAMBA. Extensive experimental evaluations on both a selected benchmark dataset, TinyNLP, specifically curated to evaluate Tiny Language Models in NLP tasks and real-world scenarios, and the GLUE benchmark, demonstrate EmbBERT-Q ability to deliver competitive accuracy with respect to existing approaches, achieving an unmatched balance between memory and performance. To ensure the complete and immediate reproducibility of all our results, we release all code, scripts, and model checkpoints at https://github.com/RiccardoBravin/tiny-LLM.
中文:EmbBERT-Q是一种专为内存受限微型设备设计的新型微型语言模型,通过架构创新和8位量化技术,仅用781 kB内存占用就实现了最先进的准确率,尺寸缩小了25倍。
English: EmbBERT-Q is a novel tiny language model designed for memory-constrained tiny devices, achieving state-of-the-art accuracy with a 25x size reduction to just 781 kB through architectural innovations and 8-bit quantization.

Authors:Xiaoya Lu, Dongrui Liu, Yi Yu, Luxin Xu, Jing Shao
Title: X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability
Abstract:
Despite the rapid development of safety alignment techniques for LLMs, defending against multi-turn jailbreaks is still a challenging task. In this paper, we conduct a comprehensive comparison, revealing that some existing defense methods can improve the robustness of LLMs against multi-turn jailbreaks but compromise usability, i.e., reducing general capabilities or causing the over-refusal problem. From the perspective of mechanism interpretability of LLMs, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. Therefore, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary to push harmful representations away from boundary-safe representations and obtain an exact distinction boundary. In this way, harmful representations can be precisely erased without disrupting safe ones. Experimental results show that X-Boundary achieves state-of-the-art defense performance against multi-turn jailbreaks, while reducing the over-refusal rate by about 20% and maintaining nearly complete general capability. Furthermore, we theoretically prove and empirically verify that X-Boundary can accelerate the convergence process during training. Please see our code at: https://github.com/AI45Lab/X-Boundary.
中文摘要:本研究提出X-Boundary方法,通过精确区分安全与有害特征表示并仅消除后者,在保持大语言模型通用能力的同时,将过度拒绝率降低约20%,实现了对多轮越狱攻击的最优防御效果。
English Summary: The study introduces X-Boundary, a novel defense method that enhances LLM robustness against multi-turn jailbreaks by precisely distinguishing and erasing harmful representations while preserving usability and reducing over-refusal by approximately 20%.

Authors:Kuan Li, Liwen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Shuai Wang, Minhao Cheng
Title: LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs -- No Silver Bullet for LC or RAG Routing
Abstract:
Effectively incorporating external knowledge into Large Language Models (LLMs) is crucial for enhancing their capabilities and addressing real-world needs. Retrieval-Augmented Generation (RAG) offers an effective method for achieving this by retrieving the most relevant fragments into LLMs. However, the advancements in context window size for LLMs offer an alternative approach, raising the question of whether RAG remains necessary for effectively handling external knowledge. Several existing studies provide inconclusive comparisons between RAG and long-context (LC) LLMs, largely due to limitations in the benchmark designs. In this paper, we present LaRA, a novel benchmark specifically designed to rigorously compare RAG and LC LLMs. LaRA encompasses 2326 test cases across four practical QA task categories and three types of naturally occurring long texts. Through systematic evaluation of seven open-source and four proprietary LLMs, we find that the optimal choice between RAG and LC depends on a complex interplay of factors, including the model's parameter size, long-text capabilities, context length, task type, and the characteristics of the retrieved chunks. Our findings provide actionable guidelines for practitioners to effectively leverage both RAG and LC approaches in developing and deploying LLM applications. Our code and dataset is provided at: \href{https://github.com/Alibaba-NLP/LaRA}{\textbf{https://github.com/Alibaba-NLP/LaRA}}.
中文: LaRA基准测试表明,检索增强生成(RAG)与长上下文大模型的选择取决于模型规模、任务类型等多重因素,为有效整合外部知识提供了实用指导。
English: The LaRA benchmark reveals that the choice between Retrieval-Augmented Generation (RAG) and long-context LLMs depends on multiple factors like model size and task type, offering practical guidance for effectively integrating external knowledge into LLMs.

Authors:Siqi Wu, Yinda Chen, Dong Liu, Zhihai He
Title: Conditional Latent Coding with Learnable Synthesized Reference for Deep Image Compression
Abstract:
In this paper, we study how to synthesize a dynamic reference from an external dictionary to perform conditional coding of the input image in the latent domain and how to learn the conditional latent synthesis and coding modules in an end-to-end manner. Our approach begins by constructing a universal image feature dictionary using a multi-stage approach involving modified spatial pyramid pooling, dimension reduction, and multi-scale feature clustering. For each input image, we learn to synthesize a conditioning latent by selecting and synthesizing relevant features from the dictionary, which significantly enhances the model's capability in capturing and exploring image source correlation. This conditional latent synthesis involves a correlation-based feature matching and alignment strategy, comprising a Conditional Latent Matching (CLM) module and a Conditional Latent Synthesis (CLS) module. The synthesized latent is then used to guide the encoding process, allowing for more efficient compression by exploiting the correlation between the input image and the reference dictionary. According to our theoretical analysis, the proposed conditional latent coding (CLC) method is robust to perturbations in the external dictionary samples and the selected conditioning latent, with an error bound that scales logarithmically with the dictionary size, ensuring stability even with large and diverse dictionaries. Experimental results on benchmark datasets show that our new method improves the coding performance by a large margin (up to 1.2 dB) with a very small overhead of approximately 0.5\% bits per pixel. Our code is publicly available at https://github.com/ydchen0806/CLC.
中文: 本文提出了一种条件潜在编码方法,通过从外部字典合成动态参考来利用图像源相关性增强压缩性能,在极低比特开销下实现了显著的编码质量提升。
English: This paper introduces a conditional latent coding method that synthesizes dynamic references from an external dictionary to enhance image compression by exploiting source correlations, achieving significant performance gains with minimal overhead.

Authors:Ishika Agarwal, Dilek Hakkani-Tür
Title: Neural Networks for Learnable and Scalable Influence Estimation of Instruction Fine-Tuning Data
Abstract:
Influence functions provide crucial insights into model training, but existing methods suffer from large computational costs and limited generalization. Particularly, recent works have proposed various metrics and algorithms to calculate the influence of data using language models, which do not scale well with large models and datasets. This is because of the expensive forward and backward passes required for computation, substantial memory requirements to store large models, and poor generalization of influence estimates to new data. In this paper, we explore the use of small neural networks -- which we refer to as the InfluenceNetwork -- to estimate influence values, achieving up to 99% cost reduction. Our evaluation demonstrates that influence values can be estimated with models just 0.0027% the size of full language models (we use 7B and 8B versions). We apply our algorithm of estimating influence values (called NN-CIFT: Neural Networks for effiCient Instruction Fine-Tuning) to the downstream task of subset selection for general instruction fine-tuning. In our study, we include four state-of-the-art influence functions and show no compromise in performance, despite large speedups, between NN-CIFT and the original influence functions. We provide an in-depth hyperparameter analyses of NN-CIFT. The code for our method can be found here: https://github.com/agarwalishika/NN-CIFT.
中文: 本文提出InfluenceNetwork方法,通过小型神经网络高效估算数据影响力值,成本降低高达99%,且性能与传统影响力函数相当。
English: This paper introduces InfluenceNetwork, a method using small neural networks to efficiently estimate data influence values with up to 99% cost reduction while maintaining performance comparable to traditional influence functions.

Authors:Kun Guo, Gang Cao, Zijie Lou, Xianglin Huang, Jiaoyun Liu
Title: A Lightweight and Effective Image Tampering Localization Network with Vision Mamba
Abstract:
Current image tampering localization methods primarily rely on Convolutional Neural Networks (CNNs) and Transformers. While CNNs suffer from limited local receptive fields, Transformers offer global context modeling at the expense of quadratic computational complexity. Recently, the state space model Mamba has emerged as a competitive alternative, enabling linear-complexity global dependency modeling. Inspired by it, we propose a lightweight and effective FORensic network based on vision MAmba (ForMa) for blind image tampering localization. Firstly, ForMa captures multi-scale global features that achieves efficient global dependency modeling through linear complexity. Then the pixel-wise localization map is generated by a lightweight decoder, which employs a parameter-free pixel shuffle layer for upsampling. Additionally, a noise-assisted decoding strategy is proposed to integrate complementary manipulation traces from tampered images, boosting decoder sensitivity to forgery cues. Experimental results on 10 standard datasets demonstrate that ForMa achieves state-of-the-art generalization ability and robustness, while maintaining the lowest computational complexity. Code is available at https://github.com/multimediaFor/ForMa.
中文:提出的ForMa网络利用视觉Mamba实现线性复杂度的全局特征建模,并结合噪声辅助解码策略,以最低计算成本实现了最先进的图像篡改定位性能。
English: The proposed ForMa network leverages vision Mamba for linear-complexity global feature modeling and integrates noise-assisted decoding to achieve state-of-the-art image tampering localization with minimal computational cost.

Authors:Jiankang Chen, Tianke Zhang, Changyi Liu, Haojie Ding, Yaya Shi, Feng Cheng, Huihui Xiao, Bin Wen, Fan Yang, Tingting Gao, Di Zhang
Title: TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types
Abstract:
Multimodal visual language models are gaining prominence in open-world applications, driven by advancements in model architectures, training techniques, and high-quality data. However, their performance is often limited by insufficient task-specific data, leading to poor generalization and biased outputs. Existing efforts to increase task diversity in fine-tuning datasets are hindered by the labor-intensive process of manual task labeling, which typically produces only a few hundred task types. To address this, we propose TaskGalaxy, a large-scale multimodal instruction fine-tuning dataset comprising 19,227 hierarchical task types and 413,648 samples. TaskGalaxy utilizes GPT-4o to enrich task diversity by expanding from a small set of manually defined tasks, with CLIP and GPT-4o filtering those that best match open-source images, and generating relevant question-answer pairs. Multiple models are employed to ensure sample quality. This automated process enhances both task diversity and data quality, reducing manual intervention. Incorporating TaskGalaxy into LLaVA-v1.5 and InternVL-Chat-v1.0 models shows substantial performance improvements across 16 benchmarks, demonstrating the critical importance of task diversity. TaskGalaxy is publicly released at https://github.com/Kwai-YuanQi/TaskGalaxy.
Chinese: TaskGalaxy是一个大规模多模态指令微调数据集,通过自动化生成显著提升了任务多样性和数据质量,使LLaVA-v1.5和InternVL-Chat-v1.0等模型在多个基准测试中取得了显著性能提升。
English: TaskGalaxy is a large-scale multimodal instruction fine-tuning dataset that significantly enhances task diversity and data quality through automated generation, leading to substantial performance improvements in models like LLaVA-v1.5 and InternVL-Chat-v1.0 across multiple benchmarks.

Authors:Kehan Guo, Yili Shen, Gisela Abigail Gonzalez-Montiel, Yue Huang, Yujun Zhou, Mihir Surve, Zhichun Guo, Prayel Das, Nitesh V Chawla, Olaf Wiest, Xiangliang Zhang
Title: Artificial Intelligence in Spectroscopy: Advancing Chemistry from Prediction to Generation and Beyond
Abstract:
The rapid advent of machine learning (ML) and artificial intelligence (AI) has catalyzed major transformations in chemistry, yet the application of these methods to spectroscopic and spectrometric data, referred to as Spectroscopy Machine Learning (SpectraML), remains relatively underexplored. Modern spectroscopic techniques (MS, NMR, IR, Raman, UV-Vis) generate an ever-growing volume of high-dimensional data, creating a pressing need for automated and intelligent analysis beyond traditional expert-based workflows. In this survey, we provide a unified review of SpectraML, systematically examining state-of-the-art approaches for both forward tasks (molecule-to-spectrum prediction) and inverse tasks (spectrum-to-molecule inference). We trace the historical evolution of ML in spectroscopy, from early pattern recognition to the latest foundation models capable of advanced reasoning, and offer a taxonomy of representative neural architectures, including graph-based and transformer-based methods. Addressing key challenges such as data quality, multimodal integration, and computational scalability, we highlight emerging directions such as synthetic data generation, large-scale pretraining, and few- or zero-shot learning. To foster reproducible research, we also release an open-source repository containing recent papers and their corresponding curated datasets (https://github.com/MINE-Lab-ND/SpectrumML_Survey_Papers). Our survey serves as a roadmap for researchers, guiding progress at the intersection of spectroscopy and AI.
中文摘要:本综述系统梳理了光谱机器学习领域,涵盖从分子到光谱的预测及光谱到分子的推断任务,追溯了该领域从模式识别到先进基础模型的发展历程,并指出了合成数据生成与小样本学习等新兴研究方向。
English Summary: This survey provides a comprehensive review of Spectroscopy Machine Learning (SpectraML), examining forward and inverse tasks while tracing its evolution from pattern recognition to advanced foundation models, and highlights emerging directions like synthetic data generation and few-shot learning.

Authors:Peng Ling, Wenxiao Xiong
Title: FrGNet: A fourier-guided weakly-supervised framework for nuclear instance segmentation
Abstract:
Nuclear instance segmentation has played a critical role in pathology image analysis. The main challenges arise from the difficulty in accurately segmenting instances and the high cost of precise mask-level annotations for fully-supervised training.In this work, we propose a fourier guidance framework for solving the weakly-supervised nuclear instance segmentation problem. In this framework, we construct a fourier guidance module to fuse the priori information into the training process of the model, which facilitates the model to capture the relevant features of the nuclear. Meanwhile, in order to further improve the model's ability to represent the features of nuclear, we propose the guide-based instance level contrastive module. This module makes full use of the framework's own properties and guide information to effectively enhance the representation features of nuclear. We show on two public datasets that our model can outperform current SOTA methods under fully-supervised design, and in weakly-supervised experiments, with only a small amount of labeling our model still maintains close to the performance under full supervision.In addition, we also perform generalization experiments on a private dataset, and without any labeling, our model is able to segment nuclear images that have not been seen during training quite effectively. As open science, all codes and pre-trained models are available at https://github.com/LQY404/FrGNet.
中文摘要:本研究提出了一种傅里叶引导框架用于弱监督核实例分割,通过融合先验信息和对比学习增强特征表示能力,在少量标注下即可达到接近全监督的性能水平。
English Summary: This study introduces a Fourier guidance framework for weakly-supervised nuclear instance segmentation, which integrates prior information and contrastive learning to enhance feature representation, achieving state-of-the-art performance with minimal labeling requirements.

Authors:Jinpei Guo, Zheng Chen, Wenbo Li, Yong Guo, Yulun Zhang
Title: Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal
Abstract:
Diffusion models have demonstrated remarkable success in image restoration tasks. However, their multi-step denoising process introduces significant computational overhead, limiting their practical deployment. Furthermore, existing methods struggle to effectively remove severe JPEG artifact, especially in highly compressed images. To address these challenges, we propose CODiff, a compression-aware one-step diffusion model for JPEG artifact removal. The core of CODiff is the compression-aware visual embedder (CaVE), which extracts and leverages JPEG compression priors to guide the diffusion model. We propose a dual learning strategy that combines explicit and implicit learning. Specifically, explicit learning enforces a quality prediction objective to differentiate low-quality images with different compression levels. Implicit learning employs a reconstruction objective that enhances the model's generalization. This dual learning allows for a deeper and more comprehensive understanding of JPEG compression. Experimental results demonstrate that CODiff surpasses recent leading methods in both quantitative and visual quality metrics. The code is released at https://github.com/jp-guo/CODiff.
中文:CODiff是一种创新的单步扩散模型,通过压缩感知视觉嵌入器和双重学习策略有效去除JPEG伪影,在定量和视觉质量指标上均超越了现有领先方法。
English: CODiff is a novel one-step diffusion model that utilizes a compression-aware visual embedder and dual learning strategy to effectively remove JPEG artifacts, achieving superior performance in both quantitative and visual quality metrics.

Authors:Chris Zhuang, Debadyuti Mukherjee, Yingzhou Lu, Tianfan Fu, Ruqi Zhang
Title: Gradient GA: Gradient Genetic Algorithm for Drug Molecular Design
Abstract:
Molecular discovery has brought great benefits to the chemical industry. Various molecule design techniques are developed to identify molecules with desirable properties. Traditional optimization methods, such as genetic algorithms, continue to achieve state-of-the-art results across multiple molecular design benchmarks. However, these techniques rely solely on random walk exploration, which hinders both the quality of the final solution and the convergence speed. To address this limitation, we propose a novel approach called Gradient Genetic Algorithm (Gradient GA), which incorporates gradient information from the objective function into genetic algorithms. Instead of random exploration, each proposed sample iteratively progresses toward an optimal solution by following the gradient direction. We achieve this by designing a differentiable objective function parameterized by a neural network and utilizing the Discrete Langevin Proposal to enable gradient guidance in discrete molecular spaces. Experimental results demonstrate that our method significantly improves both convergence speed and solution quality, outperforming cutting-edge techniques. For example, it achieves up to a 25% improvement in the top-10 score over the vanilla genetic algorithm. The code is publicly available at https://github.com/debadyuti23/GradientGA.
Chinese Summary: 梯度遗传算法通过引入目标函数的梯度信息来优化分子设计,相比传统方法显著提升了收敛速度和解决方案质量。
English Summary: The Gradient Genetic Algorithm (Gradient GA) enhances traditional genetic algorithms by incorporating gradient information to guide molecular optimization, significantly improving both convergence speed and solution quality in molecular design.

Authors:Anzo Teh, Mark Jabbour, Yury Polyanskiy
Title: Solving Empirical Bayes via Transformers
Abstract:
This work applies modern AI tools (transformers) to solving one of the oldest statistical problems: Poisson means under empirical Bayes (Poisson-EB) setting. In Poisson-EB a high-dimensional mean vector $θ$ (with iid coordinates sampled from an unknown prior $π$) is estimated on the basis of $X=\mathrm{Poisson}(θ)$. A transformer model is pre-trained on a set of synthetically generated pairs $(X,θ)$ and learns to do in-context learning (ICL) by adapting to unknown $π$. Theoretically, we show that a sufficiently wide transformer can achieve vanishing regret with respect to an oracle estimator who knows $π$ as dimension grows to infinity. Practically, we discover that already very small models (100k parameters) are able to outperform the best classical algorithm (non-parametric maximum likelihood, or NPMLE) both in runtime and validation loss, which we compute on out-of-distribution synthetic data as well as real-world datasets (NHL hockey, MLB baseball, BookCorpusOpen). Finally, by using linear probes, we confirm that the transformer's EB estimator appears to internally work differently from either NPMLE or Robbins' estimators.
本研究应用Transformer人工智能解决泊松经验贝叶斯问题,证明即使小型模型也能在速度和精度上超越传统方法,同时通过独特的内部机制运作。
This study employs transformer AI to tackle the Poisson empirical Bayes problem, demonstrating that even small models can surpass traditional methods in speed and accuracy while operating through distinct internal mechanisms.

Authors:Tianwei Lin, Wenqiao Zhang, Sijing Li, Yuqian Yuan, Binhe Yu, Haoyuan Li, Wanggui He, Hao Jiang, Mengze Li, Xiaohui Song, Siliang Tang, Jun Xiao, Hui Lin, Yueting Zhuang, Beng Chin Ooi
Title: HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
Abstract:
We present HealthGPT, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained large language models (LLMs). This is achieved through a novel heterogeneous low-rank adaptation (H-LoRA) technique, which is complemented by a tailored hierarchical visual perception approach and a three-stage learning strategy. To effectively learn the HealthGPT, we devise a comprehensive medical domain-specific comprehension and generation dataset called VL-Health. Experimental results demonstrate exceptional performance and scalability of HealthGPT in medical visual unified tasks. Our project can be accessed at https://github.com/DCDmllm/HealthGPT.
中文: HealthGPT是一种统一的自回归医学大视觉语言模型,通过创新的异构低秩适应技术和分层视觉感知,在医学视觉任务中展现出卓越的性能与扩展性。
English: HealthGPT is a unified autoregressive Med-LVLM that integrates medical visual comprehension and generation through a novel H-LoRA technique and hierarchical perception, achieving exceptional performance in medical tasks.

Authors:Saurabh Chauhan, Zeeshan Rasheed, Abdul Malik Sami, Zheying Zhang, Jussi Rasku, Kai-Kristian Kemell, Pekka Abrahamsson
Title: LLM-Generated Microservice Implementations from RESTful API Definitions
Abstract:
The growing need for scalable, maintainable, and fast-deploying systems has made microservice architecture widely popular in software development. This paper presents a system that uses Large Language Models (LLMs) to automate the API-first development of RESTful microservices. This system assists in creating OpenAPI specification, generating server code from it, and refining the code through a feedback loop that analyzes execution logs and error messages. By focusing on the API-first methodology, this system ensures that microservices are designed with well-defined interfaces, promoting consistency and reliability across the development life-cycle. The integration of log analysis enables the LLM to detect and address issues efficiently, reducing the number of iterations required to produce functional and robust services. This process automates the generation of microservices and also simplifies the debugging and refinement phases, allowing developers to focus on higher-level design and integration tasks. This system has the potential to benefit software developers, architects, and organizations to speed up software development cycles and reducing manual effort. To assess the potential of the system, we conducted surveys with six industry practitioners. After surveying practitioners, the system demonstrated notable advantages in enhancing development speed, automating repetitive tasks, and simplifying the prototyping process. While experienced developers appreciated its efficiency for specific tasks, some expressed concerns about its limitations in handling advanced customizations and larger scale projects. The code is publicly available at https://github.com/sirbh/code-gen
中文: 本文提出了一种利用大语言模型自动化开发RESTful微服务的系统,通过API优先设计和日志分析优化代码,显著提升开发效率并减少人工投入。
English: This paper introduces a system that automates RESTful microservice development using Large Language Models, streamlining API-first design and code refinement through log analysis to boost efficiency and reduce manual effort.

Authors:Qingsong Zou, Jingyu Xiao, Qing Li, Zhi Yan, Yuhang Wang, Li Xu, Wenxuan Wang, Kuofeng Gao, Ruoyu Li, Yong Jiang
Title: QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language
Abstract:
Recent advances in large language models (LLMs) have demonstrated remarkable potential in the field of natural language processing. Unfortunately, LLMs face significant security and ethical risks. Although techniques such as safety alignment are developed for defense, prior researches reveal the possibility of bypassing such defenses through well-designed jailbreak attacks. In this paper, we propose QueryAttack, a novel framework to examine the generalizability of safety alignment. By treating LLMs as knowledge databases, we translate malicious queries in natural language into structured non-natural query language to bypass the safety alignment mechanisms of LLMs. We conduct extensive experiments on mainstream LLMs, and the results show that QueryAttack not only can achieve high attack success rates (ASRs), but also can jailbreak various defense methods. Furthermore, we tailor a defense method against QueryAttack, which can reduce ASR by up to $64\%$ on GPT-4-1106. Our code is available at https://github.com/horizonsinzqs/QueryAttack.
中文: 本文提出QueryAttack框架,通过将恶意自然语言查询转换为结构化非自然查询来绕过大语言模型的安全对齐机制,不仅能实现高攻击成功率,还设计了可将GPT-4-1106攻击成功率降低64%的防御方案。
English: This paper introduces QueryAttack, a framework that bypasses safety alignment in large language models by converting malicious natural language queries into structured non-natural queries, achieving high attack success rates while also proposing a defense method that reduces attack effectiveness by up to 64% on GPT-4-1106.

Authors:Benedikt Alkin, Maurits Bleeker, Richard Kurle, Tobias Kronlachner, Reinhard Sonnleitner, Matthias Dorfer, Johannes Brandstetter
Title: AB-UPT: Scaling Neural CFD Surrogates for High-Fidelity Automotive Aerodynamics Simulations via Anchored-Branched Universal Physics Transformers
Abstract:
Recent advances in neural surrogate modeling offer the potential for transformative innovations in applications such as automotive aerodynamics. Yet, industrial-scale problems often involve volumetric meshes with cell counts reaching 100 million, presenting major scalability challenges. Complex geometries further complicate modeling through intricate surface-volume interactions, while quantities such as vorticity are highly nonlinear and must satisfy strict divergence-free constraints. To address these requirements, we introduce Anchored-Branched Universal Physics Transformers (AB-UPT) as a novel modeling scheme for building neural surrogates for computational fluid dynamics (CFD) simulations. AB-UPT is designed to: (i) decouple geometry encoding and prediction tasks via multi-branch operators; (ii) enable scalability to high-resolution outputs via neural simulation in a low-dimensional latent space, coupled with anchored neural field decoders to predict high-fidelity outputs; (iii) enforce physics consistency by a novel divergence-free formulation. We show that AB-UPT yields state-of-the-art predictive accuracy of surface and volume fields on automotive CFD simulations ranging from 33 thousand up to 150 million mesh cells. Furthermore, our anchored neural field architecture enables the enforcement of hard physical constraints on the physics predictions without degradation in performance, exemplified by modeling divergence-free vorticity fields. Notably, the proposed models can be trained on a single GPU in less than a day and predict industry-standard surface and volume fields within seconds. Additionally, we show that the flexible design of our method enables neural simulation from a computer-aided design geometry alone, omitting the need for costly CFD meshing procedures.
Chinese: AB-UPT模型提出了一种新颖的计算流体动力学神经代理方法,通过解耦几何编码与预测任务,在单个GPU上实现高效训练的同时强制执行无散度涡量场等物理约束,有效解决了工业级仿真的可扩展性难题。
English: The AB-UPT model introduces a novel neural surrogate approach for computational fluid dynamics that overcomes scalability challenges in industrial-scale simulations by decoupling geometry encoding and prediction, enabling efficient training on a single GPU while enforcing strict physical constraints like divergence-free vorticity fields.

Authors:Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Haining Yu, Xiaohua Jia
Title: The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions
Abstract:
Large Language Models' safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model's refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights on understanding safety alignment vulnerability from a multi-dimensional perspective. Code and artifacts are available at https://github.com/BMPixel/safety-residual-space.
中文摘要:大语言模型的安全对齐行为由激活空间中的多维方向共同控制,次要方向在塑造拒绝行为和揭示通过令牌操作绕过安全能力的脆弱性方面发挥关键作用。
English Summary: Large Language Models' safety behaviors are governed by multi-dimensional activation directions rather than a single vector, with secondary directions playing crucial roles in shaping refusal responses and revealing vulnerabilities through token manipulation.

Authors:Xiaohong Liu, Xulong Zhao, Gang Liu, Zili Wu, Tao Wang, Lei Meng, Yuhan Wang
Title: IMM-MOT: A Novel 3D Multi-object Tracking Framework with Interacting Multiple Model Filter
Abstract:
3D Multi-Object Tracking (MOT) provides the trajectories of surrounding objects, assisting robots or vehicles in smarter path planning and obstacle avoidance. Existing 3D MOT methods based on the Tracking-by-Detection framework typically use a single motion model to track an object throughout its entire tracking process. However, objects may change their motion patterns due to variations in the surrounding environment. In this paper, we introduce the Interacting Multiple Model filter in IMM-MOT, which accurately fits the complex motion patterns of individual objects, overcoming the limitation of single-model tracking in existing approaches. In addition, we incorporate a Damping Window mechanism into the trajectory lifecycle management, leveraging the continuous association status of trajectories to control their creation and termination, reducing the occurrence of overlooked low-confidence true targets. Furthermore, we propose the Distance-Based Score Enhancement module, which enhances the differentiation between false positives and true positives by adjusting detection scores, thereby improving the effectiveness of the Score Filter. On the NuScenes Val dataset, IMM-MOT outperforms most other single-modal models using 3D point clouds, achieving an AMOTA of 73.8%. Our project is available at https://github.com/Ap01lo/IMM-MOT.
中文:IMM-MOT通过交互多模型滤波器精确跟踪物体运动模式,结合阻尼窗口管理轨迹生命周期和基于距离的分数增强模块,在NuScenes数据集上实现了73.8%的AMOTA性能。
English: IMM-MOT introduces an Interacting Multiple Model filter for accurate motion pattern tracking, a Damping Window for trajectory management, and a Distance-Based Score Enhancement module, achieving 73.8% AMOTA on the NuScenes dataset.

Authors:Maizhe Yang, Kaiyuan Tang, Chaoli Wang
Title: Meta-INR: Efficient Encoding of Volumetric Data via Meta-Learning Implicit Neural Representation
Abstract:
Implicit neural representation (INR) has emerged as a promising solution for encoding volumetric data, offering continuous representations and seamless compatibility with the volume rendering pipeline. However, optimizing an INR network from randomly initialized parameters for each new volume is computationally inefficient, especially for large-scale time-varying or ensemble volumetric datasets where volumes share similar structural patterns but require independent training. To close this gap, we propose Meta-INR, a pretraining strategy adapted from meta-learning algorithms to learn initial INR parameters from partial observation of a volumetric dataset. Compared to training an INR from scratch, the learned initial parameters provide a strong prior that enhances INR generalizability, allowing significantly faster convergence with just a few gradient updates when adapting to a new volume and better interpretability when analyzing the parameters of the adapted INRs. We demonstrate that Meta-INR can effectively extract high-quality generalizable features that help encode unseen similar volume data across diverse datasets. Furthermore, we highlight its utility in tasks such as simulation parameter analysis and representative timestep selection. The code is available at https://github.com/spacefarers/MetaINR.
中文: Meta-INR采用基于元学习的预训练策略,为隐式神经表示学习初始参数,从而在相似体积数据集上实现更快收敛和更强的泛化能力。
English: Meta-INR introduces a meta-learning-based pretraining approach to learn initial parameters for implicit neural representations, enabling faster convergence and improved generalizability across similar volumetric datasets.

Authors:Duc Kieu, Kien Do, Toan Nguyen, Dang Nguyen, Thin Nguyen
Title: Bidirectional Diffusion Bridge Models
Abstract:
Diffusion bridges have shown potential in paired image-to-image (I2I) translation tasks. However, existing methods are limited by their unidirectional nature, requiring separate models for forward and reverse translations. This not only doubles the computational cost but also restricts their practicality. In this work, we introduce the Bidirectional Diffusion Bridge Model (BDBM), a scalable approach that facilitates bidirectional translation between two coupled distributions using a single network. BDBM leverages the Chapman-Kolmogorov Equation for bridges, enabling it to model data distribution shifts across timesteps in both forward and backward directions by exploiting the interchangeability of the initial and target timesteps within this framework. Notably, when the marginal distribution given endpoints is Gaussian, BDBM's transition kernels in both directions possess analytical forms, allowing for efficient learning with a single network. We demonstrate the connection between BDBM and existing bridge methods, such as Doob's h-transform and variational approaches, and highlight its advantages. Extensive experiments on high-resolution I2I translation tasks demonstrate that BDBM not only enables bidirectional translation with minimal additional cost but also outperforms state-of-the-art bridge models. Our source code is available at [https://github.com/kvmduc/BDBM||https://github.com/kvmduc/BDBM].
Chinese: 双向扩散桥模型(BDBM)提出了一种可扩展的方法,通过单一网络实现双向图像翻译,不仅降低了计算成本,而且性能优于现有技术。
English: The Bidirectional Diffusion Bridge Model (BDBM) introduces a scalable approach for bidirectional image-to-image translation using a single network, reducing computational costs and outperforming existing methods.

Authors:Bowen Chen, Keyan Chen, Mohan Yang, Zhengxia Zou, Zhenwei Shi
Title: Heterogeneous Mixture of Experts for Remote Sensing Image Super-Resolution
Abstract:
Remote sensing image super-resolution (SR) aims to reconstruct high-resolution remote sensing images from low-resolution inputs, thereby addressing limitations imposed by sensors and imaging conditions. However, the inherent characteristics of remote sensing images, including diverse ground object types and complex details, pose significant challenges to achieving high-quality reconstruction. Existing methods typically employ a uniform structure to process various types of ground objects without distinction, making it difficult to adapt to the complex characteristics of remote sensing images. To address this issue, we introduce a Mixture of Experts (MoE) model and design a set of heterogeneous experts. These experts are organized into multiple expert groups, where experts within each group are homogeneous while being heterogeneous across groups. This design ensures that specialized activation parameters can be employed to handle the diverse and intricate details of ground objects effectively. To better accommodate the heterogeneous experts, we propose a multi-level feature aggregation strategy to guide the routing process. Additionally, we develop a dual-routing mechanism to adaptively select the optimal expert for each pixel. Experiments conducted on the UCMerced and AID datasets demonstrate that our proposed method achieves superior SR reconstruction accuracy compared to state-of-the-art methods. The code will be available at https://github.com/Mr-Bamboo/MFG-HMoE.
Chinese Summary: 本文提出了一种混合专家模型,通过异构专家组和双路由机制有效处理遥感图像的多样化特征,在基准数据集上实现了更优的超分辨率重建精度。
English Summary: This paper introduces a Mixture of Experts model with heterogeneous expert groups and a dual-routing mechanism to effectively handle the diverse characteristics of remote sensing images for superior super-resolution reconstruction, demonstrating improved accuracy on benchmark datasets.

Authors:Chengqian Gao, Haonan Li, Liu Liu, Zeke Xie, Peilin Zhao, Zhiqiang Xu
Title: Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples
Abstract:
The alignment of large language models (LLMs) often assumes that using more clean data yields better outcomes, overlooking the match between model capacity and example difficulty. Challenging this, we propose a new principle: Preference data vary in difficulty, and overly difficult examples hinder alignment, by exceeding the model's capacity. Through systematic experimentation, we validate this principle with three key findings: (1) preference examples vary in difficulty, as evidenced by consistent learning orders across alignment runs; (2) overly difficult examples significantly degrade performance across four LLMs and two datasets; and (3) the capacity of a model dictates its threshold for handling difficult examples, underscoring a critical relationship between data selection and model capacity. Building on this principle, we introduce Selective DPO, which filters out overly difficult examples. This simple adjustment improves alignment performance by 9-16% in win rates on the AlpacaEval 2 benchmark compared to the DPO baseline, suppressing a series of DPO variants with different algorithmic adjustments. Together, these results illuminate the importance of aligning data difficulty with model capacity, offering a transformative perspective for improving alignment strategies in LLMs. Code is available at https://github.com/glorgao/SelectiveDPO.
中文: 本研究挑战了传统上依赖更多数据对齐大语言模型的做法,通过证明过于困难的示例会损害性能,并提出了选择性DPO方法,通过过滤这些示例将对齐效果在胜率上提升了9-16%。
English: The study challenges the conventional approach of using more data for aligning large language models by demonstrating that overly difficult examples hinder performance and introduces Selective DPO, a method that filters such examples to improve alignment outcomes by 9-16% in win rates.

Authors:Sougata Saha, Saurabh Kumar Pandey, Harshit Gupta, Monojit Choudhury
Title: Reading between the Lines: Can LLMs Identify Cross-Cultural Communication Gaps?
Abstract:
In a rapidly globalizing and digital world, content such as book and product reviews created by people from diverse cultures are read and consumed by others from different corners of the world. In this paper, we investigate the extent and patterns of gaps in understandability of book reviews due to the presence of culturally-specific items and elements that might be alien to users from another culture. Our user-study on 57 book reviews from Goodreads reveal that 83\% of the reviews had at least one culture-specific difficult-to-understand element. We also evaluate the efficacy of GPT-4o in identifying such items, given the cultural background of the reader; the results are mixed, implying a significant scope for improvement. Our datasets are available here: https://github.com/sougata-ub/reading_between_lines
中文: 本研究探讨了Goodreads书评中的文化特定元素如何造成国际读者的理解障碍,发现83%的书评存在此类文化隔阂,同时评估了GPT-4o在识别这些文化参照物方面效果有限。
English: This study examines how culturally-specific elements in book reviews from Goodreads create comprehension gaps for international readers, finding that 83% of reviews contain such barriers, and evaluates GPT-4o's limited effectiveness in identifying these cultural references.

Authors:Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
Title: Exploring the Potential of Encoder-free Architectures in 3D LMMs
Abstract:
Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to alleviate the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the failure to adapt to varying point cloud resolutions and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses. And we present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM layers to focus on the local details of the point clouds. To the end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.10%, 50.98%, and 43.10% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL
中文: 本文提出首个无编码器的3D大模型ENEL,通过预训练阶段的语义编码策略和指令调优阶段的层次几何聚合,使大语言模型直接处理3D点云,在分类、描述和视觉问答任务上达到与更大编码器模型相当的性能。
English: This paper introduces ENEL, the first encoder-free 3D Large Multimodal Model that eliminates traditional 3D encoders by embedding semantic encoding during pre-training and hierarchical geometry aggregation during fine-tuning, achieving performance comparable to larger encoder-based models across classification, captioning, and VQA tasks.

Authors:Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
Title: DexTrack: Towards Generalizable Neural Tracking Control for Dexterous Manipulation from Human References
Abstract:
We address the challenge of developing a generalizable neural tracking controller for dexterous manipulation from human references. This controller aims to manage a dexterous robot hand to manipulate diverse objects for various purposes defined by kinematic human-object interactions. Developing such a controller is complicated by the intricate contact dynamics of dexterous manipulation and the need for adaptivity, generalizability, and robustness. Current reinforcement learning and trajectory optimization methods often fall short due to their dependence on task-specific rewards or precise system models. We introduce an approach that curates large-scale successful robot tracking demonstrations, comprising pairs of human references and robot actions, to train a neural controller. Utilizing a data flywheel, we iteratively enhance the controller's performance, as well as the number and quality of successful tracking demonstrations. We exploit available tracking demonstrations and carefully integrate reinforcement learning and imitation learning to boost the controller's performance in dynamic environments. At the same time, to obtain high-quality tracking demonstrations, we individually optimize per-trajectory tracking by leveraging the learned tracking controller in a homotopy optimization method. The homotopy optimization, mimicking chain-of-thought, aids in solving challenging trajectory tracking problems to increase demonstration diversity. We showcase our success by training a generalizable neural controller and evaluating it in both simulation and real world. Our method achieves over a 10% improvement in success rates compared to leading baselines. The project website with animated results is available at https://meowuu7.github.io/DexTrack/.
Chinese: 本研究通过整合人类-机器人示范数据并结合模仿学习与同伦优化方法,开发出适用于灵巧操作的通用神经控制器,在真实环境测试中相比主流基线模型成功率提升超过10%。
English: This study develops a generalizable neural controller for dexterous robotic manipulation by curating human-robot demonstration pairs and integrating imitation learning with homotopy optimization, achieving over 10% higher success rates than leading baselines in real-world evaluations.

Authors:Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih
Title: SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
Abstract:
We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of only relying on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation: If a citation is necessary, removing the cited text from the context should prevent the same response; if sufficient, retaining the cited text alone should preserve the same response. This reward can guide the inference-time best-of-N sampling strategy to improve citation quality significantly, as well as be used in preference optimization to directly fine-tune the models for generating better citations. The effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks. The source code is available at https://github.com/facebookresearch/SelfCite
中文: SelfCite是一种自监督方法,通过上下文消融生成奖励信号,引导LLM在推理时优化采样和微调,显著提升句子级引用的质量,在基准测试中F1分数最高提升5.3分。
English: SelfCite is a self-supervised method that enhances LLMs' sentence-level citation accuracy by using context ablation to generate rewards, improving citation quality through sampling and fine-tuning, achieving up to a 5.3-point F1 score increase on benchmarks.

Authors:Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, Xinchao Wang
Title: CoT-Valve: Length-Compressible Chain-of-Thought Tuning
Abstract:
Chain-of-Thought significantly enhances a model's reasoning capability, but it also comes with a considerable increase in inference costs due to long chains. With the observation that the reasoning path can be easily compressed under easy tasks but struggle on hard tasks, we explore the feasibility of elastically controlling the length of reasoning paths with only one model, thereby reducing the inference overhead of reasoning models dynamically based on task difficulty. We introduce a new tuning and inference strategy named CoT-Valve, designed to allow models to generate reasoning chains of varying lengths. To achieve this, we propose to identify a direction in the parameter space that, when manipulated, can effectively control the length of generated CoT. Moreover, we show that this property is valuable for compressing the reasoning chain. We construct datasets with chains from long to short for the same questions and explore two enhanced strategies for CoT-Valve: (1) a precise length-compressible CoT tuning method, and (2) a progressive chain length compression approach. Our experiments show that CoT-Valve successfully enables controllability and compressibility of the chain and shows better performance than the prompt-based control. We applied this method to QwQ-32B-Preview, reducing reasoning chains on GSM8K from 741 to 225 tokens with a minor performance drop (95.07% to 94.92%) and on AIME from 6827 to 4629 tokens, with only one additional incorrect answer.
中文: CoT-Valve实现了对模型推理链长度的动态控制,有效降低推理开销,在复杂任务上以极小的性能损失显著压缩了推理链长度。
English: CoT-Valve enables dynamic control of reasoning chain lengths in models to reduce inference costs, achieving significant token compression with minimal performance loss on complex tasks.

Authors:Montgomery Bohde, Mrunali Manjrekar, Runzhong Wang, Shuiwang Ji, Connor W. Coley
Title: DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra
Abstract:
Mass spectrometry plays a fundamental role in elucidating the structures of unknown molecules and subsequent scientific discoveries. One formulation of the structure elucidation task is the conditional de novo generation of molecular structure given a mass spectrum. Toward a more accurate and efficient scientific discovery pipeline for small molecules, we present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. The encoder utilizes a transformer architecture and models mass spectra domain knowledge such as peak formulae and neutral losses, and the decoder is a discrete graph diffusion model restricted by the heavy-atom composition of a known chemical formula. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs, which are available in virtually infinite quantities, compared to structure-spectrum pairs that number in the tens of thousands. Extensive experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation. We provide several ablations to demonstrate the effectiveness of our diffusion and pretraining approaches and show consistent performance scaling with increasing pretraining dataset size. DiffMS code is publicly available at https://github.com/coleygroup/DiffMS.
中文:DiffMS是一种先进的生成网络,结合了Transformer编码器和离散图扩散解码器,能够根据质谱精确生成分子结构,在从头分子生成任务中超越了现有模型。
English: DiffMS is a state-of-the-art generative network that uses a transformer encoder and discrete graph diffusion decoder to accurately generate molecular structures from mass spectra, outperforming existing models in de novo molecule generation.

Authors:Liang Wang, Chao Song, Zhiyuan Liu, Yu Rong, Qiang Liu, Shu Wu, Liang Wang
Title: Diffusion Models for Molecules: A Survey of Methods and Tasks
Abstract:
Generative tasks about molecules, including but not limited to molecule generation, are crucial for drug discovery and material design, and have consistently attracted significant attention. In recent years, diffusion models have emerged as an impressive class of deep generative models, sparking extensive research and leading to numerous studies on their application to molecular generative tasks. Despite the proliferation of related work, there remains a notable lack of up-to-date and systematic surveys in this area. Particularly, due to the diversity of diffusion model formulations, molecular data modalities, and generative task types, the research landscape is challenging to navigate, hindering understanding and limiting the area's growth. To address this, this paper conducts a comprehensive survey of diffusion model-based molecular generative methods. We systematically review the research from the perspectives of methodological formulations, data modalities, and task types, offering a novel taxonomy. This survey aims to facilitate understanding and further flourishing development in this area. The relevant papers are summarized at: https://github.com/AzureLeon1/awesome-molecular-diffusion-models.
中文摘要:本文对基于扩散模型的分子生成方法进行了全面综述,从方法框架、数据模态和任务类型角度系统梳理了相关研究,旨在促进该领域的深入理解和蓬勃发展。
English Summary: This paper provides a comprehensive survey of diffusion model-based molecular generative methods, systematically reviewing them through methodological formulations, data modalities, and task types to facilitate understanding and development in this rapidly growing field.

Authors:Nicholas Dronen, Randall Balestriero
Title: Eidetic Learning: an Efficient and Provable Solution to Catastrophic Forgetting
Abstract:
Catastrophic forgetting -- the phenomenon of a neural network learning a task t1 and losing the ability to perform it after being trained on some other task t2 -- is a long-standing problem for neural networks [McCloskey and Cohen, 1989]. We present a method, Eidetic Learning, that provably solves catastrophic forgetting. A network trained with Eidetic Learning -- here, an EideticNet -- requires no rehearsal or replay. We consider successive discrete tasks and show how at inference time an EideticNet automatically routes new instances without auxiliary task information. An EideticNet bears a family resemblance to the sparsely-gated Mixture-of-Experts layer Shazeer et al. [2016] in that network capacity is partitioned across tasks and the network itself performs data-conditional routing. An EideticNet is easy to implement and train, is efficient, and has time and space complexity linear in the number of parameters. The guarantee of our method holds for normalization layers of modern neural networks during both pre-training and fine-tuning. We show with a variety of network architectures and sets of tasks that EideticNets are immune to forgetting. While the practical benefits of EideticNets are substantial, we believe they can be benefit practitioners and theorists alike. The code for training EideticNets is available at https://github.com/amazon-science/eideticnet-training.
中文: 记忆学习法通过将网络容量按任务划分并实现自动数据条件路由,有效解决了神经网络的灾难性遗忘问题,无需回放或复习机制。
English: Eidetic Learning is a method that effectively prevents catastrophic forgetting in neural networks by partitioning capacity across tasks and enabling automatic data-conditional routing without requiring rehearsal or replay.

Authors:Yi Yu, Xue Yang, Yansheng Li, Zhenjun Han, Feipeng Da, Junchi Yan
Title: Wholly-WOOD: Wholly Leveraging Diversified-quality Labels for Weakly-supervised Oriented Object Detection
Abstract:
Accurately estimating the orientation of visual objects with compact rotated bounding boxes (RBoxes) has become a prominent demand, which challenges existing object detection paradigms that only use horizontal bounding boxes (HBoxes). To equip the detectors with orientation awareness, supervised regression/classification modules have been introduced at the high cost of rotation annotation. Meanwhile, some existing datasets with oriented objects are already annotated with horizontal boxes or even single points. It becomes attractive yet remains open for effectively utilizing weaker single point and horizontal annotations to train an oriented object detector (OOD). We develop Wholly-WOOD, a weakly-supervised OOD framework, capable of wholly leveraging various labeling forms (Points, HBoxes, RBoxes, and their combination) in a unified fashion. By only using HBox for training, our Wholly-WOOD achieves performance very close to that of the RBox-trained counterpart on remote sensing and other areas, significantly reducing the tedious efforts on labor-intensive annotation for oriented objects. The source codes are available at https://github.com/VisionXLab/whollywood (PyTorch-based) and https://github.com/VisionXLab/whollywood-jittor (Jittor-based).
中文: 该研究提出了Wholly-WOOD弱监督框架,能统一利用点、水平框和旋转框等多种标注形式训练定向物体检测器,仅用水平框即可达到接近全标注的性能,大幅降低标注成本。
English: The study introduces Wholly-WOOD, a weakly-supervised framework that effectively trains oriented object detectors using various annotation forms like points, horizontal boxes, and rotated boxes, achieving near-optimal performance with minimal annotation effort.

Authors:Dexian Cai, Xiaocui Yang, Yongkang Liu, Daling Wang, Shi Feng, Yifei Zhang, Soujanya Poria
Title: Pixel-Level Reasoning Segmentation via Multi-turn Conversations
Abstract:
Existing visual perception systems focus on region-level segmentation in single-turn dialogues, relying on complex and explicit query instructions. Such systems cannot reason at the pixel level and comprehend dynamic user intent that changes over interaction. Our work tackles this issue by introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, tracking evolving user intent via multi-turn interactions for fine-grained segmentation. To establish a benchmark for this novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k multi-turn conversational scenarios with segmentation targets. Building on PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning Segmentation framework, integrates pixel-level segmentation with robust multi-turn conversation understanding, generating pixel-grounded explanations aligned with user intent. The PRIST dataset and MIRSA framework fill the gap in pixel-level reasoning segmentation. Experimental results on the PRIST dataset demonstrate that our method outperforms current segmentation-specific baselines in terms of segmentation and LLM-based reasoning metrics. The code and data are available at: https://github.com/ccccai239/PixelRIST.
中文: 本研究提出了基于多轮对话的像素级推理分割新任务,通过构建PRIST数据集和MIRAS框架,实现了对动态用户意图的追踪与精细分割,在分割效果和推理指标上均优于现有基准方法。
English: This work introduces Pixel-level Reasoning Segmentation (Pixel-level RS), a novel task that tracks evolving user intent through multi-turn conversations for fine-grained segmentation, supported by the PRIST dataset and MIRAS framework, which outperform existing methods in segmentation and reasoning metrics.

Authors:Xiaoliu Guan, Yu Wu, Huayang Huang, Xiao Liu, Jiaxu Miao, Yi Yang
Title: Redistribute Ensemble Training for Mitigating Memorization in Diffusion Models
Abstract:
Diffusion models, known for their tremendous ability to generate high-quality samples, have recently raised concerns due to their data memorization behavior, which poses privacy risks. Recent methods for memory mitigation have primarily addressed the issue within the context of the text modality in cross-modal generation tasks, restricting their applicability to specific conditions. In this paper, we propose a novel method for diffusion models from the perspective of visual modality, which is more generic and fundamental for mitigating memorization. Directly exposing visual data to the model increases memorization risk, so we design a framework where models learn through proxy model parameters instead. Specially, the training dataset is divided into multiple shards, with each shard training a proxy model, then aggregated to form the final model. Additionally, practical analysis of training losses illustrates that the losses for easily memorable images tend to be obviously lower. Thus, we skip the samples with abnormally low loss values from the current mini-batch to avoid memorizing. However, balancing the need to skip memorization-prone samples while maintaining sufficient training data for high-quality image generation presents a key challenge. Thus, we propose IET-AGC+, which redistributes highly memorizable samples between shards, to mitigate these samples from over-skipping. Furthermore, we dynamically augment samples based on their loss values to further reduce memorization. Extensive experiments and analysis on four datasets show that our method successfully reduces memory capacity while maintaining performance. Moreover, we fine-tune the pre-trained diffusion models, e.g., Stable Diffusion, and decrease the memorization score by 46.7\%, demonstrating the effectiveness of our method. Code is available in: https://github.com/liuxiao-guan/IET_AGC.
中文摘要:本文提出一种通过代理模型参数和动态样本管理的新方法,有效降低扩散模型的数据记忆风险,在保持图像生成质量的同时显著提升隐私保护能力。
English Summary: This paper introduces a novel method to mitigate data memorization in diffusion models by using proxy model parameters and dynamic sample management, effectively reducing privacy risks while maintaining image generation quality.

Authors:Daniel Fleischer, Moshe Berchansky, Gad Markovits, Moshe Wasserblat
Title: SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models
Abstract:
In the rapidly evolving field of Natural Language Processing, Large Language Models (LLMs) are tasked with increasingly complex reasoning challenges. Traditional methods like chain-of-thought prompting have shown promise but often fall short in fully leveraging a model's reasoning capabilities. This paper introduces SQuARE (Sequential Question Answering Reasoning Engine), a novel prompting technique designed to improve reasoning through a self-interrogation paradigm. Building upon CoT frameworks, SQuARE prompts models to generate and resolve multiple auxiliary questions before tackling the main query, promoting a more thorough exploration of various aspects of a topic. Our expansive evaluations, conducted with Llama 3 and GPT-4o models across multiple question-answering datasets, demonstrate that SQuARE significantly surpasses traditional CoT prompts and existing rephrase-and-respond methods. By systematically decomposing queries, SQuARE advances LLM capabilities in reasoning tasks. The code is publicly available at https://github.com/IntelLabs/RAG-FiT/tree/square.
Chinese: 本文提出SQuARE这一新颖的自询问提示技术,通过让模型在回答主问题前生成并解决多个辅助问题来增强推理能力,在Llama 3和GPT-4o的评估中显著超越了传统思维链等方法的性能表现。
English: This paper introduces SQuARE, a novel self-interrogation prompting technique that enhances LLM reasoning by generating and resolving auxiliary questions before addressing main queries, significantly outperforming traditional methods like chain-of-thought in evaluations with Llama 3 and GPT-4o.

Authors:Khawla Elhadri, Tomasz Michalski, Adam Wróbel, Jörg Schlötterer, Bartosz Zieliński, Christin Seifert
Title: This looks like what? Challenges and Future Research Directions for Part-Prototype Models
Abstract:
The growing interest in eXplainable Artificial Intelligence (XAI) has prompted research into models with built-in interpretability, the most prominent of which are part-prototype models. Part-Prototype Models (PPMs) make decisions by comparing an input image to a set of learned prototypes, providing human-understandable explanations in the form of ``this looks like that''. Despite their inherent interpretability, PPMS are not yet considered a valuable alternative to post-hoc models. In this survey, we investigate the reasons for this and provide directions for future research. We analyze papers from 2019 to 2024, and derive a taxonomy of the challenges that current PPMS face. Our analysis shows that the open challenges are quite diverse. The main concern is the quality and quantity of prototypes. Other concerns are the lack of generalization to a variety of tasks and contexts, and general methodological issues, including non-standardized evaluation. We provide ideas for future research in five broad directions: improving predictive performance, developing novel architectures grounded in theory, establishing frameworks for human-AI collaboration, aligning models with humans, and establishing metrics and benchmarks for evaluation. We hope that this survey will stimulate research and promote intrinsically interpretable models for application domains. Our list of surveyed papers is available at https://github.com/aix-group/ppm-survey.
中文: 本综述分析了2019至2024年的部件原型模型,揭示了原型质量和方法论等核心挑战,并提出五个研究方向以推进这种本质可解释的AI模型发展。
English: This survey analyzes part-prototype models (PPMs) from 2019-2024, identifying key challenges like prototype quality and methodological issues while proposing five research directions to advance these inherently interpretable AI models.

Authors:Jiayang Wu, Wensheng Gan, Philip S. Yu
Title: Graph Diffusion Network for Drug-Gene Prediction
Abstract:
Predicting drug-gene associations is crucial for drug development and disease treatment. While graph neural networks (GNN) have shown effectiveness in this task, they face challenges with data sparsity and efficient contrastive learning implementation. We introduce a graph diffusion network for drug-gene prediction (GDNDGP), a framework that addresses these limitations through two key innovations. First, it employs meta-path-based homogeneous graph learning to capture drug-drug and gene-gene relationships, ensuring similar entities share embedding spaces. Second, it incorporates a parallel diffusion network that generates hard negative samples during training, eliminating the need for exhaustive negative sample retrieval. Our model achieves superior performance on the DGIdb 4.0 dataset and demonstrates strong generalization capability on tripartite drug-gene-disease networks. Results show significant improvements over existing methods in drug-gene prediction tasks, particularly in handling complex heterogeneous relationships. The source code is publicly available at https://github.com/csjywu1/GDNDGP.
Chinese Summary: 提出的GDNDGP框架通过整合基于元路径的同构图学习和并行扩散网络生成困难负样本,显著提升了药物-基因关联预测性能,在基准数据集上实现了最优效果。
English Summary: The proposed GDNDGP framework enhances drug-gene association prediction by integrating meta-path-based homogeneous graph learning and a parallel diffusion network for hard negative sampling, achieving state-of-the-art performance on benchmark datasets.

Authors:Chen Xu, Yuxin Li, Wenjie Wang, Liang Pang, Jun Xu, Tat-Seng Chua
Title: Bridging Jensen Gap for Max-Min Group Fairness Optimization in Recommendation
Abstract:
Group max-min fairness (MMF) is commonly used in fairness-aware recommender systems (RS) as an optimization objective, as it aims to protect marginalized item groups and ensures a fair competition platform. However, our theoretical analysis indicates that integrating MMF constraint violates the assumption of sample independence during optimization, causing the loss function to deviate from linear additivity. Such nonlinearity property introduces the Jensen gap between the model's convergence point and the optimal point if mini-batch sampling is applied. Both theoretical and empirical studies show that as the mini-batch size decreases and the group size increases, the Jensen gap will widen accordingly. Some methods using heuristic re-weighting or debiasing strategies have the potential to bridge the Jensen gap. However, they either lack theoretical guarantees or suffer from heavy computational costs. To overcome these limitations, we first theoretically demonstrate that the MMF-constrained objective can be essentially reformulated as a group-weighted optimization objective. Then we present an efficient and effective algorithm named FairDual, which utilizes a dual optimization technique to minimize the Jensen gap. Our theoretical analysis demonstrates that FairDual can achieve a sub-linear convergence rate to the globally optimal solution and the Jensen gap can be well bounded under a mini-batch sampling strategy with random shuffle. Extensive experiments conducted using six large-scale RS backbone models on three publicly available datasets demonstrate that FairDual outperforms all baselines in terms of both accuracy and fairness. Our data and codes are shared at https://github.com/XuChen0427/FairDual.
中文: 研究表明,推荐系统中的组最大最小公平性会因非线性引入Jensen间隙,并提出了FairDual双重优化算法,能在保证准确性和公平性的同时有效缩小该间隙。
English: The study reveals that group max-min fairness in recommender systems introduces a Jensen gap due to nonlinearity, and proposes FairDual, a dual optimization algorithm that effectively minimizes this gap while ensuring both accuracy and fairness.

Authors:Daniel Koutas, Daniel Hettegger, Kostas G. Papakonstantinou, Daniel Straub
Title: Convex Is Back: Solving Belief MDPs With Convexity-Informed Deep Reinforcement Learning
Abstract:
We present a novel method for Deep Reinforcement Learning (DRL), incorporating the convex property of the value function over the belief space in Partially Observable Markov Decision Processes (POMDPs). We introduce hard- and soft-enforced convexity as two different approaches, and compare their performance against standard DRL on two well-known POMDP environments, namely the Tiger and FieldVisionRockSample problems. Our findings show that including the convexity feature can substantially increase performance of the agents, as well as increase robustness over the hyperparameter space, especially when testing on out-of-distribution domains. The source code for this work can be found at https://github.com/Dakout/Convex_DRL.
中文: 本研究提出了一种新颖的深度强化学习方法,通过在部分可观测马尔可夫决策过程中强化价值函数的凸性,显著提升了智能体性能并增强了超参数和分布外测试的鲁棒性。
English: This study introduces a novel Deep Reinforcement Learning method that enforces convexity of the value function in POMDPs, demonstrating significant performance improvements and enhanced robustness across hyperparameters and out-of-distribution domains.

Authors:Mojtaba Safari, Shansong Wang, Zach Eidex, Richard Qiu, Chih-Wei Chang, David S. Yu, Xiaofeng Yang
Title: A Physics-Informed Deep Learning Model for MRI Brain Motion Correction
Abstract:
Background: MRI is crucial for brain imaging but is highly susceptible to motion artifacts due to long acquisition times. This study introduces PI-MoCoNet, a physics-informed motion correction network that integrates spatial and k-space information to remove motion artifacts without explicit motion parameter estimation, enhancing image fidelity and diagnostic reliability. Materials and Methods: PI-MoCoNet consists of a motion detection network (U-net with spatial averaging) to identify corrupted k-space lines and a motion correction network (U-net with Swin Transformer blocks) to reconstruct motion-free images. The correction is guided by three loss functions: reconstruction (L1), perceptual (LPIPS), and data consistency (Ldc). Motion artifacts were simulated via rigid phase encoding perturbations and evaluated on IXI and MR-ART datasets against Pix2Pix, CycleGAN, and U-net using PSNR, SSIM, and NMSE. Results: PI-MoCoNet significantly improved image quality. On IXI, for minor artifacts, PSNR increased from 34.15 dB to 45.95 dB, SSIM from 0.87 to 1.00, and NMSE reduced from 0.55% to 0.04%. For moderate artifacts, PSNR improved from 30.23 dB to 42.16 dB, SSIM from 0.80 to 0.99, and NMSE from 1.32% to 0.09%. For heavy artifacts, PSNR rose from 27.99 dB to 36.01 dB, SSIM from 0.75 to 0.97, and NMSE decreased from 2.21% to 0.36%. On MR-ART, PI-MoCoNet achieved PSNR gains of ~10 dB and SSIM improvements of up to 0.20, with NMSE reductions of ~6%. Ablation studies confirmed the importance of data consistency and perceptual losses, yielding a 1 dB PSNR gain and 0.17% NMSE reduction. Conclusions: PI-MoCoNet effectively mitigates motion artifacts in brain MRI, outperforming existing methods. Its ability to integrate spatial and k-space information makes it a promising tool for clinical use in motion-prone settings. Code: https://github.com/mosaf/PI-MoCoNet.git.
Chinese: PI-MoCoNet 是一种物理信息驱动的运动校正网络,通过整合空间和k空间数据,有效消除脑部MRI中的运动伪影,显著提升图像质量,且无需显式运动估计,性能优于现有方法。
English: PI-MoCoNet is a physics-informed motion correction network that integrates spatial and k-space data to effectively remove motion artifacts in brain MRI, significantly enhancing image quality and outperforming existing methods without explicit motion estimation.

Authors:Yuankai Luo, Lei Shi, Xiao-Ming Wu
Title: Can Classic GNNs Be Strong Baselines for Graph-level Tasks? Simple Architectures Meet Excellence
Abstract:
Message-passing Graph Neural Networks (GNNs) are often criticized for their limited expressiveness, issues like over-smoothing and over-squashing, and challenges in capturing long-range dependencies. Conversely, Graph Transformers (GTs) are regarded as superior due to their employment of global attention mechanisms, which potentially mitigate these challenges. Literature frequently suggests that GTs outperform GNNs in graph-level tasks, especially for graph classification and regression on small molecular graphs. In this study, we explore the untapped potential of GNNs through an enhanced framework, GNN+, which integrates six widely used techniques: edge feature integration, normalization, dropout, residual connections, feed-forward networks, and positional encoding, to effectively tackle graph-level tasks. We conduct a systematic re-evaluation of three classic GNNs (GCN, GIN, and GatedGCN) enhanced by the GNN+ framework across 14 well-known graph-level datasets. Our results reveal that, contrary to prevailing beliefs, these classic GNNs consistently match or surpass the performance of GTs, securing top-three rankings across all datasets and achieving first place in eight. Furthermore, they demonstrate greater efficiency, running several times faster than GTs on many datasets. This highlights the potential of simple GNN architectures, challenging the notion that complex mechanisms in GTs are essential for superior graph-level performance. Our source code is available at https://github.com/LUOyk1999/GNNPlus.
中文: 本研究通过集成六种技术的GNN+框架增强经典图神经网络,证明其在图级任务中不仅能匹配甚至超越图Transformer的性能,且效率更高,从而挑战了复杂GT架构必然优越的普遍认知。
English: This study demonstrates that enhanced classic GNNs, through the GNN+ framework integrating six techniques, can match or exceed Graph Transformers' performance in graph-level tasks while being more efficient, challenging the prevailing superiority of complex GT architectures.

Authors:Daocheng Fu, Naiting Zhong, Xu Han, Pinlong Cai, Licheng Wen, Song Mao, Botian Shi, Yu Qiao
Title: LimSim Series: An Autonomous Driving Simulation Platform for Validation and Enhancement
Abstract:
Closed-loop simulation environments play a crucial role in the validation and enhancement of autonomous driving systems (ADS). However, certain challenges warrant significant attention, including balancing simulation accuracy with duration, reconciling functionality with practicality, and establishing comprehensive evaluation mechanisms. This paper addresses these challenges by introducing the LimSim Series, a comprehensive simulation platform designed to support the rapid deployment and efficient iteration of ADS. The LimSim Series integrates multi-type information from road networks, employs human-like decision-making and planning algorithms for background vehicles, and introduces the concept of the Area of Interest (AoI) to optimize computational resources. The platform offers a variety of baseline algorithms and user-friendly interfaces, facilitating flexible validation of multiple technical pipelines. Additionally, the LimSim Series incorporates multi-dimensional evaluation metrics, delivering thorough insights into system performance, thus enabling researchers to promptly identify issues for further improvements. Experiments demonstrate that the LimSim Series is compatible with modular, end-to-end, and VLM-based knowledge-driven systems. It can assist in the iteration and updating of ADS by evaluating performance across various scenarios. The code of the LimSim Series is released at: https://github.com/PJLab-ADG/LimSim.
中文: 本文提出的LimSim系列仿真平台通过整合多类型道路数据、拟人化算法和优化计算资源,解决了自动驾驶系统验证中的关键挑战,支持灵活测试和多维评估。
English: This paper introduces the LimSim Series, a comprehensive simulation platform that addresses key challenges in autonomous driving system validation by integrating multi-type road data, human-like algorithms, and optimized computational resources to enable flexible testing and multi-dimensional evaluation.

Authors:Giuseppe Fasano, Yashar Deldjoo, Tommaso di Noia, Bianca Lau, Sina Adham-Khiabani, Eric Morris, Xia Liu, Ganga Chinna Rao Devarapu, Liam O'Faolain
Title: Use of Air Quality Sensor Network Data for Real-time Pollution-Aware POI Suggestion
Abstract:
This demo paper introduces AirSense-R, a privacy-preserving mobile application that delivers real-time, pollution-aware recommendations for urban points of interest (POIs). By merging live air quality data from AirSENCE sensor networks in Bari (Italy) and Cork (Ireland) with user preferences, the system enables health-conscious decision-making. It employs collaborative filtering for personalization, federated learning for privacy, and a prediction engine to detect anomalies and interpolate sparse sensor data. The proposed solution adapts dynamically to urban air quality while safeguarding user privacy. The code and demonstration video are available at https://github.com/AirtownApp/Airtown-Application.git.
中文: AirSense-R 是一款保护隐私的移动应用,通过融合实时空气质量数据与用户偏好,为城市兴趣点提供污染感知的实时推荐,并采用协同过滤、联邦学习和预测引擎来确保个性化和隐私保护。
English: AirSense-R is a privacy-preserving mobile app that provides real-time, pollution-aware recommendations for urban points of interest by integrating live air quality data with user preferences, utilizing collaborative filtering, federated learning, and a prediction engine to ensure personalization and privacy.

Authors:Shihao Zhang, Yuguang Yan, Angela Yao
Title: Improving Deep Regression with Tightness
Abstract:
For deep regression, preserving the ordinality of the targets with respect to the feature representation improves performance across various tasks. However, a theoretical explanation for the benefits of ordinality is still lacking. This work reveals that preserving ordinality reduces the conditional entropy $H(Z|Y)$ of representation $Z$ conditional on the target $Y$. However, our findings reveal that typical regression losses do little to reduce $H(Z|Y)$, even though it is vital for generalization performance. With this motivation, we introduce an optimal transport-based regularizer to preserve the similarity relationships of targets in the feature space to reduce $H(Z|Y)$. Additionally, we introduce a simple yet efficient strategy of duplicating the regressor targets, also with the aim of reducing $H(Z|Y)$. Experiments on three real-world regression tasks verify the effectiveness of our strategies to improve deep regression. Code: https://github.com/needylove/Regression_tightness.
中文: 在深度回归中保持目标的有序性可降低条件熵H(Z|Y)以提升性能,本研究通过引入最优传输正则化和目标复制策略有效减少了该熵,并在三个实际任务中验证了其有效性。
English: Preserving ordinality in deep regression reduces conditional entropy H(Z|Y) to enhance performance, and this work introduces both an optimal transport regularizer and a target duplication strategy to effectively minimize this entropy, validated across three real-world tasks.

Authors:Rubén Pérez-Jove, Cristian R. Munteanu, Alejandro Pazos, Jose Vázquez-Naya
Title: Application of Tabular Transformer Architectures for Operating System Fingerprinting
Abstract:
Operating System (OS) fingerprinting is essential for network management and cybersecurity, enabling accurate device identification based on network traffic analysis. Traditional rule-based tools such as Nmap and p0f face challenges in dynamic environments due to frequent OS updates and obfuscation techniques. While Machine Learning (ML) approaches have been explored, Deep Learning (DL) models, particularly Transformer architectures, remain unexploited in this domain. This study investigates the application of Tabular Transformer architectures-specifically TabTransformer and FT-Transformer-for OS fingerprinting, leveraging structured network data from three publicly available datasets. Our experiments demonstrate that FT-Transformer generally outperforms traditional ML models, previous approaches and TabTransformer across multiple classification levels (OS family, major, and minor versions). The results establish a strong foundation for DL-based OS fingerprinting, improving accuracy and adaptability in complex network environments. Furthermore, we ensure the reproducibility of our research by providing an open-source implementation.
中文: 本研究首次将表格Transformer架构应用于操作系统指纹识别,其中FT-Transformer在多个分类层级上显著优于传统方法,为复杂网络环境下的精准设备识别建立了新基准。
English: This study introduces Tabular Transformer architectures, particularly FT-Transformer, for OS fingerprinting, demonstrating superior accuracy over traditional methods and enhancing adaptability in complex network environments.

Authors:Jinhui Guo, Lubin Fan, Bojian Wu, Jiaqi Gu, Shen Cao, Jieping Ye
Title: PTZ-Calib: Robust Pan-Tilt-Zoom Camera Calibration
Abstract:
In this paper, we present PTZ-Calib, a robust two-stage PTZ camera calibration method, that efficiently and accurately estimates camera parameters for arbitrary viewpoints. Our method includes an offline and an online stage. In the offline stage, we first uniformly select a set of reference images that sufficiently overlap to encompass a complete 360° view. We then utilize the novel PTZ-IBA (PTZ Incremental Bundle Adjustment) algorithm to automatically calibrate the cameras within a local coordinate system. Additionally, for practical application, we can further optimize camera parameters and align them with the geographic coordinate system using extra global reference 3D information. In the online stage, we formulate the calibration of any new viewpoints as a relocalization problem. Our approach balances the accuracy and computational efficiency to meet real-world demands. Extensive evaluations demonstrate our robustness and superior performance over state-of-the-art methods on various real and synthetic datasets. Datasets and source code can be accessed online at https://github.com/gjgjh/PTZ-Calib
中文: PTZ-Calib提出了一种鲁棒的双阶段标定方法,通过离线PTZ-IBA优化和在线重定位,高效估算任意视角的相机参数,实现了卓越的精度与计算效率。
English: PTZ-Calib introduces a robust two-stage calibration method that efficiently estimates camera parameters for arbitrary viewpoints through offline PTZ-IBA optimization and online relocalization, achieving superior accuracy and computational efficiency.

Authors:Debangshu Banerjee, Tarun Suresh, Shubham Ugare, Sasa Misailovic, Gagandeep Singh
Title: CRANE: Reasoning with constrained LLM generation
Abstract:
Code generation, symbolic math reasoning, and other tasks require LLMs to produce outputs that are both syntactically and semantically correct. Constrained LLM generation is a promising direction to enforce adherence to formal grammar, but prior works have empirically observed that strict enforcement of formal constraints often diminishes the reasoning capabilities of LLMs. In this work, we first provide a theoretical explanation for why constraining LLM outputs to very restrictive grammars that only allow syntactically valid final answers reduces the reasoning capabilities of the model. Second, we demonstrate that by augmenting the output grammar with carefully designed additional rules, it is always possible to preserve the reasoning capabilities of the LLM while ensuring syntactic and semantic correctness in its outputs. Building on these theoretical insights, we propose a reasoning-augmented constrained decoding algorithm, CRANE, which effectively balances the correctness of constrained generation with the flexibility of unconstrained generation. Experiments on multiple open-source LLMs and benchmarks show that CRANE significantly outperforms both state-of-the-art constrained decoding strategies and standard unconstrained decoding, showing up to 10% points accuracy improvement over baselines on challenging symbolic reasoning benchmarks GSM-symbolic and FOLIO.
中文摘要:限制LLM输出至严格语法会削弱推理能力,但通过精心设计的附加规则扩充输出语法可在确保正确性的同时保留模型推理能力,CRANE算法在多项测试中显著优于现有方法。
English Summary: Constraining LLM outputs to strict grammars can hinder reasoning, but augmenting grammars with additional rules preserves reasoning capabilities while ensuring correctness, as demonstrated by the CRANE algorithm which significantly outperforms existing methods.

Authors:Shiryu Ueno, Yoshikazu Hayashi, Shunsuke Nakatsuka, Yusei Yamada, Hiroaki Aizawa, Kunihito Kato
Title: Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model
Abstract:
We propose general visual inspection model using Vision-Language Model~(VLM) with few-shot images of non-defective or defective products, along with explanatory texts that serve as inspection criteria. Although existing VLM exhibit high performance across various tasks, they are not trained on specific tasks such as visual inspection. Thus, we construct a dataset consisting of diverse images of non-defective and defective products collected from the web, along with unified formatted output text, and fine-tune VLM. For new products, our method employs In-Context Learning, which allows the model to perform inspections with an example of non-defective or defective image and the corresponding explanatory texts with visual prompts. This approach eliminates the need to collect a large number of training samples and re-train the model for each product. The experimental results show that our method achieves high performance, with MCC of 0.804 and F1-score of 0.950 on MVTec AD in a one-shot manner. Our code is available at~https://github.com/ia-gu/Vision-Language-In-Context-Learning-Driven-Few-Shot-Visual-Inspection-Model.
中文: 本研究提出了一种基于视觉语言模型的视觉检测方法,通过少量样本和解释性文本识别产品缺陷,无需大量重新训练即可实现高性能。
English: This study introduces a vision-language model for visual inspection that uses few-shot examples and explanatory texts to identify defects in products, achieving high performance without extensive retraining.

Authors:Lingting Zhu, Guying Lin, Jinnan Chen, Xinjie Zhang, Zhenchao Jin, Zhao Wang, Lequan Yu
Title: Large Images are Gaussians: High-Quality Large Image Representation with Levels of 2D Gaussian Splatting
Abstract:
While Implicit Neural Representations (INRs) have demonstrated significant success in image representation, they are often hindered by large training memory and slow decoding speed. Recently, Gaussian Splatting (GS) has emerged as a promising solution in 3D reconstruction due to its high-quality novel view synthesis and rapid rendering capabilities, positioning it as a valuable tool for a broad spectrum of applications. In particular, a GS-based representation, 2DGS, has shown potential for image fitting. In our work, we present \textbf{L}arge \textbf{I}mages are \textbf{G}aussians (\textbf{LIG}), which delves deeper into the application of 2DGS for image representations, addressing the challenge of fitting large images with 2DGS in the situation of numerous Gaussian points, through two distinct modifications: 1) we adopt a variant of representation and optimization strategy, facilitating the fitting of a large number of Gaussian points; 2) we propose a Level-of-Gaussian approach for reconstructing both coarse low-frequency initialization and fine high-frequency details. Consequently, we successfully represent large images as Gaussian points and achieve high-quality large image representation, demonstrating its efficacy across various types of large images. Code is available at {\href{https://github.com/HKU-MedAI/LIG}{https://github.com/HKU-MedAI/LIG}}.
中文:LIG方法通过优化高斯点管理和采用多层次重建策略,改进了2D高斯泼溅技术,成功实现了对大尺寸图像的高效高质量表示,并在多种图像类型上验证了其有效性。
English: The LIG method enhances 2D Gaussian Splatting to efficiently represent large images by optimizing Gaussian point management and employing a multi-level reconstruction strategy, achieving high-quality results across diverse image types.

Authors:Jun Yuan, Guohao Cai, Zhenhua Dong
Title: A Contextual-Aware Position Encoding for Sequential Recommendation
Abstract:
Sequential recommendation (SR), which encodes user activity to predict the next action, has emerged as a widely adopted strategy in developing commercial personalized recommendation systems. A critical component of modern SR models is the attention mechanism, which synthesizes users' historical activities. This mechanism is typically order-invariant and generally relies on position encoding (PE). Conventional SR models simply assign a learnable vector to each position, resulting in only modest gains compared to traditional recommendation models. Moreover, limited research has been conducted on position encoding tailored for sequential recommendation, leaving a significant gap in addressing its unique requirements. To bridge this gap, we propose a novel Contextual-Aware Position Encoding method for sequential recommendation, abbreviated as CAPE. To the best of our knowledge, CAPE is the first PE method specifically designed for sequential recommendation. Comprehensive experiments conducted on benchmark SR datasets demonstrate that CAPE consistently enhances multiple mainstream backbone models and achieves state-of-the-art performance, across small and large scale model size. Furthermore, we deployed CAPE in an industrial setting on a real-world commercial platform, clearly showcasing the effectiveness of our approach. Our source code is available at https://github.com/yjdy/CAPE.
中文: 本文提出了一种专为序列推荐设计的新型上下文感知位置编码方法CAPE,该方法在实验和工业应用中均能提升主流模型性能并达到最先进水平。
English: The paper introduces CAPE, a novel Contextual-Aware Position Encoding method specifically designed for sequential recommendation, which enhances mainstream models and achieves state-of-the-art performance in both experimental and industrial settings.

Authors:Xiao Wang, Jingtao Jiang, Dong Li, Futian Wang, Lin Zhu, Yaowei Wang, Yongyong Tian, Jin Tang
Title: EventSTR: A Benchmark Dataset and Baselines for Event Stream based Scene Text Recognition
Abstract:
Mainstream Scene Text Recognition (STR) algorithms are developed based on RGB cameras which are sensitive to challenging factors such as low illumination, motion blur, and cluttered backgrounds. In this paper, we propose to recognize the scene text using bio-inspired event cameras by collecting and annotating a large-scale benchmark dataset, termed EventSTR. It contains 9,928 high-definition (1280 * 720) event samples and involves both Chinese and English characters. We also benchmark multiple STR algorithms as the baselines for future works to compare. In addition, we propose a new event-based scene text recognition framework, termed SimC-ESTR. It first extracts the event features using a visual encoder and projects them into tokens using a Q-former module. More importantly, we propose to augment the vision tokens based on a memory mechanism before feeding into the large language models. A similarity-based error correction mechanism is embedded within the large language model to correct potential minor errors fundamentally based on contextual information. Extensive experiments on the newly proposed EventSTR dataset and two simulation STR datasets fully demonstrate the effectiveness of our proposed model. We believe that the dataset and algorithmic model can innovatively propose an event-based STR task and are expected to accelerate the application of event cameras in various industries. The source code and pre-trained models will be released on https://github.com/Event-AHU/EventSTR
中文: 本文提出了基于事件相机的大规模场景文本识别数据集EventSTR和新型框架SimC-ESTR,该框架通过记忆增强的视觉标记和基于相似度的纠错机制显著提升了识别性能,在多个数据集上验证了其有效性。
English: This paper introduces EventSTR, a large-scale dataset for bio-inspired event camera-based scene text recognition, and proposes SimC-ESTR, a novel framework that enhances text recognition through memory-augmented vision tokens and similarity-based error correction, demonstrating superior performance on multiple datasets.

Authors:Shin'ya Yamaguchi, Kosuke Nishida, Daiki Chijiwa, Yasutoshi Ida
Title: Zero-shot Concept Bottleneck Models
Abstract:
Concept bottleneck models (CBMs) are inherently interpretable and intervenable neural network models, which explain their final label prediction by the intermediate prediction of high-level semantic concepts. However, they require target task training to learn input-to-concept and concept-to-label mappings, incurring target dataset collections and training resources. In this paper, we present \textit{zero-shot concept bottleneck models} (Z-CBMs), which predict concepts and labels in a fully zero-shot manner without training neural networks. Z-CBMs utilize a large-scale concept bank, which is composed of millions of vocabulary extracted from the web, to describe arbitrary input in various domains. For the input-to-concept mapping, we introduce concept retrieval, which dynamically finds input-related concepts by the cross-modal search on the concept bank. In the concept-to-label inference, we apply concept regression to select essential concepts from the retrieved concepts by sparse linear regression. Through extensive experiments, we confirm that our Z-CBMs provide interpretable and intervenable concepts without any additional training. Code will be available at https://github.com/yshinya6/zcbm.
中文: 零样本概念瓶颈模型(Z-CBMs)通过从大规模网络概念库中动态检索相关概念,并利用稀疏回归筛选关键概念,实现了无需训练即可进行可解释和可干预的预测。
English: Zero-shot concept bottleneck models (Z-CBMs) enable interpretable predictions by dynamically retrieving relevant concepts from a large-scale web-based concept bank and selecting essential ones through sparse regression, eliminating the need for target task training.

Authors:Quan Wei, Chung-Yiu Yau, Hoi-To Wai, Yang Katie Zhao, Dongyeop Kang, Youngsuk Park, Mingyi Hong
Title: RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models
Abstract:
Supervised fine-tuning is a standard method for adapting pre-trained large language models (LLMs) to downstream tasks. Quantization has been recently studied as a post-training technique for efficient LLM deployment. To obtain quantized fine-tuned LLMs, conventional pipelines would first fine-tune the pre-trained models, followed by post-training quantization. This often yields suboptimal performance as it fails to leverage the synergy between fine-tuning and quantization. To effectively realize low-bit quantization of weights, activations and KV caches in LLMs, we propose an algorithm named Rotated Straight-Through-Estimator (RoSTE), which combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy that identifies an effective rotation configuration to reduce activation outliers. We provide theoretical insights on RoSTE by analyzing its prediction error when applied to an overparameterized least square quantized training problem. Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration. Experiments on Pythia, Qwen and Llama models of different sizes demonstrate the effectiveness of RoSTE. Compared to existing post-SFT quantization baselines, our method consistently achieves superior performances across various tasks and different LLM architectures. Our code is available at https://github.com/OptimAI-Lab/RoSTE.
中文摘要:RoSTE是一种创新算法,结合量化感知微调与自适应旋转策略,有效提升大语言模型的低比特量化效果,相比传统方法在不同模型和任务中均表现更优。
English Summary: RoSTE is a novel algorithm that integrates quantization-aware fine-tuning with adaptive rotation to enhance low-bit quantization of LLMs, achieving superior performance across various models and tasks compared to conventional methods.

Authors:Hong Kiat Tan, Andrea L. Bertozzi
Title: Generic Structural Stability for $2 \times 2$ Systems of Hyperbolic Conservation Laws
Abstract:
This paper presents a proof of generic structural stability for Riemann solutions to $2 \times 2$ system of hyperbolic conservation laws in one spatial variable, without diffusive terms. This means that for almost every left and right state, shocks and rarefaction solutions of the same type are preserved via perturbations of the flux functions, the left state, and the right state. The main assumptions for this proof involve standard assumptions on strict hyperbolicity and genuine non-linearity, a technical assumption on directionality of rarefaction curves, and the regular manifold (submersion) assumption motivated by concepts in differential topology. We show that the structural stability of the Riemann solutions is related to the transversality of the Hugoniot loci and rarefaction curves in the state space. The regular manifold assumption is required to invoke a variant of a theorem from differential topology, Thom's parametric transversality theorem, to show the genericity of transversality of these curves. This in turn implies the genericity of structural stability. We then apply this theorem to two examples: the p-system and a $2 \times 2$ system governing the evolution of gravity-driven monodisperse particle-laden thin films. In particular, we illustrate how one can verify all the above assumptions for the former, and apply the theorem to different numerical and physical aspects of the system governing the latter.
中文: 本文证明了$2 \times 2$双曲守恒律中黎曼解的通用结构稳定性,表明在标准双曲性和横截性假设下,扰动能保持激波与稀疏波类型,并以p-系统和颗粒薄层系统为例进行了验证。
English: This paper proves the generic structural stability of Riemann solutions for $2 \times 2$ hyperbolic conservation laws, showing that perturbations preserve shock and rarefaction types under standard hyperbolicity and transversality assumptions, with applications to the p-system and particle-laden thin films.

Authors:Zihao Li, Xiao Lin, Zhining Liu, Jiaru Zou, Ziwei Wu, Lecheng Zheng, Dongqi Fu, Yada Zhu, Hendrik Hamann, Hanghang Tong, Jingrui He
Title: Language in the Flow of Time: Time-Series-Paired Texts Weaved into a Unified Temporal Narrative
Abstract:
While many advances in time series models focus exclusively on numerical data, research on multimodal time series, particularly those involving contextual textual information commonly encountered in real-world scenarios, remains in its infancy. With recent progress in large language models and time series learning, we revisit the integration of paired texts with time series through the Platonic Representation Hypothesis, which posits that representations of different modalities converge to shared spaces. In this context, we identify that time-series-paired texts may naturally exhibit periodic properties that closely mirror those of the original time series. Building on this insight, we propose a novel framework, Texts as Time Series (TaTS), which considers the time-series-paired texts to be auxiliary variables of the time series. TaTS can be plugged into any existing numerical-only time series models and enable them to handle time series data with paired texts effectively. Through extensive experiments on both multimodal time series forecasting and imputation tasks across benchmark datasets with various existing time series models, we demonstrate that TaTS can enhance predictive performance without modifying model architectures. Code available at https://github.com/iDEA-iSAIL-Lab-UIUC/TaTS.
中文摘要:提出的“文本即时间序列”(TaTS)框架通过将具有周期特性的配对文本视为时间序列的辅助变量,使现有数值时间序列模型无需改动架构即可有效融合文本数据,从而提升预测和填补任务的性能。
English Summary: The proposed Texts as Time Series (TaTS) framework enables existing numerical time series models to effectively incorporate paired textual data by treating texts as auxiliary variables with periodic properties, enhancing performance in forecasting and imputation tasks without architectural changes.

Authors:Kyungsu Kim, Junghyun Koo, Sungho Lee, Haesun Joung, Kyogu Lee
Title: TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument
Abstract:
Recent advancements in neural audio codecs have enabled the use of tokenized audio representations in various audio generation tasks, such as text-to-speech, text-to-audio, and text-to-music generation. Leveraging this approach, we propose TokenSynth, a novel neural synthesizer that utilizes a decoder-only transformer to generate desired audio tokens from MIDI tokens and CLAP (Contrastive Language-Audio Pretraining) embedding, which has timbre-related information. Our model is capable of performing instrument cloning, text-to-instrument synthesis, and text-guided timbre manipulation without any fine-tuning. This flexibility enables diverse sound design and intuitive timbre control. We evaluated the quality of the synthesized audio, the timbral similarity between synthesized and target audio/text, and synthesis accuracy (i.e., how accurately it follows the input MIDI) using objective measures. TokenSynth demonstrates the potential of leveraging advanced neural audio codecs and transformers to create powerful and versatile neural synthesizers. The source code, model weights, and audio demos are available at: https://github.com/KyungsuKim42/tokensynth
中文:TokenSynth是一种新型神经合成器,通过仅解码器转换器从MIDI令牌和音色嵌入生成音频,无需微调即可实现乐器克隆和文本引导的音色操控。
English: TokenSynth is a novel neural synthesizer that uses a decoder-only transformer to generate audio from MIDI tokens and timbre embeddings, enabling instrument cloning and text-guided manipulation without fine-tuning.

Authors:Max Rudolph, Nathan Lichtle, Sobhan Mohammadpour, Alexandre Bayen, J. Zico Kolter, Amy Zhang, Gabriele Farina, Eugene Vinitsky, Samuel Sokota
Title: Reevaluating Policy Gradient Methods for Imperfect-Information Games
Abstract:
In the past decade, motivated by the putative failure of naive self-play deep reinforcement learning (DRL) in adversarial imperfect-information games, researchers have developed numerous DRL algorithms based on fictitious play (FP), double oracle (DO), and counterfactual regret minimization (CFR). In light of recent results of the magnetic mirror descent algorithm, we hypothesize that simpler generic policy gradient methods like PPO are competitive with or superior to these FP-, DO-, and CFR-based DRL approaches. To facilitate the resolution of this hypothesis, we implement and release the first broadly accessible exact exploitability computations for four large games. Using these games, we conduct the largest-ever exploitability comparison of DRL algorithms for imperfect-information games. Over 5600 training runs, we find that FP-, DO-, and CFR-based approaches fail to outperform generic policy gradient methods. Code is available at https://github.com/nathanlct/IIG-RL-Benchmark and https://github.com/gabrfarina/exp-a-spiel .
Chinese: 最新研究表明,在非完美信息博弈中,简单的策略梯度方法(如PPO)优于基于虚拟博弈、双预言机和反事实遗憾最小化的复杂算法,这一结论基于超过5600次训练运行的大规模可利用性比较得出。
English: Recent research demonstrates that simpler policy gradient methods, such as PPO, outperform more complex approaches based on fictitious play, double oracle, and counterfactual regret minimization in imperfect-information games, as shown by extensive exploitability comparisons across over 5600 training runs.

Authors:Razvan-Gabriel Dumitru, Minglai Yang, Vikas Yadav, Mihai Surdeanu
Title: CopySpec: Accelerating LLMs with Speculative Copy-and-Paste Without Compromising Quality
Abstract:
We introduce CopySpec, a simple yet effective technique to tackle the inefficiencies LLMs face when generating responses that closely resemble previous outputs or responses that can be verbatim extracted from context. CopySpec identifies repeated sequences in the model's chat history or context and speculates that the same tokens will follow, enabling seamless copying without compromising output quality and without requiring additional GPU memory. To evaluate the effectiveness of our approach, we conducted experiments using seven LLMs and five datasets: MT-Bench, CNN/DM, GSM8K, HumanEval, and our newly created dataset, MT-Redundant. MT-Redundant, introduced in this paper, transforms the second turn of MT-Bench into a request for variations of the first turn's answer, simulating real-world scenarios where users request modifications to prior responses. Our results demonstrate significant speed-ups: up to 2.35x on CNN/DM, 3.08x on the second turn of select MT-Redundant categories, and 2.66x on the third turn of GSM8K's self-correction tasks. Importantly, we show that CopySpec integrates seamlessly with speculative decoding, yielding an average 49% additional speed-up over speculative decoding for the second turn of MT-Redundant across all eight categories. While LLMs, even with speculative decoding, suffer from slower inference as context size grows, CopySpec leverages larger contexts to accelerate inference, making it a faster complementary solution. Our code and dataset are publicly available at https://github.com/RazvanDu/CopySpec.
中文: CopySpec 是一种通过识别并复用对话历史中的重复序列来加速大语言模型推理的技术,无需额外显存即可实现显著加速且不损失输出质量。
English: CopySpec is a technique that accelerates LLM inference by identifying and reusing repeated sequences from chat history or context, achieving significant speed-ups without extra GPU memory or quality loss.

Authors:Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, Ehsaneddin Asgari
Title: Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Abstract:
Large Language Models (LLMs) suffer from hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information for improved factual grounding. With advances in multimodal learning, Multimodal RAG extends this approach by incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges beyond those in unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, benchmarks, metrics, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, loss functions, and agent-based approaches, while also exploring the diverse Multimodal RAG scenarios. In addition, we outline open challenges and future directions to guide research in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases. All resources are publicly available at https://github.com/llm-lab-org/Multimodal-RAG-Survey.
中文: 大型语言模型存在幻觉和知识过时问题,检索增强生成通过整合外部动态信息来缓解,而多模态检索增强生成则融合文本、图像等多类数据以提升效果,但在跨模态对齐和推理方面带来了新挑战。
English: Large Language Models face issues like hallucinations and outdated knowledge, which Retrieval-Augmented Generation addresses by incorporating external dynamic information, and Multimodal RAG further enhances this by integrating multiple data types while presenting unique challenges in cross-modal alignment and reasoning.

Authors:Jocelyn Dzuong
Title: DejAIvu: Identifying and Explaining AI Art on the Web in Real-Time with Saliency Maps
Abstract:
The recent surge in advanced generative models, such as diffusion models and generative adversarial networks (GANs), has led to an alarming rise in AI-generated images across various domains on the web. While such technologies offer benefits such as democratizing artistic creation, they also pose challenges in misinformation, digital forgery, and authenticity verification. Additionally, the uncredited use of AI-generated images in media and marketing has sparked significant backlash from online communities. In response to this, we introduce DejAIvu, a Chrome Web extension that combines real-time AI-generated image detection with saliency-based explainability while users browse the web. Using an ONNX-optimized deep learning model, DejAIvu automatically analyzes images on websites such as Google Images, identifies AI-generated content using model inference, and overlays a saliency heatmap to highlight AI-related artifacts. Our approach integrates efficient in-browser inference, gradient-based saliency analysis, and a seamless user experience, ensuring that AI detection is both transparent and interpretable. We also evaluate DejAIvu across multiple pretrained architectures and benchmark datasets, demonstrating high accuracy and low latency, making it a practical and deployable tool for enhancing AI image accountability. The code for this system can be found at https://github.com/Noodulz/dejAIvu.
Chinese: DejAIvu 是一款 Chrome 浏览器扩展,通过优化的深度学习模型实时检测 AI 生成图像并提供基于显著性的可解释热力图,为增强 AI 图像可信度提供透明高效的解决方案。
English: The DejAIvu Chrome extension uses an optimized deep learning model to detect AI-generated images in real-time and provides explainable saliency heatmaps, offering a transparent and efficient solution for enhancing AI image accountability.

Authors:Zifan He, Anderson Truong, Yingqi Cao, Jason Cong
Title: InTAR: Inter-Task Auto-Reconfigurable Accelerator Design for High Data Volume Variation in DNNs
Abstract:
The rise of deep neural networks (DNNs) has driven an increased demand for computing power and memory. Modern DNNs exhibit high data volume variation (HDV) across tasks, which poses challenges for FPGA acceleration: conventional accelerators rely on fixed execution patterns (dataflow or sequential) that can lead to pipeline stalls or necessitate frequent off-chip memory accesses. To address these challenges, we introduce the Inter-Task Auto-Reconfigurable Accelerator (InTAR), a novel accelerator design methodology for HDV applications on FPGAs. InTAR combines the high computational efficiency of sequential execution with the reduced off-chip memory overhead of dataflow execution. It switches execution patterns automatically with a static schedule determined before circuit design based on resource constraints and problem sizes. Unlike previous reconfigurable accelerators, InTAR encodes reconfiguration schedules during circuit design, allowing model-specific optimizations that allocate only the necessary logic and interconnects. Thus, InTAR achieves a high clock frequency with fewer resources and low reconfiguration time. Furthermore, InTAR supports high-level tools such as HLS for fast design generation. We implement a set of multi-task HDV DNN kernels using InTAR. Compared with dataflow and sequential accelerators, InTAR exhibits $\mathbf{1.8\times}$ and $\mathbf{7.1 \times}$ speedups correspondingly. Moreover, we extend InTAR to GPT-2 medium as a more complex example, which is $\mathbf{3.65 \sim 39.14\times}$ faster and a $\mathbf{1.72 \sim 10.44\times}$ more DSP efficient than SoTA accelerators (Allo and DFX) on FPGAs. Additionally, this design demonstrates $\mathbf{1.66 \sim 7.17\times}$ better power efficiency than GPUs. Code: https://github.com/OswaldHe/InTAR
中文:InTAR加速器通过在执行过程中动态切换顺序与数据流模式,有效应对深度神经网络中高数据量变化带来的挑战,相比现有FPGA和GPU方案实现了显著的加速效果与能效提升。
English: The InTAR accelerator addresses the challenges of high data volume variation in deep neural networks by dynamically switching between sequential and dataflow execution patterns, achieving significant speedups and improved efficiency over existing FPGA and GPU solutions.

Authors:Christopher Tosh, Boyuan Zhang, Wesley Tansey
Title: Treatment response as a latent variable
Abstract:
Scientists often need to analyze the samples in a study that responded to treatment in order to refine their hypotheses and find potential causal drivers of response. Natural variation in outcomes makes teasing apart responders from non-responders a statistical inference problem. To handle latent responses, we introduce the causal two-groups (C2G) model, a causal extension of the classical two-groups model. The C2G model posits that treated samples may or may not experience an effect, according to some prior probability. We propose two empirical Bayes procedures for the causal two-groups model, one under semi-parametric conditions and another under fully nonparametric conditions. The semi-parametric model assumes additive treatment effects and is identifiable from observed data. The nonparametric model is unidentifiable, but we show it can still be used to test for response in each treated sample. We show empirically and theoretically that both methods for selecting responders control the false discovery rate at the target level with near-optimal power. We also propose two novel estimands of interest and provide a strategy for deriving estimand intervals in the unidentifiable nonparametric model. On a cancer immunotherapy dataset, the nonparametric C2G model recovers clinically-validated predictive biomarkers of both positive and negative outcomes. Code is available at https://github.com/tansey-lab/causal2groups.
中文: C2G模型通过提出经验贝叶斯方法处理潜在治疗反应,在控制误发现率的同时识别应答者,其有效性经理论分析和癌症免疫治疗数据验证。
English: The C2G model addresses latent treatment responses by proposing empirical Bayes methods that control false discovery rates while identifying responders, validated through theoretical analysis and cancer immunotherapy data.

Authors:Joshua Omolegan, Pak Hei Yeung, Madeleine K. Wyburd, Linde Hesse, Monique Haak, Intergrowth-21st Consortium, Ana I. L. Namburete, Nicola K. Dinsdale
Title: Exploring Test Time Adaptation for Subcortical Segmentation of the Fetal Brain in 3D Ultrasound
Abstract:
Monitoring the growth of subcortical regions of the fetal brain in ultrasound (US) images can help identify the presence of abnormal development. Manually segmenting these regions is a challenging task, but recent work has shown that it can be automated using deep learning. However, applying pretrained models to unseen freehand US volumes often leads to a degradation of performance due to the vast differences in acquisition and alignment. In this work, we first demonstrate that test time adaptation (TTA) can be used to improve model performance in the presence of both real and simulated domain shifts. We further propose a novel TTA method by incorporating a normative atlas as a prior for anatomy. In the presence of various types of domain shifts, we benchmark the performance of different TTA methods and demonstrate the improvements brought by our proposed approach, which may further facilitate automated monitoring of fetal brain development. Our code is available at https://github.com/joshuaomolegan/TTA-for-3D-Fetal-Subcortical-Segmentation.
中文摘要:通过引入规范图谱作为解剖先验的测试时自适应方法,有效提升了深度学习模型在不同域偏移下对胎儿脑部皮层下区域超声图像分割的性能,有助于实现胎儿大脑发育的自动化监测。
English Summary: Test time adaptation, enhanced with a normative atlas prior, improves deep learning model performance for segmenting fetal brain subcortical regions in ultrasound images under domain shifts, facilitating automated monitoring of brain development.

Authors:Zhining Liu, Rana Ali Amjad, Ravinarayana Adkathimar, Tianxin Wei, Hanghang Tong
Title: SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence
Abstract:
Providing Language Models (LMs) with relevant evidence in the context (either via retrieval or user-provided) can significantly improve their ability to provide better-grounded responses. However, recent studies have found that LMs often struggle to fully comprehend and utilize key evidence from the context, especially when it contains noise and irrelevant information, an issue common in real-world scenarios. To address this, we propose SelfElicit, an inference-time approach that helps LMs focus on key contextual evidence through self-guided explicit highlighting. By leveraging the inherent evidence-finding capabilities of LMs using the attention scores of deeper layers, our method automatically identifies and emphasizes key evidence within the input context, facilitating more accurate and grounded responses without additional training or iterative prompting. We demonstrate that SelfElicit brings consistent and significant improvement on multiple evidence-based QA tasks for various LM families while maintaining computational efficiency. Our code and documentation are available at https://github.com/ZhiningLiu1998/SelfElicit.
Chinese: SelfElicit是一种推理时方法,通过利用深层注意力分数自动识别并突出上下文中的关键证据,帮助语言模型更好地利用上下文信息,从而在基于证据的任务上实现显著提升且无需额外训练。
English: SelfElicit is an inference-time method that enhances Language Models' ability to utilize key contextual evidence by automatically highlighting crucial information using deeper layer attention scores, leading to improved performance on evidence-based tasks without requiring additional training.

Authors:Raihan Seraj, Lili Meng, Tristan Sylvain
Title: Contextual bandits with entropy-based human feedback
Abstract:
In recent years, preference-based human feedback mechanisms have become essential for enhancing model performance across diverse applications, including conversational AI systems such as ChatGPT. However, existing approaches often neglect critical aspects, such as model uncertainty and the variability in feedback quality. To address these challenges, we introduce an entropy-based human feedback framework for contextual bandits, which dynamically balances exploration and exploitation by soliciting expert feedback only when model entropy exceeds a predefined threshold. Our method is model-agnostic and can be seamlessly integrated with any contextual bandit agent employing stochastic policies. Through comprehensive experiments, we show that our approach achieves significant performance improvements while requiring minimal human feedback, even under conditions of suboptimal feedback quality. This work not only presents a novel strategy for feedback solicitation but also highlights the robustness and efficacy of incorporating human guidance into machine learning systems. Our code is publicly available: https://github.com/BorealisAI/CBHF
中文摘要:本文提出了一种基于熵的上下文老虎机人工反馈框架,该框架仅在模型不确定性较高时动态请求专家输入,以最少的反馈量实现了显著性能提升,同时具备模型无关性并能适应不同的反馈质量。
English Summary: This paper introduces an entropy-based human feedback framework for contextual bandits that dynamically requests expert input only when model uncertainty is high, achieving significant performance gains with minimal feedback while being model-agnostic and robust to variable feedback quality.

Authors:Randolph W. Linderman, Yiran Chen, Scott W. Linderman
Title: A Bayesian Nonparametric Perspective on Mahalanobis Distance for Out of Distribution Detection
Abstract:
Bayesian nonparametric methods are naturally suited to the problem of out-of-distribution (OOD) detection. However, these techniques have largely been eschewed in favor of simpler methods based on distances between pre-trained or learned embeddings of data points. Here we show a formal relationship between Bayesian nonparametric models and the relative Mahalanobis distance score (RMDS), a commonly used method for OOD detection. Building on this connection, we propose Bayesian nonparametric mixture models with hierarchical priors that generalize the RMDS. We evaluate these models on the OpenOOD detection benchmark and show that Bayesian nonparametric methods can improve upon existing OOD methods, especially in regimes where training classes differ in their covariance structure and where there are relatively few data points per class.
中文: 贝叶斯非参数方法与相对马哈拉诺比斯距离评分在分布外检测中存在形式关联,通过引入分层先验的混合模型,在协方差结构差异大和每类数据点较少的情况下显著优于现有方法。
English: Bayesian nonparametric methods are formally linked to the relative Mahalanobis distance score for OOD detection and, when enhanced with hierarchical priors, outperform existing methods in scenarios with varying covariance structures and limited data per class.

Authors:Areeg Fahad Rasheed, M. Zarkoosh, Shimam Amer Chasib, Safa F. Abbas
Title: Data Augmentation to Improve Large Language Models in Food Hazard and Product Detection
Abstract:
The primary objective of this study is to demonstrate the impact of data augmentation using ChatGPT-4o-mini on food hazard and product analysis. The augmented data is generated using ChatGPT-4o-mini and subsequently used to train two large language models: RoBERTa-base and Flan-T5-base. The models are evaluated on test sets. The results indicate that using augmented data helped improve model performance across key metrics, including recall, F1 score, precision, and accuracy, compared to using only the provided dataset. The full code, including model training and the augmented dataset, can be found in this repository: https://github.com/AREEG94FAHAD/food-hazard-prdouct-cls
本研究显示,利用ChatGPT-4o-mini进行数据增强显著提升了RoBERTa-base和Flan-T5-base模型在食品危害与产品分析中的表现,有效改进了召回率、F1分数、精确度和准确率。
This study demonstrates that data augmentation with ChatGPT-4o-mini significantly enhances the performance of RoBERTa-base and Flan-T5-base models in food hazard and product analysis, improving recall, F1 score, precision, and accuracy.

Authors:Renqi Jia, Xiaokun Zhang, Bowei He, Qiannan Zhu, Weitao Xu, Jiehao Chen, Chen Ma
Title: Beyond Models! Explainable Data Valuation and Metric Adaption for Recommendation
Abstract:
User behavior records serve as the foundation for recommender systems. While the behavior data exhibits ease of acquisition, it often suffers from varying quality. Current methods employ data valuation to discern high-quality data from low-quality data. However, they tend to employ black-box design, lacking transparency and interpretability. Besides, they are typically tailored to specific evaluation metrics, leading to limited generality across various tasks. To overcome these issues, we propose an explainable and versatile framework DVR which can enhance the efficiency of data utilization tailored to any requirements of the model architectures and evaluation metrics. For explainable data valuation, a data valuator is presented to evaluate the data quality via calculating its Shapley value from the game-theoretic perspective, ensuring robust mathematical properties and reliability. In order to accommodate various evaluation metrics, including differentiable and non-differentiable ones, a metric adapter is devised based on reinforcement learning, where a metric is treated as the reinforcement reward that guides model optimization. Extensive experiments conducted on various benchmarks verify that our framework can improve the performance of current recommendation algorithms on various metrics including ranking accuracy, diversity, and fairness. Specifically, our framework achieves up to 34.7\% improvements over existing methods in terms of representative NDCG metric. The code is available at https://github.com/renqii/DVR.
中文:提出的DVR框架通过博弈论视角的Shapley值实现可解释数据评估,并结合基于强化学习的指标适配器兼容各类评估标准,在多个基准测试中显著提升了推荐系统的综合性能。
English: The proposed DVR framework introduces an explainable and versatile approach to data valuation in recommender systems, using Shapley values for transparent quality assessment and a reinforcement learning-based metric adapter to accommodate various evaluation metrics, achieving significant performance improvements across multiple benchmarks.

Authors:Miranda Muqing Miao, Michael Kearns
Title: Hallucination, Monofacts, and Miscalibration: An Empirical Investigation
Abstract:
Hallucinated facts in large language models (LLMs) have recently been shown to obey a statistical lower bound determined by the monofact rate (related to the classical Good-Turing missing mass estimator) minus model miscalibration (Kalai & Vempala, 2024). We present the first empirical investigation of this three-way relationship in classical n-gram models and fine-tuned encoder-decoder Transformers. By generating training data from Pareto distributions with varying shape parameters, we systematically control the monofact rates and establish its positive relationship with hallucination. To bridge theory and practice, we derive an empirical analog of the hallucination bound by replacing the population miscalibration term (Section 2.1) with an empirical bin-wise KL divergence and confirm its practical viability. We then introduce selective upweighting -- a simple yet effective technique that strategically repeats as little as 5% of training examples -- to deliberately inject miscalibration into the model. This intervention reduces hallucination by up to 40%, challenging universal deduplication policies. Our experiments reveal a critical trade-off: selective upweighting maintains pre-injection levels of accuracy while substantially reducing hallucination, whereas standard training gradually improves accuracy but fails to address persistently high hallucination, indicating an inherent tension in optimization objectives.
中文: 研究表明,语言模型中的幻觉与单事实率呈正相关,通过选择性加权引入可控的校准偏差,可在保持准确性的同时将幻觉降低达40%,揭示了与标准训练方法之间的优化目标冲突。
English: The study demonstrates that hallucination in language models is positively related to monofact rates and can be reduced by up to 40% through selective upweighting, which introduces controlled miscalibration while maintaining accuracy, revealing a trade-off with standard training methods.

Authors:Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang
Title: RoToR: Towards More Reliable Responses for Order-Invariant Inputs
Abstract:
Mitigating positional bias of language models (LMs) for listwise inputs is a well-known and important problem (e.g., lost-in-the-middle). While zero-shot order-invariant LMs have been proposed to solve this issue, their success on practical listwise problems has been limited. In this work, as a first contribution, we identify and overcome two limitations to make zero-shot invariant LMs more practical: (1) training and inference distribution mismatch arising from modifying positional ID assignments to enforce invariance, and (2) failure to adapt to mixture of order-invariant and sensitive inputs in practical listwise problems. Then, to overcome these issues we propose (1) RoToR, a zero-shot invariant LM for genuinely order-invariant inputs with minimal modifications of positional IDs, and (2) Selective Routing, an adaptive framework that handles both order-invariant and order-sensitive inputs in listwise tasks. On the Lost in the middle (LitM), Knowledge Graph QA (KGQA), and MMLU benchmarks, we show that RoToR with Selective Routing can effectively handle practical listwise input tasks in a zero-shot manner (https://github.com/soyoung97/RoToR)
中文摘要:本研究通过提出最小化位置ID修改的RoToR模型和能同时处理顺序不变与顺序敏感输入的Selective Routing框架,解决了零样本顺序不变语言模型的局限性,并在多个基准测试中验证了其有效性。
English Summary: This study addresses limitations in zero-shot order-invariant language models by proposing RoToR with minimal positional ID modifications and Selective Routing to handle both order-invariant and order-sensitive inputs, demonstrating effectiveness on multiple benchmarks.

Authors:Huiyao Chen, Meishan Zhang, Jing Li, Min Zhang, Lilja Øvrelid, Jan Hajič, Hao Fei
Title: Semantic Role Labeling: A Systematical Survey
Abstract:
Semantic role labeling (SRL) is a central natural language processing (NLP) task aiming to understand the semantic roles within texts, facilitating a wide range of downstream applications. While SRL has garnered extensive and enduring research, there is currently a lack of a comprehensive survey that thoroughly organizes and synthesizes the field. This paper aims to review the entire research trajectory of the SRL community over the past two decades. We begin by providing a complete definition of SRL. To offer a comprehensive taxonomy, we categorize SRL methodologies into four key perspectives: model architectures, syntax feature modeling, application scenarios, and multi-modal extensions. Further, we discuss SRL benchmarks, evaluation metrics, and paradigm modeling approaches, while also exploring practical applications across various domains. Finally, we analyze future research directions in SRL, addressing the evolving role of SRL in the age of large language models (LLMs) and its potential impact on the broader NLP landscape. We maintain a public repository and consistently update related resources at: https://github.com/DreamH1gh/Awesome-SRL
中文摘要:本文系统梳理了语义角色标注领域二十年的研究进展,涵盖方法分类、评估体系、实际应用,并探讨了该技术在大语言模型时代的发展方向。
English Summary: This paper provides a comprehensive survey of semantic role labeling (SRL) research over the past two decades, covering methodology taxonomy, benchmarks, applications, and future directions in the context of large language models.

Authors:Shaina Raza, Rizwan Qureshi, Anam Zahid, Safiullah Kamawal, Ferhat Sadak, Joseph Fioresi, Muhammaed Saeed, Ranjan Sapkota, Aditya Jain, Anas Zafar, Muneeb Ul Hassan, Aizan Zafar, Hasan Maqbool, Ashmal Vayani, Jia Wu, Maged Shoman
Title: Who is Responsible? The Data, Models, Users or Regulations? A Comprehensive Survey on Responsible Generative AI for a Sustainable Future
Abstract:
Generative AI is moving rapidly from research into real world deployment across sectors, which elevates the need for responsible development, deployment, evaluation, and governance. To address this pressing challenge, in this study, we synthesize the landscape of responsible generative AI across methods, benchmarks, and policies, and connects governance expectations to concrete engineering practice. We follow a prespecified search and screening protocol focused on post-ChatGPT era with selective inclusion of foundational work for definitions, and we conduct a narrative and thematic synthesis. Three findings emerge; First, benchmark and practice coverage is dense for bias and toxicity but relatively sparse for privacy and provenance, deepfake and media integrity risk, and system level failure in tool using and agentic settings. Second, many evaluations remain static and task local, which limits evidence portability for audit and lifecycle assurance. Third, documentation and metric validity are inconsistent, which complicates comparison across releases and domains. We outline a research and practice agenda that prioritizes adaptive and multimodal evaluation, privacy and provenance testing, deepfake risk assessment, calibration and uncertainty reporting, versioned and documented artifacts, and continuous monitoring. Limitations include reliance on public artifacts and the focus period, which may under represent capabilities reported later. The survey offers a path to align development and evaluation with governance needs and to support safe, transparent, and accountable deployment across domains. Project page: https://anas-zafar.github.io/responsible-ai.github.io , GitHub: https://github.com/anas-zafar/Responsible-AI
中文摘要:本研究综合分析了负责任生成式AI的发展现状,发现隐私保护、深度伪造检测和系统评估方面存在不足,并提出适应性测试与持续监控的研究框架,以推动符合治理需求的安全部署。
English Summary: This study synthesizes the landscape of responsible generative AI, revealing gaps in privacy, deepfake detection, and system evaluation while proposing an agenda for adaptive testing and continuous monitoring to align development with governance needs.

Authors:Andrianos Michail, Simon Clematide, Rico Sennrich
Title: Examining Multilingual Embedding Models Cross-Lingually Through LLM-Generated Adversarial Examples
Abstract:
The evaluation of cross-lingual semantic search models is often limited to existing datasets from tasks such as information retrieval and semantic textual similarity. We introduce Cross-Lingual Semantic Discrimination (CLSD), a lightweight evaluation task that requires only parallel sentences and a Large Language Model (LLM) to generate adversarial distractors. CLSD measures an embedding model's ability to rank the true parallel sentence above semantically misleading but lexically similar alternatives. As a case study, we construct CLSD datasets for German--French in the news domain. Our experiments show that models fine-tuned for retrieval tasks benefit from pivoting through English, whereas bitext mining models perform best in direct cross-lingual settings. A fine-grained similarity analysis further reveals that embedding models differ in their sensitivity to linguistic perturbations. We release our code and datasets under AGPL-3.0: https://github.com/impresso/cross_lingual_semantic_discrimination
Chinese: 本文提出跨语言语义辨别(CLSD)评估方法,通过平行句和大型语言模型生成干扰项来测试嵌入模型识别真实翻译的能力,实验表明不同模型在跨语言场景下表现各异,检索优化模型需经英语中转,而双语挖掘模型直接跨语言效果最佳。
English: This paper introduces Cross-Lingual Semantic Discrimination (CLSD), a lightweight evaluation method using parallel sentences and LLM-generated distractors to assess embedding models' ability to identify true translations over misleading alternatives, with experiments revealing different model behaviors in cross-lingual settings.

Authors:Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, Alan Yuille
Title: Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
Abstract:
Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present Spatial457, a scalable and unbiased synthetic dataset designed with 4 key capability for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single object recognition to our new proposed complex 6D spatial reasoning tasks. We evaluated various large multimodal models (LMMs) on PulseCheck457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings. The code and data are released in https://github.com/XingruiWang/Spatial457.
中文: 该摘要介绍了Spatial457合成数据集,旨在通过多难度评估框架测试大型多模态模型的6D空间推理能力,发现模型在3D任务中表现下降并识别出预测偏差。
English: This abstract introduces Spatial457, a synthetic dataset designed to evaluate large multimodal models' 6D spatial reasoning capabilities across varying complexities, revealing performance declines in 3D tasks and identifying prediction biases through a novel evaluation framework.

Authors:Karish Grover, Geoffrey J. Gordon, Christos Faloutsos
Title: CurvGAD: Leveraging Curvature for Enhanced Graph Anomaly Detection
Abstract:
Does the intrinsic curvature of complex networks hold the key to unveiling graph anomalies that conventional approaches overlook? Reconstruction-based graph anomaly detection (GAD) methods overlook such geometric outliers, focusing only on structural and attribute-level anomalies. To this end, we propose CurvGAD - a mixed-curvature graph autoencoder that introduces the notion of curvature-based geometric anomalies. CurvGAD introduces two parallel pipelines for enhanced anomaly interpretability: (1) Curvature-equivariant geometry reconstruction, which focuses exclusively on reconstructing the edge curvatures using a mixed-curvature, Riemannian encoder and Gaussian kernel-based decoder; and (2) Curvature-invariant structure and attribute reconstruction, which decouples structural and attribute anomalies from geometric irregularities by regularizing graph curvature under discrete Ollivier-Ricci flow, thereby isolating the non-geometric anomalies. By leveraging curvature, CurvGAD refines the existing anomaly classifications and identifies new curvature-driven anomalies. Extensive experimentation over 10 real-world datasets (both homophilic and heterophilic) demonstrates an improvement of up to 6.5% over state-of-the-art GAD methods. The code is available at: https://github.com/karish-grover/curvgad.
Chinese: CurvGAD提出了一种新的图异常检测方法,通过混合曲率重构识别几何异常,相比现有方法将检测准确率最高提升了6.5%。
English: CurvGAD introduces a novel graph anomaly detection method that identifies geometric outliers through mixed-curvature reconstruction, improving detection accuracy by up to 6.5% over existing approaches.

Authors:Peiyao Xiao, Chaosheng Dong, Shaofeng Zou, Kaiyi Ji
Title: LDC-MTL: Balancing Multi-Task Learning through Scalable Loss Discrepancy Control
Abstract:
Multi-task learning (MTL) has been widely adopted for its ability to simultaneously learn multiple tasks. While existing gradient manipulation methods often yield more balanced solutions than simple scalarization-based approaches, they typically incur a significant computational overhead of $\mathcal{O}(K)$ in both time and memory, where $K$ is the number of tasks. In this paper, we propose LDC-MTL, a simple and scalable loss discrepancy control approach for MTL, formulated from a bilevel optimization perspective. Our method incorporates three key components: (i) a coarse loss pre-normalization, (ii) a bilevel formulation for fine-grained loss discrepancy control, and (iii) a scalable first-order bilevel algorithm that requires only $\mathcal{O}(1)$ time and memory. Theoretically, we prove that LDC-MTL guarantees convergence not only to a stationary point of the bilevel problem with loss discrepancy control but also to an $ε$-accurate Pareto stationary point for all $K$ loss functions under mild conditions. Extensive experiments on diverse multi-task datasets demonstrate the superior performance of LDC-MTL in both accuracy and efficiency. Code is available at https://github.com/OptMN-Lab/LDC-MTL.
中文: LDC-MTL是一种基于双层优化的可扩展多任务学习方法,能以O(1)计算成本控制损失差异,在实验中展现出优越性能并具有理论收敛保证。
English: LDC-MTL is a scalable multi-task learning method that uses bilevel optimization to control loss discrepancy with only O(1) computational cost, achieving both theoretical convergence guarantees and superior performance in experiments.

Authors:Zhikai Wu, Sifan Wang, Shiyang Zhang, Sizhuang He, Min Zhu, Anran Jiao, Lu Lu, David van Dijk
Title: TANTE: Time-Adaptive Operator Learning via Neural Taylor Expansion
Abstract:
Operator learning for time-dependent partial differential equations (PDEs) has seen rapid progress in recent years, enabling efficient approximation of complex spatiotemporal dynamics. However, most existing methods rely on fixed time step sizes during rollout, which limits their ability to adapt to varying temporal complexity and often leads to error accumulation. Here, we propose the Time-Adaptive Transformer with Neural Taylor Expansion (TANTE), a novel operator-learning framework that produces continuous-time predictions with adaptive step sizes. TANTE predicts future states by performing a Taylor expansion at the current state, where neural networks learn both the higher-order temporal derivatives and the local radius of convergence. This allows the model to dynamically adjust its rollout based on the local behavior of the solution, thereby reducing cumulative error and improving computational efficiency. We demonstrate the effectiveness of TANTE across a wide range of PDE benchmarks, achieving superior accuracy and adaptability compared to fixed-step baselines, delivering accuracy gains of 60-80 % and speed-ups of 30-40 % at inference time. The code is publicly available at https://github.com/zwu88/TANTE for transparency and reproducibility.
Chinese: 时间自适应变换器与神经泰勒展开(TANTE)是一种新型算子学习框架,通过动态调整时间步长求解偏微分方程,相比固定步长方法实现了60-80%的精度提升和30-40%的计算效率提升。
English: The Time-Adaptive Transformer with Neural Taylor Expansion (TANTE) is a novel operator-learning framework that dynamically adjusts time steps during PDE solution rollout, achieving significant improvements in accuracy (60-80%) and computational efficiency (30-40%) compared to fixed-step methods.

Authors:Lemuel Puglisi, Daniel C. Alexander, Daniele Ravì
Title: Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion
Abstract:
The growing availability of longitudinal Magnetic Resonance Imaging (MRI) datasets has facilitated Artificial Intelligence (AI)-driven modeling of disease progression, making it possible to predict future medical scans for individual patients. However, despite significant advancements in AI, current methods continue to face challenges including achieving patient-specific individualization, ensuring spatiotemporal consistency, efficiently utilizing longitudinal data, and managing the substantial memory demands of 3D scans. To address these challenges, we propose Brain Latent Progression (BrLP), a novel spatiotemporal model designed to predict individual-level disease progression in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates in a small latent space, mitigating the computational challenges posed by high-dimensional imaging data; (ii) it explicitly integrates subject metadata to enhance the individualization of predictions; (iii) it incorporates prior knowledge of disease dynamics through an auxiliary model, facilitating the integration of longitudinal data; and (iv) it introduces the Latent Average Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in the predicted progression at inference time and (b) allows us to derive a measure of the uncertainty for the prediction at the global and voxel level. We train and evaluate BrLP on 11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its generalizability on an external test set comprising 2,257 MRIs from 962 subjects. Our experiments compare BrLP-generated MRI scans with real follow-up MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The code is publicly available at: https://github.com/LemuelPuglisi/BrLP.
中文: 提出的Brain Latent Progression(BrLP)模型通过在紧凑潜在空间中运行、整合个体元数据和疾病动态知识,并利用新型稳定化算法保证时空一致性,有效解决了AI驱动疾病进展预测中的关键难题。
English: The proposed Brain Latent Progression (BrLP) model addresses key challenges in AI-driven disease progression prediction by operating in a compact latent space, integrating subject metadata and disease dynamics, and ensuring spatiotemporal consistency through its novel stabilization algorithm.

Authors:Kevin Flanagan, Dima Damen, Michael Wray
Title: Moment of Untruth: Dealing with Negative Queries in Video Moment Retrieval
Abstract:
Video Moment Retrieval is a common task to evaluate the performance of visual-language models - it involves localising start and end times of moments in videos from query sentences. The current task formulation assumes that the queried moment is present in the video, resulting in false positive moment predictions when irrelevant query sentences are provided. In this paper we propose the task of Negative-Aware Video Moment Retrieval (NA-VMR), which considers both moment retrieval accuracy and negative query rejection accuracy. We make the distinction between In-Domain and Out-of-Domain negative queries and provide new evaluation benchmarks for two popular video moment retrieval datasets: QVHighlights and Charades-STA. We analyse the ability of current SOTA video moment retrieval approaches to adapt to Negative-Aware Video Moment Retrieval and propose UniVTG-NA, an adaptation of UniVTG designed to tackle NA-VMR. UniVTG-NA achieves high negative rejection accuracy (avg. $98.4\%$) scores while retaining moment retrieval scores to within $3.87\%$ Recall@1. Dataset splits and code are available at https://github.com/keflanagan/MomentofUntruth
中文摘要:本文提出负感知视频片段检索新任务,通过区分域内与域外负查询构建评估基准,所提UniVTG-NA模型在保持检索性能的同时实现98.4%的平均负样本拒绝准确率。
English Summary: This paper introduces Negative-Aware Video Moment Retrieval (NA-VMR), a new task that evaluates both moment localization accuracy and negative query rejection, proposing UniVTG-NA model which maintains high retrieval performance while achieving 98.4% average negative rejection accuracy.

Authors:Yuchang Zhu, Huizhe Zhang, Bingzhe Wu, Jintang Li, Zibin Zheng, Peilin Zhao, Liang Chen, Yatao Bian
Title: Measuring Diversity in Synthetic Datasets
Abstract:
Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets-an aspect crucial for robust model performance-remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing methods. Code is available at: https://github.com/bluewhalelab/dcscore.
Chinese: 本文提出DCScore这一新方法,通过将多样性评估构建为样本分类任务来衡量大语言模型生成的合成数据集的多样性,实验证明该方法与多样性基准相关性更强,且相比现有方法显著降低了计算成本。
English: This paper introduces DCScore, a novel method for evaluating the diversity of synthetic datasets generated by large language models by framing it as a classification task, which demonstrates stronger correlation with diversity benchmarks and reduces computational costs compared to existing approaches.

Authors:Jiahe Jin, Yanheng He, Mingyan Yang
Title: Revisiting 3D LLM Benchmarks: Are We Really Testing 3D Capabilities?
Abstract:
In this work, we identify the "2D-Cheating" problem in 3D LLM evaluation, where these tasks might be easily solved by VLMs with rendered images of point clouds, exposing ineffective evaluation of 3D LLMs' unique 3D capabilities. We test VLM performance across multiple 3D LLM benchmarks and, using this as a reference, propose principles for better assessing genuine 3D understanding. We also advocate explicitly separating 3D abilities from 1D or 2D aspects when evaluating 3D LLMs. Code and data are available at https://github.com/LLM-class-group/Revisiting-3D-LLM-Benchmarks
中文摘要:本研究揭示了3D大语言模型评估中的"二维作弊"问题,即视觉语言模型可通过处理点云渲染图像规避真正的3D理解,并提出了区分三维能力与低维特征的评估新标准。
English Summary: This study reveals the "2D-Cheating" issue in 3D LLM evaluation, demonstrating how visual language models can bypass true 3D understanding by processing rendered images, and proposes new assessment principles to better measure genuine 3D capabilities.

Authors:Qifan Yu, Zhenyu He, Sijie Li, Xun Zhou, Jun Zhang, Jingjing Xu, Di He
Title: Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning
Abstract:
Chain-of-Thought (CoT) prompting has emerged as a powerful technique for enhancing language model's reasoning capabilities. However, generating long and correct CoT trajectories is challenging. Recent studies have demonstrated that Looped Transformers possess remarkable length generalization capabilities, but their limited generality and adaptability prevent them from serving as an alternative to auto-regressive solutions. To better leverage the strengths of Looped Transformers, we propose RELAY (REasoning through Loop Alignment iterativelY). Specifically, we align the steps of Chain-of-Thought (CoT) reasoning with loop iterations and apply intermediate supervision during the training of Looped Transformers. This additional iteration-wise supervision not only preserves the Looped Transformer's ability for length generalization but also enables it to predict CoT reasoning steps for unseen data. Therefore, we leverage this Looped Transformer to generate accurate reasoning chains for complex problems that exceed the training length, which will then be used to fine-tune an auto-regressive model. We conduct extensive experiments, and the results demonstrate the effectiveness of our approach, with significant improvements in the performance of the auto-regressive model. Code will be released at https://github.com/qifanyu/RELAY.
Chinese: RELAY通过将思维链推理步骤与循环Transformer的迭代对齐,使其能够为超出训练长度的复杂问题生成准确推理链,进而用于微调自回归模型以提升性能。
English: RELAY aligns CoT reasoning steps with loop iterations in Looped Transformers, enabling them to generate accurate reasoning chains for complex problems beyond training length, which are then used to fine-tune auto-regressive models for improved performance.

Authors:Thomas Cass, Francesco Piatti, Jeffrey Pei
Title: Numerical Schemes for Signature Kernels
Abstract:
Signature kernels have emerged as a powerful tool within kernel methods for sequential data. In the paper "The Signature Kernel is the solution of a Goursat PDE", the authors identify a kernel trick that demonstrates that, for continuously differentiable paths, the signature kernel satisfies a Goursat problem for a hyperbolic partial differential equation (PDE) in two independent time variables. While finite difference methods have been explored for this PDE, they face limitations in accuracy and stability when handling highly oscillatory inputs. In this work, we introduce two advanced numerical schemes that leverage polynomial representations of boundary conditions through either approximation or interpolation techniques, and rigorously establish the theoretical convergence of the polynomial approximation scheme. Experimental evaluations reveal that our approaches yield improvements of several orders of magnitude in mean absolute percentage error (MAPE) compared to traditional finite difference schemes, without increasing computational complexity. Furthermore, like finite difference methods, our algorithms can be GPU-parallelized to reduce computational complexity from quadratic to linear in the length of the input sequences, thereby improving scalability for high-frequency data. We have implemented these algorithms in a dedicated Python library, which is publicly available at: https://github.com/FrancescoPiatti/polysigkernel.
中文: 本文提出了两种利用多项式表示的先进数值方案来解决签名核的Goursat偏微分方程,通过GPU并行化实现了精度数量级的提升和线性计算复杂度。
English: This paper introduces two advanced numerical schemes using polynomial representations to solve the signature kernel's Goursat PDE, achieving orders of magnitude improvement in accuracy and linear computational complexity via GPU parallelization.

Authors:Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou
Title: mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
Abstract:
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB Benchmark and superior multilingual performance on the XTD benchmark. Our codes, datasets and models are released in https://github.com/haon-chen/mmE5.
Chinese: 本研究提出了高质量合成多模态数据的三个标准——广泛覆盖、强跨模态对齐和高保真度,并据此生成数据集训练mmE5模型,在多个基准测试中取得了最优性能。
English: This study establishes three criteria for high-quality synthetic multimodal data—broad scope, robust cross-modal alignment, and high fidelity—and uses them to create datasets that train the mmE5 model, achieving state-of-the-art performance on benchmarks.

Authors:Daeyoung Roh, Donghee Han, Daehee Kim, Keejun Han, Mun Yi
Title: Closer through commonality: Enhancing hypergraph contrastive learning with shared groups
Abstract:
Hypergraphs provide a superior modeling framework for representing complex multidimensional relationships in the context of real-world interactions that often occur in groups, overcoming the limitations of traditional homogeneous graphs. However, there have been few studies on hypergraphbased contrastive learning, and existing graph-based contrastive learning methods have not been able to fully exploit the highorder correlation information in hypergraphs. Here, we propose a Hypergraph Fine-grained contrastive learning (HyFi) method designed to exploit the complex high-dimensional information inherent in hypergraphs. While avoiding traditional graph augmentation methods that corrupt the hypergraph topology, the proposed method provides a simple and efficient learning augmentation function by adding noise to node features. Furthermore, we expands beyond the traditional dichotomous relationship between positive and negative samples in contrastive learning by introducing a new relationship of weak positives. It demonstrates the importance of fine-graining positive samples in contrastive learning. Therefore, HyFi is able to produce highquality embeddings, and outperforms both supervised and unsupervised baselines in average rank on node classification across 10 datasets. Our approach effectively exploits high-dimensional hypergraph information, shows significant improvement over existing graph-based contrastive learning methods, and is efficient in terms of training speed and GPU memory cost. The source code is available at https://github.com/Noverse0/HyFi.git.
中文:HyFi方法通过引入特征噪声和弱正样本关系,有效利用超图的高维信息,在节点分类任务中展现出优于现有方法的性能。
English: The HyFi method introduces a hypergraph contrastive learning approach that enhances high-dimensional information utilization through feature noise injection and weak positive sample relationships, achieving superior performance in node classification tasks.

Authors:Ziyue Yang, Kehan Wang, Yuhang Ming, Yong Peng, Han Yang, Qiong Chen, Wanzeng Kong
Title: Uncertainty Aware Human-machine Collaboration in Camouflaged Object Detection
Abstract:
Camouflaged Object Detection (COD), the task of identifying objects concealed within their environments, has seen rapid growth due to its wide range of practical applications. A key step toward developing trustworthy COD systems is the estimation and effective utilization of uncertainty. In this work, we propose a human-machine collaboration framework for classifying the presence of camouflaged objects, leveraging the complementary strengths of computer vision (CV) models and noninvasive brain-computer interfaces (BCIs). Our approach introduces a multiview backbone to estimate uncertainty in CV model predictions, utilizes this uncertainty during training to improve efficiency, and defers low-confidence cases to human evaluation via RSVP-based BCIs during testing for more reliable decision-making. We evaluated the framework in the CAMO dataset, achieving state-of-the-art results with an average improvement of 4.56\% in balanced accuracy (BA) and 3.66\% in the F1 score compared to existing methods. For the best-performing participants, the improvements reached 7.6\% in BA and 6.66\% in the F1 score. Analysis of the training process revealed a strong correlation between our confidence measures and precision, while an ablation study confirmed the effectiveness of the proposed training policy and the human-machine collaboration strategy. In general, this work reduces human cognitive load, improves system reliability, and provides a strong foundation for advancements in real-world COD applications and human-computer interaction. Our code and data are available at: https://github.com/ziyuey/Uncertainty-aware-human-machine-collaboration-in-camouflaged-object-identification.
中文: 本研究提出了一种人机协作框架,通过结合计算机视觉不确定性估计与脑机接口技术,提升了伪装物体检测的性能与可靠性,并取得了领先的实验结果。
English: This study introduces a human-machine collaboration framework that enhances camouflaged object detection by integrating computer vision uncertainty estimation with brain-computer interfaces, achieving state-of-the-art performance and improved reliability.

Authors:Tianle Liu, Shuangming Zhao, Wanshou Jiang, Bingxuan Guo
Title: Sat-DN: Implicit Surface Reconstruction from Multi-View Satellite Images with Depth and Normal Supervision
Abstract:
With advancements in satellite imaging technology, acquiring high-resolution multi-view satellite imagery has become increasingly accessible, enabling rapid and location-independent ground model reconstruction. However, traditional stereo matching methods struggle to capture fine details, and while neural radiance fields (NeRFs) achieve high-quality reconstructions, their training time is prohibitively long. Moreover, challenges such as low visibility of building facades, illumination and style differences between pixels, and weakly textured regions in satellite imagery further make it hard to reconstruct reasonable terrain geometry and detailed building facades. To address these issues, we propose Sat-DN, a novel framework leveraging a progressively trained multi-resolution hash grid reconstruction architecture with explicit depth guidance and surface normal consistency constraints to enhance reconstruction quality. The multi-resolution hash grid accelerates training, while the progressive strategy incrementally increases the learning frequency, using coarse low-frequency geometry to guide the reconstruction of fine high-frequency details. The depth and normal constraints ensure a clear building outline and correct planar distribution. Extensive experiments on the DFC2019 dataset demonstrate that Sat-DN outperforms existing methods, achieving state-of-the-art results in both qualitative and quantitative evaluations. The code is available at https://github.com/costune/SatDN.
中文摘要:Sat-DN框架通过多分辨率哈希网格结合深度引导和表面法向约束,解决了卫星图像重建中的细节缺失和训练效率问题,在DFC2019数据集上实现了最优性能。
English Summary: The proposed Sat-DN framework overcomes limitations in satellite imagery reconstruction by combining multi-resolution hash grids with depth guidance and surface normal constraints, achieving state-of-the-art results through accelerated training and enhanced detail recovery.

Authors:Fenghe Tang, Qingsong Yao, Wenxin Ma, Chenxu Wu, Zihang Jiang, S. Kevin Zhou
Title: Hi-End-MAE: Hierarchical encoder-driven masked autoencoders are stronger vision learners for medical image segmentation
Abstract:
Medical image segmentation remains a formidable challenge due to the label scarcity. Pre-training Vision Transformer (ViT) through masked image modeling (MIM) on large-scale unlabeled medical datasets presents a promising solution, providing both computational efficiency and model generalization for various downstream tasks. However, current ViT-based MIM pre-training frameworks predominantly emphasize local aggregation representations in output layers and fail to exploit the rich representations across different ViT layers that better capture fine-grained semantic information needed for more precise medical downstream tasks. To fill the above gap, we hereby present Hierarchical Encoder-driven MAE (Hi-End-MAE), a simple yet effective ViT-based pre-training solution, which centers on two key innovations: (1) Encoder-driven reconstruction, which encourages the encoder to learn more informative features to guide the reconstruction of masked patches; and (2) Hierarchical dense decoding, which implements a hierarchical decoding structure to capture rich representations across different layers. We pre-train Hi-End-MAE on a large-scale dataset of 10K CT scans and evaluated its performance across seven public medical image segmentation benchmarks. Extensive experiments demonstrate that Hi-End-MAE achieves superior transfer learning capabilities across various downstream tasks, revealing the potential of ViT in medical imaging applications. The code is available at: https://github.com/FengheTan9/Hi-End-MAE
中文:Hi-End-MAE框架通过利用分层ViT表征和编码器驱动的重建,提升了医学图像分割的性能,在多个基准测试中表现卓越。
English: The Hi-End-MAE framework enhances medical image segmentation by leveraging hierarchical ViT representations and encoder-driven reconstruction, achieving superior performance across multiple benchmarks.

Authors:Keqi Chen, Lilien Schewski, Vinkle Srivastav, Joël Lavanchy, Didier Mutter, Guido Beldi, Sandra Keller, Nicolas Padoy
Title: When do they StOP?: A First Step Towards Automatically Identifying Team Communication in the Operating Room
Abstract:
Purpose: Surgical performance depends not only on surgeons' technical skills but also on team communication within and across the different professional groups present during the operation. Therefore, automatically identifying team communication in the OR is crucial for patient safety and advances in the development of computer-assisted surgical workflow analysis and intra-operative support systems. To take the first step, we propose a new task of detecting communication briefings involving all OR team members, i.e. the team Time-out and the StOP?-protocol, by localizing their start and end times in video recordings of surgical operations. Methods: We generate an OR dataset of real surgeries, called Team-OR, with more than one hundred hours of surgical videos captured by the multi-view camera system in the OR. The dataset contains temporal annotations of 33 Time-out and 22 StOP?-protocol activities in total. We then propose a novel group activity detection approach, where we encode both scene context and action features, and use an efficient neural network model to output the results. Results: The experimental results on the Team-OR dataset show that our approach outperforms existing state-of-the-art temporal action detection approaches. It also demonstrates the lack of research on group activities in the OR, proving the significance of our dataset. Conclusion: We investigate the Team Time-Out and the StOP?-protocol in the OR, by presenting the first OR dataset with temporal annotations of group activities protocols, and introducing a novel group activity detection approach that outperforms existing approaches. Code is available at https://github.com/CAMMA-public/Team-OR.
中文: 本研究提出了一种新的群体活动检测方法和Team-OR数据集,用于自动识别手术中的团队沟通简报,其性能优于现有方法。
English: This study introduces a novel group activity detection method and the Team-OR dataset to automatically identify team communication briefings in surgical operations, demonstrating superior performance over existing approaches.

Authors:Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho
Title: FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis
Abstract:
We present FloVD, a novel video diffusion model for camera-controllable video generation. FloVD leverages optical flow to represent the motions of the camera and moving objects. This approach offers two key benefits. Since optical flow can be directly estimated from videos, our approach allows for the use of arbitrary training videos without ground-truth camera parameters. Moreover, as background optical flow encodes 3D correlation across different viewpoints, our method enables detailed camera control by leveraging the background motion. To synthesize natural object motion while supporting detailed camera control, our framework adopts a two-stage video synthesis pipeline consisting of optical flow generation and flow-conditioned video synthesis. Extensive experiments demonstrate the superiority of our method over previous approaches in terms of accurate camera control and natural object motion synthesis.
Chinese: FloVD是一种新型视频扩散模型,利用光流实现相机可控的视频生成,无需真实相机参数即可使用任意视频进行训练,并在相机控制和自然物体运动方面表现出优越性能。
English: FloVD is a novel video diffusion model that uses optical flow to enable camera-controllable video generation, allowing training with arbitrary videos without ground-truth camera parameters and achieving superior camera control and natural object motion.

Authors:Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, Joseph E. Gonzalez
Title: The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks
Abstract:
Large Reasoning Models (LRMs) represent a breakthrough in AI problem-solving capabilities, but their effectiveness in interactive environments can be limited. This paper introduces and analyzes overthinking in LRMs. A phenomenon where models favor extended internal reasoning chains over environmental interaction. Through experiments on software engineering tasks using SWE Bench Verified, we observe three recurring patterns: Analysis Paralysis, Rogue Actions, and Premature Disengagement. We propose a framework to study these behaviors, which correlates with human expert assessments, and analyze 4018 trajectories. We observe that higher overthinking scores correlate with decreased performance, with reasoning models exhibiting stronger tendencies toward overthinking compared to non-reasoning models. Our analysis reveals that simple efforts to mitigate overthinking in agentic environments, such as selecting the solution with the lower overthinking score, can improve model performance by almost 30% while reducing computational costs by 43%. These results suggest that mitigating overthinking has strong practical implications. We suggest that by leveraging native function-calling capabilities and selective reinforcement learning overthinking tendencies could be mitigated. We also open-source our evaluation framework and dataset to facilitate research in this direction at https://github.com/AlexCuadron/Overthinking.
中文摘要:本文发现大型推理模型存在"过度思考"现象,即过度依赖内部推理而忽视环境交互会降低任务表现,研究表明简单缓解策略可使模型性能提升30%同时减少43%计算成本。
English Summary: This paper identifies "overthinking" in Large Reasoning Models as a counterproductive tendency toward excessive internal reasoning that reduces task performance, and demonstrates that simple mitigation strategies can boost performance by 30% while cutting computational costs by 43%.

Authors:Yilu Wu, Chenhui Zhu, Shuai Wang, Hanlin Wang, Jing Wang, Zhaoxiang Zhang, Limin Wang
Title: Learning Human Skill Generators at Key-Step Levels
Abstract:
We are committed to learning human skill generators at key-step levels. The generation of skills is a challenging endeavor, but its successful implementation could greatly facilitate human skill learning and provide more experience for embodied intelligence. Although current video generation models can synthesis simple and atomic human operations, they struggle with human skills due to their complex procedure process. Human skills involve multi-step, long-duration actions and complex scene transitions, so the existing naive auto-regressive methods for synthesizing long videos cannot generate human skills. To address this, we propose a novel task, the Key-step Skill Generation (KS-Gen), aimed at reducing the complexity of generating human skill videos. Given the initial state and a skill description, the task is to generate video clips of key steps to complete the skill, rather than a full-length video. To support this task, we introduce a carefully curated dataset and define multiple evaluation metrics to assess performance. Considering the complexity of KS-Gen, we propose a new framework for this task. First, a multimodal large language model (MLLM) generates descriptions for key steps using retrieval argument. Subsequently, we use a Key-step Image Generator (KIG) to address the discontinuity between key steps in skill videos. Finally, a video generation model uses these descriptions and key-step images to generate video clips of the key steps with high temporal consistency. We offer a detailed analysis of the results, hoping to provide more insights on human skill generation. All models and data are available at https://github.com/MCG-NJU/KS-Gen.
Chinese Summary: 我们提出了关键步骤技能生成(KS-Gen)新任务,通过生成关键步骤视频片段而非完整视频来简化人类技能视频制作,采用结合多模态大语言模型和图像生成的新框架以提升时间连贯性。
English Summary: We introduce the Key-step Skill Generation (KS-Gen) task to simplify human skill video creation by generating key-step clips instead of full videos, using a novel framework that combines multimodal language models and image generation for improved temporal consistency.

Authors:Junyi An, Chao Qu, Yun-Fei Shi, XinHao Liu, Qianwei Tang, Fenglei Cao, Yuan Qi
Title: Equivariant Masked Position Prediction for Efficient Molecular Representation
Abstract:
Graph neural networks (GNNs) have shown considerable promise in computational chemistry. However, the limited availability of molecular data raises concerns regarding GNNs' ability to effectively capture the fundamental principles of physics and chemistry, which constrains their generalization capabilities. To address this challenge, we introduce a novel self-supervised approach termed Equivariant Masked Position Prediction (EMPP), grounded in intramolecular potential and force theory. Unlike conventional attribute masking techniques, EMPP formulates a nuanced position prediction task that is more well-defined and enhances the learning of quantum mechanical features. EMPP also bypasses the approximation of the Gaussian mixture distribution commonly used in denoising methods, allowing for more accurate acquisition of physical properties. Experimental results indicate that EMPP significantly enhances performance of advanced molecular architectures, surpassing state-of-the-art self-supervised approaches. Our code is released in https://github.com/ajy112/EMPP
中文: 本研究提出的等变掩码位置预测方法通过精确的位置预测任务,有效增强了图神经网络对量子力学特征的学习能力,在分子建模中显著超越了现有自监督方法。
English: The proposed Equivariant Masked Position Prediction (EMPP) method enhances graph neural networks' ability to learn quantum mechanical features through precise position prediction, significantly outperforming existing self-supervised approaches in molecular modeling.

Authors:Xiaomeng Wang, Zhengyu Zhao, Martha Larson
Title: Typographic Attacks in a Multi-Image Setting
Abstract:
Large Vision-Language Models (LVLMs) are susceptible to typographic attacks, which are misclassifications caused by an attack text that is added to an image. In this paper, we introduce a multi-image setting for studying typographic attacks, broadening the current emphasis of the literature on attacking individual images. Specifically, our focus is on attacking image sets without repeating the attack query. Such non-repeating attacks are stealthier, as they are more likely to evade a gatekeeper than attacks that repeat the same attack text. We introduce two attack strategies for the multi-image setting, leveraging the difficulty of the target image, the strength of the attack text, and text-image similarity. Our text-image similarity approach improves attack success rates by 21% over random, non-specific methods on the CLIP model using ImageNet while maintaining stealth in a multi-image scenario. An additional experiment demonstrates transferability, i.e., text-image similarity calculated using CLIP transfers when attacking InstructBLIP.
Chinese: 大型视觉语言模型在多图像场景下易受隐蔽的排版攻击,采用非重复攻击文本提高了规避检测的能力,而基于文本-图像相似性的攻击策略将成功率较随机方法提升了21%,并展现出跨模型的迁移性。
English: Large Vision-Language Models are vulnerable to stealthy typographic attacks in multi-image scenarios, where non-repeating attack texts enhance evasion and a text-image similarity strategy boosts success rates by 21% over random methods while demonstrating transferability across models.

Authors:Yunjiang Xu, Lingzhi Li, Jin Wang, Benyuan Yang, Zhiwen Wu, Xinhong Chen, Jianping Wang
Title: CoDynTrust: Robust Asynchronous Collaborative Perception via Dynamic Feature Trust Modulus
Abstract:
Collaborative perception, fusing information from multiple agents, can extend perception range so as to improve perception performance. However, temporal asynchrony in real-world environments, caused by communication delays, clock misalignment, or sampling configuration differences, can lead to information mismatches. If this is not well handled, then the collaborative performance is patchy, and what's worse safety accidents may occur. To tackle this challenge, we propose CoDynTrust, an uncertainty-encoded asynchronous fusion perception framework that is robust to the information mismatches caused by temporal asynchrony. CoDynTrust generates dynamic feature trust modulus (DFTM) for each region of interest by modeling aleatoric and epistemic uncertainty as well as selectively suppressing or retaining single-vehicle features, thereby mitigating information mismatches. We then design a multi-scale fusion module to handle multi-scale feature maps processed by DFTM. Compared to existing works that also consider asynchronous collaborative perception, CoDynTrust combats various low-quality information in temporally asynchronous scenarios and allows uncertainty to be propagated to downstream tasks such as planning and control. Experimental results demonstrate that CoDynTrust significantly reduces performance degradation caused by temporal asynchrony across multiple datasets, achieving state-of-the-art detection performance even with temporal asynchrony. The code is available at https://github.com/CrazyShout/CoDynTrust.
Chinese: 协作感知通过融合多智能体信息扩展感知范围并提升性能,但现实环境中的时间异步性会导致信息不匹配;为解决此问题,提出的CoDynTrust框架利用不确定性建模和动态特征信任机制来缓解这些影响,从而提高检测性能。
English: Collaborative perception, which fuses data from multiple agents, can enhance perception range and performance but is often compromised by temporal asynchrony, leading to mismatches; to address this, the proposed CoDynTrust framework uses uncertainty modeling and dynamic feature trust to mitigate these issues and improve detection performance.

Authors:Zhiming Ma, Xiayang Xiao, Sihao Dong, Peidong Wang, HaiPeng Wang, Qingyun Pan
Title: SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation
Abstract:
As a powerful all-weather Earth observation tool, synthetic aperture radar (SAR) remote sensing enables critical military reconnaissance, maritime surveillance, and infrastructure monitoring. Although Vision language models (VLMs) have made remarkable progress in natural language processing and image understanding, their applications remain limited in professional domains due to insufficient domain expertise. This paper innovatively proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which contains approximately 2 million high-quality image-text pairs, encompasses diverse scenarios with detailed target annotations. This dataset not only supports several key tasks such as visual understanding and object detection tasks, but also has unique innovative aspects: this study develop a visual-language dataset and benchmark for the SAR domain, enabling and evaluating VLMs' capabilities in SAR image interpretation, which provides a paradigmatic framework for constructing multimodal datasets across various remote sensing vertical domains. Through experiments on 16 mainstream VLMs, the effectiveness of the dataset has been fully verified. The project will be released at https://github.com/JimmyMa99/SARChat.
中文摘要:本文创新性地提出了首个面向SAR图像的大规模多模态对话数据集SARChat-2M,包含约200万高质量图文对,为评估和提升视觉语言模型在专业遥感领域的解析能力提供了基准框架。
English Summary: This paper introduces SARChat-2M, the first large-scale multimodal dialogue dataset for SAR images containing 2 million image-text pairs, which enables and evaluates vision-language models' capabilities in SAR interpretation across diverse scenarios.

Authors:Tingyi Cai, Yunliang Jiang, Yixin Liu, Ming Li, Changqin Huang, Shirui Pan
Title: Out-of-Distribution Detection on Graphs: A Survey
Abstract:
Graph machine learning has witnessed rapid growth, driving advancements across diverse domains. However, the in-distribution assumption, where training and testing data share the same distribution, often breaks in real-world scenarios, leading to degraded model performance under distribution shifts. This challenge has catalyzed interest in graph out-of-distribution (GOOD) detection, which focuses on identifying graph data that deviates from the distribution seen during training, thereby enhancing model robustness. In this paper, we provide a rigorous definition of GOOD detection and systematically categorize existing methods into four types: enhancement-based, reconstruction-based, information propagation-based, and classification-based approaches. We analyze the principles and mechanisms of each approach and clarify the distinctions between GOOD detection and related fields, such as graph anomaly detection, outlier detection, and GOOD generalization. Beyond methodology, we discuss practical applications and theoretical foundations, highlighting the unique challenges posed by graph data. Finally, we discuss the primary challenges and propose future directions to advance this emerging field. The repository of this survey is available at https://github.com/ca1man-2022/Awesome-GOOD-Detection.
Graph out-of-distribution (GOOD) detection addresses performance degradation caused by distribution shifts in graph machine learning by systematically categorizing methods into four approaches and distinguishing it from related fields while outlining future research directions.
English Summary:

Authors:Wooseong Yang, Hyesu Jang, Ayoung Kim
Title: Ground-Optimized 4D Radar-Inertial Odometry via Continuous Velocity Integration using Gaussian Process
Abstract:
Radar ensures robust sensing capabilities in adverse weather conditions, yet challenges remain due to its high inherent noise level. Existing radar odometry has overcome these challenges with strategies such as filtering spurious points, exploiting Doppler velocity, or integrating with inertial measurements. This paper presents two novel improvements beyond the existing radar-inertial odometry: ground-optimized noise filtering and continuous velocity preintegration. Despite the widespread use of ground planes in LiDAR odometry, imprecise ground point distributions of radar measurements cause naive plane fitting to fail. Unlike plane fitting in LiDAR, we introduce a zone-based uncertainty-aware ground modeling specifically designed for radar. Secondly, we note that radar velocity measurements can be better combined with IMU for a more accurate preintegration in radar-inertial odometry. Existing methods often ignore temporal discrepancies between radar and IMU by simplifying the complexities of asynchronous data streams with discretized propagation models. Tackling this issue, we leverage GP and formulate a continuous preintegration method for tightly integrating 3-DOF linear velocity with IMU, facilitating full 6-DOF motion directly from the raw measurements. Our approach demonstrates remarkable performance (less than 1% vertical drift) in public datasets with meticulous conditions, illustrating substantial improvement in elevation accuracy. The code will be released as open source for the community: https://github.com/wooseongY/Go-RIO.
Chinese: 本文提出了雷达惯性里程计的两项创新改进:基于区域的不确定性感知地面建模以优化噪声过滤,以及利用高斯过程的连续速度预积分方法,在恶劣条件下实现了低于1%的垂直漂移。
English: This paper introduces two novel enhancements to radar-inertial odometry—a zone-based uncertainty-aware ground modeling for improved noise filtering and a continuous velocity preintegration method using Gaussian processes—achieving less than 1% vertical drift in challenging conditions.

Authors:Mingyu Xing, Lechao Cheng, Shengeng Tang, Yaxiong Wang, Zhun Zhong, Meng Wang
Title: Knowledge Swapping via Learning and Unlearning
Abstract:
We introduce \textbf{Knowledge Swapping}, a novel task designed to selectively regulate knowledge of a pretrained model by enabling the forgetting of user\-specified information, retaining essential knowledge, and acquiring new knowledge simultaneously. By delving into the analysis of knock-on feature hierarchy, we find that incremental learning typically progresses from low\-level representations to higher\-level semantics, whereas forgetting tends to occur in the opposite direction\-starting from high-level semantics and moving down to low-level features. Building upon this, we propose to benchmark the knowledge swapping task with the strategy of \textit{Learning Before Forgetting}. Comprehensive experiments on various tasks like image classification, object detection, and semantic segmentation validate the effectiveness of the proposed strategy. The source code is available at \href{https://github.com/xingmingyu123456/KnowledgeSwapping}{https://github.com/xingmingyu123456/KnowledgeSwapping}.
中文: 该研究提出了知识交换任务,通过“先学习后遗忘”策略,使预训练模型能够同时遗忘特定信息、保留关键知识并学习新内容,并在多项任务中验证了其有效性。
English: The study introduces Knowledge Swapping, a task that enables pretrained models to forget specific information, retain essential knowledge, and acquire new knowledge simultaneously, using a "Learning Before Forgetting" strategy validated across multiple tasks.

Authors:Yunhang He, Cong Xu, Jun Wang, Wei Zhang
Title: Collaborative Filtering Meets Spectrum Shift: Connecting User-Item Interaction with Graph-Structured Side Information
Abstract:
Graph Neural Networks (GNNs) have demonstrated their superiority in collaborative filtering, where the user-item (U-I) interaction bipartite graph serves as the fundamental data format. However, when graph-structured side information (e.g., multimodal similarity graphs or social networks) is integrated into the U-I bipartite graph, existing graph collaborative filtering methods fall short of achieving satisfactory performance. We quantitatively analyze this problem from a spectral perspective. Recall that a bipartite graph possesses a full spectrum within the range of [-1, 1], with the highest frequency exactly achievable at -1 and the lowest frequency at 1; however, we observe as more side information is incorporated, the highest frequency of the augmented adjacency matrix progressively shifts rightward. This spectrum shift phenomenon has caused previous approaches built for the full spectrum [-1, 1] to assign mismatched importance to different frequencies. To this end, we propose Spectrum Shift Correction (dubbed SSC), incorporating shifting and scaling factors to enable spectral GNNs to adapt to the shifted spectrum. Unlike previous paradigms of leveraging side information, which necessitate tailored designs for diverse data types, SSC directly connects traditional graph collaborative filtering with any graph-structured side information. Experiments on social and multimodal recommendation demonstrate the effectiveness of SSC, achieving relative improvements of up to 23% without incurring any additional computational overhead. Our code is available at https://github.com/yhhe2004/SSC-KDD.
中文摘要:图神经网络在整合图结构侧信息到用户-项目二分图时因频谱偏移问题表现不佳,为此提出的频谱偏移校正方法(SSC)通过调整频谱参数使图神经网络适应偏移后的频谱,在社交和多模态推荐中实现最高23%的性能提升且无需额外计算开销。
English Summary: Graph Neural Networks struggle with integrating graph-structured side information into user-item bipartite graphs due to spectral shifts, prompting the development of Spectrum Shift Correction (SSC) to adapt spectral GNNs and achieve up to 23% performance improvement without extra computational cost.

Authors:Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, Mike Zheng Shou
Title: WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point
Abstract:
GUI agents have achieved outstanding performance in GUI element grounding. However, planning remains highly challenging, especially due to the sensitivity to the initial state of the environment. Specifically, slight differences in the initial state-such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in real application scenarios, but existing benchmarks fail to evaluate it. To address this gap, we introduce WorldGUI, a comprehensive GUI benchmark containing tasks across ten widely used desktop and web applications (e.g., PowerPoint, VSCode, Acrobat), each instantiated with diverse initial states to simulate authentic human-computer interactions. Complementing this, we propose WorldGUI-Agent, a universal framework that unifies three core modules: Planner-Critic for high-level plan refinement, Step-Check for intermediate verification, and Actor-Critic for action-level optimization to proactively detect and correct errors. Experimental evaluation shows that WorldGUI-Agent outperforms the outstanding existing model (Claude-3.5 Computer Use) by 12.4% in success rate on WorldGUI, and achieves a 31.2% overall success rate on WindowsAgentArena, surpassing the prior state-of-the-art by 11.7%. Our analysis further reveals that dynamic augmentation tasks and desktop environments pose substantial hurdles, underscoring the necessity of adaptive planning and feedback-driven execution for advancing real-world GUI automation. The code and data are available at https://github.com/showlab/WorldGUI.
中文: WorldGUI通过包含多样化初始状态的基准和WorldGUI-Agent框架,解决了GUI代理因环境初始状态差异导致的规划难题,显著提升了任务成功率。
English: WorldGUI addresses GUI agents' planning challenges caused by varying initial states through a comprehensive benchmark and the WorldGUI-Agent framework, which improves success rates significantly over existing models.

Authors:Kristofer Grover Roos, Atsushi Fukuda, Quan Huu Cap
Title: From Brainwaves to Brain Scans: A Robust Neural Network for EEG-to-fMRI Synthesis
Abstract:
While functional magnetic resonance imaging (fMRI) offers valuable insights into brain activity, it is limited by high operational costs and significant infrastructural demands. In contrast, electroencephalography (EEG) provides millisecond-level precision in capturing electrical activity but lacks the spatial fidelity necessary for precise neural localization. To bridge these gaps, we propose E2fNet, a simple yet effective deep learning model for synthesizing fMRI images from low-cost EEG data. E2fNet is an encoder-decoder network specifically designed to capture and translate meaningful multi-scale features from EEG across electrode channels into accurate fMRI representations. Extensive evaluations across three public datasets demonstrate that E2fNet consistently outperforms existing CNN- and transformer-based methods, achieving state-of-the-art results in terms of the structural similarity index measure (SSIM). These results demonstrate that E2fNet is a promising, cost-effective solution for enhancing neuroimaging capabilities. The code is available at https://github.com/kgr20/E2fNet.
中文: E2fNet是一种创新的深度学习模型,能够通过经济高效的脑电图数据合成高质量的功能磁共振成像图像,在多个数据集上均展现出超越现有方法的卓越性能。
English: E2fNet is a novel deep learning model that synthesizes high-quality fMRI images from cost-effective EEG data, achieving superior performance over existing methods across multiple datasets.

Authors:Víctor Gallego
Title: MetaSC: Test-Time Safety Specification Optimization for Language Models
Abstract:
We propose a novel dynamic safety framework that optimizes language model (LM) safety reasoning at inference time without modifying model weights. Building on recent advances in self-critique methods, our approach leverages a meta-critique mechanism that iteratively updates safety prompts-termed specifications-to drive the critique and revision process adaptively. This test-time optimization not only improves performance against adversarial jailbreak requests but also in diverse general safety-related tasks, such as avoiding moral harm or pursuing honest responses. Our empirical evaluations across several language models demonstrate that dynamically optimized safety prompts yield significantly higher safety scores compared to fixed system prompts and static self-critique defenses. Code released at https://github.com/vicgalle/meta-self-critique.git .
中文: 本文提出了一种新颖的动态安全框架,通过元批判机制在推理时迭代优化安全提示,无需修改模型权重即可显著提升语言模型对抗恶意攻击和通用安全任务的表现。
English: This paper introduces a dynamic safety framework that enhances language model safety through a meta-critique mechanism, which iteratively refines safety prompts during inference to improve performance against adversarial attacks and general safety tasks without altering model weights.

Authors:Xiaofei Wang, Hanyu Liu, Yupei Zhang, Boyang Zhao, Hao Duan, Wanming Hu, Yonggao Mou, Stephen Price, Chao Li
Title: Joint Modelling Histology and Molecular Markers for Cancer Classification
Abstract:
Cancers are characterized by remarkable heterogeneity and diverse prognosis. Accurate cancer classification is essential for patient stratification and clinical decision-making. Although digital pathology has been advancing cancer diagnosis and prognosis, the paradigm in cancer pathology has shifted from purely relying on histology features to incorporating molecular markers. There is an urgent need for digital pathology methods to meet the needs of the new paradigm. We introduce a novel digital pathology approach to jointly predict molecular markers and histology features and model their interactions for cancer classification. Firstly, to mitigate the challenge of cross-magnification information propagation, we propose a multi-scale disentangling module, enabling the extraction of multi-scale features from high-magnification (cellular-level) to low-magnification (tissue-level) whole slide images. Further, based on the multi-scale features, we propose an attention-based hierarchical multi-task multi-instance learning framework to simultaneously predict histology and molecular markers. Moreover, we propose a co-occurrence probability-based label correlation graph network to model the co-occurrence of molecular markers. Lastly, we design a cross-modal interaction module with the dynamic confidence constrain loss and a cross-modal gradient modulation strategy, to model the interactions of histology and molecular markers. Our experiments demonstrate that our method outperforms other state-of-the-art methods in classifying glioma, histology features and molecular markers. Our method promises to promote precise oncology with the potential to advance biomedical research and clinical applications. The code is available at https://github.com/LHY1007/M3C2
中文: 本研究提出一种新型数字病理方法,通过多尺度特征提取与跨模态交互技术联合预测组织学特征和分子标记,在胶质瘤分类中表现优异,有望推动精准肿瘤学发展。
English: This study introduces a novel digital pathology method that integrates multi-scale feature extraction and cross-modal interactions to jointly predict histology features and molecular markers, demonstrating superior performance in glioma classification and advancing precision oncology.

Authors:Zach Nussbaum, Brandon Duderstadt
Title: Training Sparse Mixture Of Experts Text Embedding Models
Abstract:
Transformer-based text embedding models have improved their performance on benchmarks like MIRACL and BEIR by increasing their parameter counts. However, this scaling approach introduces significant deployment challenges, including increased inference latency and memory usage. These challenges are particularly severe in retrieval-augmented generation (RAG) applications, where large models' increased memory requirements constrain dataset ingestion capacity, and their higher latency directly impacts query-time performance. While causal language models have addressed similar efficiency challenges using Mixture of Experts (MoE) architectures, this approach hasn't been successfully adapted to the general text embedding setting. In this paper, we introduce Nomic Embed v2, the first general purpose MoE text embedding model. Our model outperforms models in the same parameter class on both monolingual and multilingual benchmarks while also maintaining competitive performance with models twice its size. We open-source all code, models, and evaluation data to ensure full reproducibility of our training pipeline at \href{https://github.com/nomic-ai/contrastors}{https://github.com/nomic-ai/contrastors}.
中文: 基于Transformer的文本嵌入模型因规模扩大面临部署难题,而新型Nomic Embed v2作为首个MoE架构模型,性能超越同类并媲美更大模型,同时开源确保可复现性。
English: Transformer-based text embedding models face deployment challenges due to scaling, but the new Nomic Embed v2, the first MoE-based model, outperforms peers and matches larger models while being open-sourced for reproducibility.

Authors:Anthony D. Blaom, Samuel Okon
Title: New tools for comparing classical and neural ODE models for tumor growth
Abstract:
A new computational tool TumorGrowth$.$jl for modeling tumor growth is introduced. The tool allows the comparison of standard textbook models, such as General Bertalanffy and Gompertz, with some newer models, including, for the first time, neural ODE models. As an application, we revisit a human meta-study of non-small cell lung cancer and bladder cancer lesions, in patients undergoing two different treatment options, to determine if previously reported performance differences are statistically significant, and if newer, more complex models perform any better. In a population of examples with at least four time-volume measurements available for calibration, and an average of about 6.3, our main conclusion is that the General Bertalanffy model has superior performance, on average. However, where more measurements are available, we argue that more complex models, capable of capturing rebound and relapse behavior, may be better choices.
中文: 新推出的计算工具TumorGrowth.jl用于模拟肿瘤生长,能够比较通用Bertalanffy和Gompertz等标准模型与新型神经ODE模型;研究发现,在数据有限的情况下通用Bertalanffy模型平均表现更优,但当测量数据充足时,能捕捉复发行为的复杂模型可能是更好选择。
English: A new computational tool, TumorGrowth.jl, is introduced for modeling tumor growth, enabling comparison of standard models like General Bertalanffy and Gompertz with newer neural ODE models, with findings showing General Bertalanffy's superior average performance in cases with limited data but suggesting complex models may be preferable when more measurements are available to capture rebound and relapse behavior.

Authors:Ashkan Shahbazi, Elaheh Akbari, Darian Salehi, Xinran Liu, Navid Naderializadeh, Soheil Kolouri
Title: ESPFormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans
Abstract:
While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly-stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces doubly stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models. Experiments across multiple benchmark datasets, including image classification, point cloud classification, sentiment analysis, and neural machine translation, demonstrate that our enhanced attention regularization consistently improves performance across diverse applications. Our implementation code can be found at https://github.com/dariansal/ESPFormer.
Chinese: 本文提出了一种基于切片最优传输的新型全并行双随机注意力机制,无需迭代Sinkhorn归一化即可提高效率并保持可微性,在多种基准任务中均实现了性能提升。
English: This paper introduces a novel, fully parallelizable doubly-stochastic attention mechanism using sliced optimal transport, which eliminates the need for iterative Sinkhorn normalization and improves efficiency while maintaining differentiability, leading to enhanced performance across various benchmark tasks.

Authors:Fanxu Meng, Pingzhi Tang, Xiaojuan Tang, Zengwei Yao, Xing Sun, Muhan Zhang
Title: TransMLA: Multi-Head Latent Attention Is All You Need
Abstract:
In this paper, we present TransMLA, a framework that seamlessly converts any GQA-based pre-trained model into an MLA-based model. Our approach enables direct compatibility with DeepSeek's codebase, allowing these models to fully leverage DeepSeek-specific optimizations such as vLLM and SGlang. By compressing 93% of the KV cache in LLaMA-2-7B, TransMLA achieves a 10.6x inference speedup at an 8K context length while preserving meaningful output quality. Additionally, the model requires only 6 billion tokens for fine-tuning to regain performance on par with the original across multiple benchmarks. TransMLA offers a practical solution for migrating GQA-based models to the MLA structure. When combined with DeepSeek's advanced features, such as FP8 quantization and Multi-Token Prediction, even greater inference acceleration can be realized.
中文:TransMLA框架可将GQA预训练模型无缝转换为MLA架构,在压缩93% KV缓存的同时实现10.6倍推理加速,并通过少量微调即可恢复原有性能表现。
English: TransMLA enables seamless conversion of GQA-based models to MLA architecture, achieving a 10.6x inference speedup with 93% KV cache compression while maintaining performance parity after minimal fine-tuning.

Authors:Jiyoon Kim, Kang Eun Jeon, Yulhwa Kim, Jong Hwan Ko
Title: Column-wise Quantization of Weights and Partial Sums for Accurate and Efficient Compute-In-Memory Accelerators
Abstract:
Compute-in-memory (CIM) is an efficient method for implementing deep neural networks (DNNs) but suffers from substantial overhead from analog-to-digital converters (ADCs), especially as ADC precision increases. Low-precision ADCs can reduce this overhead but introduce partial-sum quantization errors degrading accuracy. Additionally, low-bit weight constraints, imposed by cell limitations and the need for multiple cells for higher-bit weights, present further challenges. While fine-grained partial-sum quantization has been studied to lower ADC resolution effectively, weight granularity, which limits overall partial-sum quantized accuracy, remains underexplored. This work addresses these challenges by aligning weight and partial-sum quantization granularities at the column-wise level. Our method improves accuracy while maintaining dequantization overhead, simplifies training by removing two-stage processes, and ensures robustness to memory cell variations via independent column-wise scale factors. We also propose an open-source CIM-oriented convolution framework to handle fine-grained weights and partial-sums efficiently, incorporating a novel tiling method and group convolution. Experimental results on ResNet-20 (CIFAR-10, CIFAR-100) and ResNet-18 (ImageNet) show accuracy improvements of 0.99%, 2.69%, and 1.01%, respectively, compared to the best-performing related works. Additionally, variation analysis reveals the robustness of our method against memory cell variations. These findings highlight the effectiveness of our quantization scheme in enhancing accuracy and robustness while maintaining hardware efficiency in CIM-based DNN implementations. Our code is available at https://github.com/jiyoonkm/ColumnQuant.
中文摘要:本研究提出一种列级量化方法,通过统一权重和部分和的量化粒度,在保持硬件效率的同时,有效提升了存内计算深度神经网络的精度与鲁棒性。
English Summary: This study introduces a column-wise quantization method that aligns weight and partial-sum granularities to enhance accuracy and robustness in compute-in-memory DNNs while maintaining hardware efficiency.

Authors:Mark Schöne, Babak Rahmani, Heiner Kremer, Fabian Falck, Hitesh Ballani, Jannes Gladrow
Title: Implicit Language Models are RNNs: Balancing Parallelization and Expressivity
Abstract:
State-space models (SSMs) and transformers dominate the language modeling landscape. However, they are constrained to a lower computational complexity than classical recurrent neural networks (RNNs), limiting their expressivity. In contrast, RNNs lack parallelization during training, raising fundamental questions about the trade off between parallelization and expressivity. We propose implicit SSMs, which iterate a transformation until convergence to a fixed point. Theoretically, we show that implicit SSMs implement the non-linear state-transitions of RNNs. Empirically, we find that only approximate fixed-point convergence suffices, enabling the design of a scalable training curriculum that largely retains parallelization, with full convergence required only for a small subset of tokens. Our approach demonstrates superior state-tracking capabilities on regular languages, surpassing transformers and SSMs. We further scale implicit SSMs to natural language reasoning tasks and pretraining of large-scale language models up to 1.3B parameters on 207B tokens representing, to our knowledge, the largest implicit model trained to date. Notably, our implicit models outperform their explicit counterparts on standard benchmarks. Our code is publicly available at http://github.com/microsoft/implicit_languagemodels .
中文摘要:隐式状态空间模型通过近似定点收敛实现了类似RNN的非线性状态转换,在保持并行化训练的同时解决了传统模型表达能力受限的问题,在合成语言和自然语言任务上均展现出卓越性能。
English Summary: Implicit state-space models overcome the expressivity limitations of transformers and SSMs by implementing RNN-like non-linear transitions while maintaining parallelization through approximate fixed-point convergence, achieving superior performance on both synthetic and natural language tasks.

Authors:Ao Liang, Haiyang Hua, Jian Fang, Wenyu Chen, Huaici Zhao
Title: PDM-SSD: Single-Stage Three-Dimensional Object Detector With Point Dilation
Abstract:
Current Point-based detectors can only learn from the provided points, with limited receptive fields and insufficient global learning capabilities for such targets. In this paper, we present a novel Point Dilation Mechanism for single-stage 3D detection (PDM-SSD) that takes advantage of these two representations. Specifically, we first use a PointNet-style 3D backbone for efficient feature encoding. Then, a neck with Point Dilation Mechanism (PDM) is used to expand the feature space, which involves two key steps: point dilation and feature filling. The former expands points to a certain size grid centered around the sampled points in Euclidean space. The latter fills the unoccupied grid with feature for backpropagation using spherical harmonic coefficients and Gaussian density function in terms of direction and scale. Next, we associate multiple dilation centers and fuse coefficients to obtain sparse grid features through height compression. Finally, we design a hybrid detection head for joint learning, where on one hand, the scene heatmap is predicted to complement the voting point set for improved detection accuracy, and on the other hand, the target probability of detected boxes are calibrated through feature fusion. On the challenging Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset, PDM-SSD achieves state-of-the-art results for multi-class detection among single-modal methods with an inference speed of 68 frames. We also demonstrate the advantages of PDM-SSD in detecting sparse and incomplete objects through numerous object-level instances. Additionally, PDM can serve as an auxiliary network to establish a connection between sampling points and object centers, thereby improving the accuracy of the model without sacrificing inference speed. Our code will be available at https://github.com/AlanLiangC/PDM-SSD.git.
中文: 本文提出了一种用于单阶段3D检测的点扩展机制,通过点扩展和特征填充扩大特征空间,在KITTI数据集上实现了最先进的多类别检测效果,同时保持了较高的推理速度。
English: This paper introduces a Point Dilation Mechanism (PDM) for single-stage 3D detection that expands feature space through point dilation and feature filling, achieving state-of-the-art multi-class detection results on the KITTI dataset while maintaining high inference speed.

Authors:Dongsu Song, Daehwa Ko, Jay Hoon Jung
Title: Amnesia as a Catalyst for Enhancing Black Box Pixel Attacks in Image Classification and Object Detection
Abstract:
It is well known that query-based attacks tend to have relatively higher success rates in adversarial black-box attacks. While research on black-box attacks is actively being conducted, relatively few studies have focused on pixel attacks that target only a limited number of pixels. In image classification, query-based pixel attacks often rely on patches, which heavily depend on randomness and neglect the fact that scattered pixels are more suitable for adversarial attacks. Moreover, to the best of our knowledge, query-based pixel attacks have not been explored in the field of object detection. To address these issues, we propose a novel pixel-based black-box attack called Remember and Forget Pixel Attack using Reinforcement Learning(RFPAR), consisting of two main components: the Remember and Forget processes. RFPAR mitigates randomness and avoids patch dependency by leveraging rewards generated through a one-step RL algorithm to perturb pixels. RFPAR effectively creates perturbed images that minimize the confidence scores while adhering to limited pixel constraints. Furthermore, we advance our proposed attack beyond image classification to object detection, where RFPAR reduces the confidence scores of detected objects to avoid detection. Experiments on the ImageNet-1K dataset for classification show that RFPAR outperformed state-of-the-art query-based pixel attacks. For object detection, using the MSCOCO dataset with YOLOv8 and DDQ, RFPAR demonstrates comparable mAP reduction to state-of-the-art query-based attack while requiring fewer query. Further experiments on the Argoverse dataset using YOLOv8 confirm that RFPAR effectively removed objects on a larger scale dataset. Our code is available at https://github.com/KAU-QuantumAILab/RFPAR.
中文: 本文提出RFPAR这一基于强化学习的像素攻击方法,通过记忆-遗忘机制降低黑盒攻击中的随机性和补丁依赖性,在图像分类中表现优异,并在目标检测任务中以更少查询实现有效目标消除。
English: This paper introduces RFPAR, a reinforcement learning-based pixel attack method that reduces randomness and patch dependency in black-box attacks, demonstrating superior performance in image classification and effective object removal in detection tasks with fewer queries.

Authors:Le-Anh Tran
Title: Unpaired Image Dehazing via Kolmogorov-Arnold Transformation of Latent Features
Abstract:
This paper proposes an innovative framework for Unsupervised Image Dehazing via Kolmogorov-Arnold Transformation, termed UID-KAT. Image dehazing is recognized as a challenging and ill-posed vision task that requires complex transformations and interpretations in the feature space. Recent advancements have introduced Kolmogorov-Arnold Networks (KANs), inspired by the Kolmogorov-Arnold representation theorem, as promising alternatives to Multi-Layer Perceptrons (MLPs) since KANs can leverage their polynomial foundation to more efficiently approximate complex functions while requiring fewer layers than MLPs. Motivated by this potential, this paper explores the use of KANs combined with adversarial training and contrastive learning to model the intricate relationship between hazy and clear images. Adversarial training is employed due to its capacity in producing high-fidelity images, and contrastive learning promotes the model's emphasis on significant features while suppressing the influence of irrelevant information. The proposed UID-KAT framework is trained in an unsupervised setting to take advantage of the abundance of real-world data and address the challenge of preparing paired hazy/clean images. Experimental results show that UID-KAT achieves state-of-the-art dehazing performance across multiple datasets and scenarios, outperforming existing unpaired methods while reducing model complexity. The source code for this work is publicly available at https://github.com/tranleanh/uid-kat.
中文: 本文提出UID-KAT无监督图像去雾框架,通过结合Kolmogorov-Arnold网络与对抗训练和对比学习,有效建模有雾与清晰图像间的复杂映射关系,在降低模型复杂度的同时实现了最先进的去雾性能。
English: This paper introduces UID-KAT, an unsupervised image dehazing framework that utilizes Kolmogorov-Arnold Networks with adversarial and contrastive learning to effectively model the complex mapping between hazy and clear images, achieving superior performance with reduced model complexity.

Authors:Ivan Lopes, Valentin Deschaintre, Yannick Hold-Geoffroy, Raoul de Charette
Title: MatSwap: Light-aware material transfers in images
Abstract:
We present MatSwap, a method to transfer materials to designated surfaces in an image photorealistically. Such a task is non-trivial due to the large entanglement of material appearance, geometry, and lighting in a photograph. In the literature, material editing methods typically rely on either cumbersome text engineering or extensive manual annotations requiring artist knowledge and 3D scene properties that are impractical to obtain. In contrast, we propose to directly learn the relationship between the input material -- as observed on a flat surface -- and its appearance within the scene, without the need for explicit UV mapping. To achieve this, we rely on a custom light- and geometry-aware diffusion model. We fine-tune a large-scale pre-trained text-to-image model for material transfer using our synthetic dataset, preserving its strong priors to ensure effective generalization to real images. As a result, our method seamlessly integrates a desired material into the target location in the photograph while retaining the identity of the scene. We evaluate our method on synthetic and real images and show that it compares favorably to recent work both qualitatively and quantitatively. We release our code and data on https://github.com/astra-vision/MatSwap
中文: MatSwap是一种通过定制扩散模型直接学习材质外观关系的真实感材质替换方法,无需三维场景属性即可将指定材质无缝融入目标图像。
English: MatSwap is a photorealistic material transfer method that uses a custom diffusion model to learn material appearance relationships without needing 3D scene properties, enabling seamless integration of materials into target images.

Authors:Leyang Hu, Matteo Gamba, Randall Balestriero
Title: Curvature Tuning: Provable Training-free Model Steering From a Single Parameter
Abstract:
The scaling of model and data sizes has reshaped the AI landscape, establishing finetuning pretrained models as the standard paradigm for solving downstream tasks. However, dominant finetuning methods typically rely on weight adaptation, often lack interpretability, and depend on heuristically chosen hyperparameters. In this paper, we take a different perspective and shift the focus from weights to activation functions, viewing them through the lens of spline operators. We propose Curvature Tuning (CT), an interpretable and principled steering method that modulates a model's decision boundary by injecting a single hyperparameter into its activation functions. We show that CT provably adjusts model decision boundary curvature and, more fundamentally, projects a model onto a space of smooth functions-thereby complementing current finetuning methods, whose effect lies primarily in feature adaptation. Making this hyperparameter trainable gives rise to a novel and highly parameter-efficient finetuning method. Empirically, CT improves both generalization and robustness. For example, it boosts downstream accuracy of ResNet-50/152 by 7.14%/8.46% over linear probing and 4.64%/1.70% over LoRA across 12 datasets, and improves robust accuracy on the $\ell_\infty$ benchmark from RobustBench by 1032.64%/1494.46%. Our code is available at https://github.com/Leon-Leyang/curvature-tuning.
Chinese: 本文提出曲率调优(CT)方法,通过向激活函数引入单一超参数来调整模型决策边界,在多个数据集上显著提升了模型的泛化能力和鲁棒性。
English: This paper introduces Curvature Tuning (CT), an interpretable method that adjusts a model's decision boundary by modifying activation functions with a single hyperparameter, enhancing both generalization and robustness across multiple datasets.

Authors:Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, Dan Alistarh
Title: DarwinLM: Evolutionary Structured Pruning of Large Language Models
Abstract:
Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements, regardless of the hardware environment. Meanwhile, different components of the model exhibit varying sensitivities towards pruning, calling for non-uniform model compression. However, a pruning method should not only identify a capable substructure, but also account for post-compression training. To this end, we propose DarwinLM, a method for training-aware structured pruning. DarwinLM builds upon an evolutionary search process, generating multiple offspring models in each generation through mutation, and selecting the fittest for survival. To assess the effect of post-training, we incorporate a lightweight, multistep training process within the offspring population, progressively increasing the number of tokens and eliminating poorly performing models in each selection stage. We validate our method through extensive experiments on Llama-2-7B, Llama-3.1-8B and Qwen-2.5-14B-Instruct, achieving state-of-the-art performance for structured pruning. For instance, DarwinLM surpasses ShearedLlama while requiring 5x less training data during post-compression training. Code is at: https://github.com/IST-DASLab/DarwinLM
Chinese: DarwinLM是一种训练感知的结构化剪枝方法,通过进化搜索和多阶段训练,在降低计算成本的同时高效压缩大语言模型并保持优异性能。
English: DarwinLM is a training-aware structured pruning method that uses evolutionary search and multistep training to efficiently compress large language models while maintaining high performance with reduced computational costs.

Authors:Anshul Nasery, Jonathan Hayase, Creston Brooks, Peiyao Sheng, Himanshu Tyagi, Pramod Viswanath, Sewoong Oh
Title: Scalable Fingerprinting of Large Language Models
Abstract:
Model fingerprinting has emerged as a powerful tool for model owners to identify their shared model given API access. However, to lower false discovery rate, fight fingerprint leakage, and defend against coalitions of model users attempting to bypass detection, we argue that {\em scalability} is critical, i.e., scaling up the number of fingerprints one can embed into a model. Hence, we pose scalability as a crucial requirement for fingerprinting schemes. We experiment with fingerprint design at a scale significantly larger than previously considered, and introduce a new method, dubbed Perinucleus sampling, to generate scalable, persistent, and harmless fingerprints. We demonstrate that this scheme can add 24,576 fingerprints to a Llama-3.1-8B model -- two orders of magnitude more than existing schemes -- without degrading the model's utility. Our inserted fingerprints persist even after supervised fine-tuning on standard post-training data. We further address security risks for fingerprinting, and theoretically and empirically show how a scalable fingerprinting scheme like ours can mitigate these risks. Our code is available at https://github.com/SewoongLab/scalable-fingerprinting-of-llms
中文摘要:本文提出Perinucleus采样方法,实现了可扩展的模型指纹技术,能在Llama-3.1-8B模型中嵌入24,576个指纹且不损害模型性能,同时通过理论与实证研究证明了该方案对安全风险的缓解作用。
English Summary: This paper introduces Perinucleus sampling, a scalable fingerprinting method that embeds 24,576 fingerprints into Llama-3.1-8B models without compromising utility, while addressing security risks through theoretical and empirical validation.

Authors:Liang Wu, Wei Xiao, Richard D. Braatz
Title: EIQP: Execution-time-certified and Infeasibility-detecting QP Solver
Abstract:
Solving real-time quadratic programming (QP) is a ubiquitous task in control engineering, such as in model predictive control and control barrier function-based QP. In such real-time scenarios, certifying that the employed QP algorithm can either return a solution within a predefined level of optimality or detect QP infeasibility before the predefined sampling time is a pressing requirement. This article considers convex QP (including linear programming) and adopts its homogeneous formulation to achieve infeasibility detection. Exploiting this homogeneous formulation, this article proposes a novel infeasible interior-point method (IPM) algorithm with the best theoretical $O(\sqrt{n})$ iteration complexity that feasible IPM algorithms enjoy. The iteration complexity is proved to be \textit{exact} (rather than an upper bound), \textit{simple to calculate}, and \textit{data independent}, with the value $\left\lceil\frac{\log(\frac{n+1}ε)}{-\log(1-\frac{0.414213}{\sqrt{n+1}})}\right\rceil$ (where $n$ and $ε$ denote the number of constraints and the predefined optimality level, respectively), making it appealing to certify the execution time of online time-varying convex QPs. The proposed algorithm is simple to implement without requiring a line search procedure (uses the full Newton step), and its C-code implementation (offering MATLAB, Julia, and Python interfaces) and numerical examples are publicly available at https://github.com/liangwu2019/EIQP.
本文提出了一种新颖的不可行内点法,用于实时凸二次规划,实现了精确的O(√n)迭代复杂度,具备不可行性检测功能且无需线搜索过程。
This article presents a novel infeasible interior-point method for real-time convex quadratic programming that achieves exact O(√n) iteration complexity with infeasibility detection and requires no line search.

Authors:Bing Fan, Yunhe Feng, Yapeng Tian, James Chenhao Liang, Yuewei Lin, Yan Huang, Heng Fan
Title: PRVQL: Progressive Knowledge-guided Refinement for Robust Egocentric Visual Query Localization
Abstract:
Egocentric visual query localization (EgoVQL) focuses on localizing the target of interest in space and time from first-person videos, given a visual query. Despite recent progressive, existing methods often struggle to handle severe object appearance changes and cluttering background in the video due to lacking sufficient target cues, leading to degradation. Addressing this, we introduce PRVQL, a novel Progressive knowledge-guided Refinement framework for EgoVQL. The core is to continuously exploit target-relevant knowledge directly from videos and utilize it as guidance to refine both query and video features for improving target localization. Our PRVQL contains multiple processing stages. The target knowledge from one stage, comprising appearance and spatial knowledge extracted via two specially designed knowledge learning modules, are utilized as guidance to refine the query and videos features for the next stage, which are used to generate more accurate knowledge for further feature refinement. With such a progressive process, target knowledge in PRVQL can be gradually improved, which, in turn, leads to better refined query and video features for localization in the final stage. Compared to previous methods, our PRVQL, besides the given object cues, enjoys additional crucial target information from a video as guidance to refine features, and hence enhances EgoVQL in complicated scenes. In our experiments on challenging Ego4D, PRVQL achieves state-of-the-art result and largely surpasses other methods, showing its efficacy. Our code, model and results will be released at https://github.com/fb-reps/PRVQL.
中文: PRVQL提出了一种渐进式知识引导优化框架,通过从视频中迭代提取并利用目标相关知识来优化查询和视频特征,从而在复杂场景中显著提升了第一人称视觉查询定位的准确率,并在Ego4D数据集上取得了领先性能。
English: PRVQL introduces a progressive knowledge-guided refinement framework that enhances egocentric visual query localization by iteratively extracting and utilizing target-specific knowledge from videos to refine features, achieving state-of-the-art performance on complex datasets like Ego4D.

Authors:Chiyun Noh, Wooseong Yang, Minwoo Jung, Sangwoo Jung, Ayoung Kim
Title: GaRLIO: Gravity enhanced Radar-LiDAR-Inertial Odometry
Abstract:
Recently, gravity has been highlighted as a crucial constraint for state estimation to alleviate potential vertical drift. Existing online gravity estimation methods rely on pose estimation combined with IMU measurements, which is considered best practice when direct velocity measurements are unavailable. However, with radar sensors providing direct velocity data-a measurement not yet utilized for gravity estimation-we found a significant opportunity to improve gravity estimation accuracy substantially. GaRLIO, the proposed gravity-enhanced Radar-LiDAR-Inertial Odometry, can robustly predict gravity to reduce vertical drift while simultaneously enhancing state estimation performance using pointwise velocity measurements. Furthermore, GaRLIO ensures robustness in dynamic environments by utilizing radar to remove dynamic objects from LiDAR point clouds. Our method is validated through experiments in various environments prone to vertical drift, demonstrating superior performance compared to traditional LiDAR-Inertial Odometry methods. We make our source code publicly available to encourage further research and development. https://github.com/ChiyunNoh/GaRLIO
中文: GaRLIO是一种重力增强型雷达-激光雷达-惯性里程计系统,利用雷达直接速度测量显著提升重力估计精度,有效减少垂直漂移并增强状态估计性能,同时确保在动态环境中的鲁棒性。
English: GaRLIO is a gravity-enhanced Radar-LiDAR-Inertial Odometry system that utilizes direct velocity measurements from radar to significantly improve gravity estimation accuracy, reduce vertical drift, and enhance state estimation performance while ensuring robustness in dynamic environments.

Authors:Hongwei Yi, Shitong Shao, Tian Ye, Jiantong Zhao, Qingyu Yin, Michael Lingelbach, Li Yuan, Yonghong Tian, Enze Xie, Daquan Zhou
Title: Magic 1-For-1: Generating One Minute Video Clips within One Minute
Abstract:
In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that with the same optimization algorithm, the image-to-video task is indeed easier to converge over the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training the image-to-video (I2V) models from three aspects: 1) model convergence speedup by using a multi-modal prior condition injection; 2) inference latency speed up by applying an adversarial step distillation, and 3) inference memory cost optimization with parameter sparsification. With those techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending less than 1 second for generating 1 second video clips on average. We conduct a series of preliminary explorations to find out the optimal tradeoff between computational cost and video quality during diffusion step distillation and hope this could be a good foundation model for open-source explorations. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.
中文:Magic141是一种高效的视频生成模型,通过将文本到视频任务分解为文本到图像和图像到视频两个步骤,并运用多种优化技术降低计算成本和延迟,从而快速生成高质量视频。
English: Magic141 is an efficient video generation model that simplifies text-to-video creation by splitting it into text-to-image and image-to-video tasks, using optimization techniques to reduce memory and latency while maintaining quality.

Authors:Song Liu, Leyang Wang, Yakun Wang
Title: Guiding Time-Varying Generative Models with Natural Gradients on Exponential Family Manifold
Abstract:
Optimising probabilistic models is a well-studied field in statistics. However, its connection with the training of generative models remains largely under-explored. In this paper, we show that the evolution of time-varying generative models can be projected onto an exponential family manifold, naturally creating a link between the parameters of a generative model and those of a probabilistic model. We then train the generative model by moving its projection on the manifold according to the natural gradient descent scheme. This approach also allows us to efficiently approximate the natural gradient of the KL divergence without relying on MCMC for intractable models. Furthermore, we propose particle versions of the algorithm, which feature closed-form update rules for any parametric model within the exponential family. Through toy and real-world experiments, we validate the effectiveness of the proposed algorithms. The code of the proposed algorithms can be found at https://github.com/anewgithubname/iNGD.
Chinese: 本文提出一种方法,将时变生成模型的演化投影到指数族流形上,建立其与概率模型的联系,并通过自然梯度下降进行训练,无需MCMC即可高效近似KL散度。
English: This paper introduces a method that projects the evolution of time-varying generative models onto an exponential family manifold, linking them to probabilistic models and enabling training via natural gradient descent without MCMC for efficient KL divergence approximation.

Authors:Zhaoting Li, Rodrigo Pérez-Dattari, Robert Babuska, Cosimo Della Santina, Jens Kober
Title: Beyond Behavior Cloning: Robustness through Interactive Imitation and Contrastive Learning
Abstract:
Behavior cloning (BC) traditionally relies on demonstration data, assuming the demonstrated actions are optimal. This can lead to overfitting under noisy data, particularly when expressive models are used (e.g., the energy-based model in Implicit BC). To address this, we extend behavior cloning into an iterative process of optimal action estimation within the Interactive Imitation Learning framework. Specifically, we introduce Contrastive policy Learning from Interactive Corrections (CLIC). CLIC leverages human corrections to estimate a set of desired actions and optimizes the policy to select actions from this set. Extensive simulation and real-robot experiments validate CLIC's advantages over existing state-of-the-art methods, including stable training of energy-based models, robustness to feedback noise, and adaptability to diverse feedback types beyond demonstrations. Our implementation is publicly available at https://github.com/clic-webpage/CLIC.
中文摘要:本文提出CLIC方法,通过人类纠正迭代优化行为克隆策略,在仿真和真实机器人实验中展现出优于现有技术的稳定性、抗噪能力和多类型反馈适应性。
English summary: The paper introduces CLIC, an iterative behavior cloning method that uses human corrections to estimate optimal actions, demonstrating superior performance in simulations and real-robot tests with enhanced stability, noise robustness, and feedback adaptability.

Authors:Yinzhe Shen, Omer Sahin Tas, Kaiwen Wang, Royden Wagner, Christoph Stiller
Title: Divide and Merge: Motion and Semantic Learning in End-to-End Autonomous Driving
Abstract:
Perceiving the environment and its changes over time corresponds to two fundamental yet heterogeneous types of information: semantics and motion. Previous end-to-end autonomous driving works represent both types of information in a single feature vector. However, including motion related tasks, such as prediction and planning, impairs detection and tracking performance, a phenomenon known as negative transfer in multi-task learning. To address this issue, we propose Neural-Bayes motion decoding, a novel parallel detection, tracking, and prediction method that separates semantic and motion learning. Specifically, we employ a set of learned motion queries that operate in parallel with detection and tracking queries, sharing a unified set of recursively updated reference points. Moreover, we employ interactive semantic decoding to enhance information exchange in semantic tasks, promoting positive transfer. Experiments on the nuScenes dataset with UniAD and SparseDrive confirm the effectiveness of our divide and merge approach, resulting in performance improvements across perception, prediction, and planning. Our code is available at https://github.com/shenyinzhe/DMAD.
中文: 本文提出神经贝叶斯运动解码方法,通过分离语义与运动学习的并行检测、跟踪及预测机制解决自动驾驶中的负迁移问题,在nuScenes数据集上实现了感知、预测与规划任务的全面性能提升。
English: This paper introduces Neural-Bayes motion decoding, a novel parallel detection, tracking, and prediction method that separates semantic and motion learning to mitigate negative transfer in autonomous driving, achieving performance improvements across perception, prediction, and planning on the nuScenes dataset.

Authors:Arvind Pillai, Dimitris Spathis, Subigya Nepal, Amanda C Collins, Daniel M Mackin, Michael V Heinz, Tess Z Griffin, Nicholas C Jacobson, Andrew Campbell
Title: Time2Lang: Bridging Time-Series Foundation Models and Large Language Models for Health Sensing Beyond Prompting
Abstract:
Large language models (LLMs) show promise for health applications when combined with behavioral sensing data. Traditional approaches convert sensor data into text prompts, but this process is prone to errors, computationally expensive, and requires domain expertise. These challenges are particularly acute when processing extended time series data. While time series foundation models (TFMs) have recently emerged as powerful tools for learning representations from temporal data, bridging TFMs and LLMs remains challenging. Here, we present Time2Lang, a framework that directly maps TFM outputs to LLM representations without intermediate text conversion. Our approach first trains on synthetic data using periodicity prediction as a pretext task, followed by evaluation on mental health classification tasks. We validate Time2Lang on two longitudinal wearable and mobile sensing datasets: daily depression prediction using step count data (17,251 days from 256 participants) and flourishing classification based on conversation duration (46 participants over 10 weeks). Time2Lang maintains near constant inference times regardless of input length, unlike traditional prompting methods. The generated embeddings preserve essential time-series characteristics such as auto-correlation. Our results demonstrate that TFMs and LLMs can be effectively integrated while minimizing information loss and enabling performance transfer across these distinct modeling paradigms. To our knowledge, we are the first to integrate a TFM and an LLM for health, thus establishing a foundation for future research combining general-purpose large models for complex healthcare tasks.
中文摘要:Time2Lang框架通过将时序基础模型的输出直接映射到大型语言模型表示,无需文本转换即可有效整合两种模型,在心理健康分类任务中实现恒定推理时间并保持时序特征,为医疗健康应用建立了新范式。
English Summary: The Time2Lang framework effectively bridges time series foundation models and large language models by directly mapping sensor data representations without text conversion, enabling efficient mental health classification with constant inference time and preserved temporal characteristics.

Authors:Xiliang Yang, Feng Jiang, Qianen Zhang, Lei Zhao, Xiao Li
Title: DPO-Shift: Shifting the Distribution of Direct Preference Optimization
Abstract:
Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected (or dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, and this phenomenon is known as likelihood displacement. To tackle this challenge, in this work we introduce DPO-Shift to controllably shift the distribution of the chosen probability. Then, we show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.
中文: 本文提出DPO-Shift方法,通过可控地调整优选响应概率分布来解决直接偏好优化中的似然偏移问题,实验证明该方法在保持奖励边界与提升优选概率间存在权衡关系,并在下游任务中优于原始DPO算法。
English: This paper introduces DPO-Shift, a method to address the likelihood displacement issue in Direct Preference Optimization by controllably shifting the chosen response probability, demonstrating its superiority over DPO through improved performance on downstream tasks and a trade-off between chosen probability and reward margin.

Authors:Marten Lienen, Marcel Kollovieh, Stephan Günnemann
Title: Generative Modeling with Bayesian Sample Inference
Abstract:
We derive a novel generative model from iterative Gaussian posterior inference. By treating the generated sample as an unknown variable, we can formulate the sampling process in the language of Bayesian probability. Our model uses a sequence of prediction and posterior update steps to iteratively narrow down the unknown sample starting from a broad initial belief. In addition to a rigorous theoretical analysis, we establish a connection between our model and diffusion models and show that it includes Bayesian Flow Networks (BFNs) as a special case. In our experiments, we demonstrate that our model improves sample quality on ImageNet32 over both BFNs and the closely related Variational Diffusion Models, while achieving equivalent log-likelihoods on ImageNet32 and CIFAR10. Find our code at https://github.com/martenlienen/bsi.
中文: 本文提出了一种基于迭代高斯后验推断的新型生成模型,在ImageNet32上相比贝叶斯流网络和变分扩散模型提升了样本质量,同时保持了相当的似然性能。
English: This paper introduces a novel generative model based on iterative Gaussian posterior inference, which enhances sample quality on ImageNet32 compared to Bayesian Flow Networks and Variational Diffusion Models while maintaining comparable log-likelihood performance.

Authors:Cong Lu, Shengran Hu, Jeff Clune
Title: Automated Capability Discovery via Foundation Model Self-Exploration
Abstract:
Foundation models have become general-purpose assistants, exhibiting diverse capabilities across numerous domains through training on web-scale data. It remains challenging to precisely characterize even a fraction of the full spectrum of these abilities and potential risks in any new model. Existing evaluation approaches often require significant human effort, and it is taking increasing effort to design ever harder challenges for more capable models. We introduce Automated Capability Discovery (ACD), a framework that designates one foundation model as a scientist to systematically propose open-ended tasks probing the abilities of a subject model (potentially itself). By combining frontier models with ideas from the field of open-endedness, ACD automatically and systematically uncovers a diverse spectrum of surprising capabilities and failures in the subject model. We demonstrate ACD across a range of foundation models (including the GPT, Claude, and Llama series), showing that it automatically generates thousands of distinct tasks, which are then clustered to reveal dozens of broader capability areas and failure modes, that would be challenging for any single team to uncover. We further validate our method's automated scoring with extensive human surveys, observing high agreement between model-generated and human evaluations. By leveraging foundation models' ability to both create tasks and self-evaluate, ACD is a significant step toward scalable, automated evaluation of novel AI systems. All code and evaluation logs are open-sourced at https://github.com/conglu1997/ACD.
中文摘要:自动化能力发现(ACD)框架将一个基础模型作为科学家,通过开放式探索和自我评估,系统地生成和评估任务,自动揭示目标模型的广泛能力与风险。
English Summary: The Automated Capability Discovery (ACD) framework employs one foundation model as a scientist to systematically generate and evaluate tasks, automatically uncovering a wide range of capabilities and risks in subject models through open-ended exploration and self-assessment.

Authors:Fu-An Chao, Berlin Chen
Title: Towards Efficient and Multifaceted Computer-assisted Pronunciation Training Leveraging Hierarchical Selective State Space Model and Decoupled Cross-entropy Loss
Abstract:
Prior efforts in building computer-assisted pronunciation training (CAPT) systems often treat automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) as separate fronts: the former aims to provide multiple pronunciation aspect scores across diverse linguistic levels, while the latter focuses instead on pinpointing the precise phonetic pronunciation errors made by non-native language learners. However, it is generally expected that a full-fledged CAPT system should perform both functionalities simultaneously and efficiently. In response to this surging demand, we in this work first propose HMamba, a novel CAPT approach that seamlessly integrates APA and MDD tasks in parallel. In addition, we introduce a novel loss function, decoupled cross-entropy loss (deXent), specifically tailored for MDD to facilitate better-supervised learning for detecting mispronounced phones, thereby enhancing overall performance. A comprehensive set of empirical results on the speechocean762 benchmark dataset demonstrates the effectiveness of our approach on APA. Notably, our proposed approach also yields a considerable improvement in MDD performance over a strong baseline, achieving an F1-score of 63.85%. Our codes are made available at https://github.com/Fuann/hmamba
中文摘要:本文提出HMamba新型计算机辅助发音训练系统,通过并行整合发音自动评估与误读检测诊断功能,并采用新型解耦交叉熵损失函数,在基准数据集上实现了性能的显著提升。
English Summary: This paper introduces HMamba, a novel CAPT system that integrates automatic pronunciation assessment and mispronunciation detection in parallel, enhanced by a new loss function that significantly improves performance on benchmark datasets.

Authors:Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng
Title: LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid
Abstract:
Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism, limits their scalability for longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models with very-long input sequences. Compared to previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers, reorganizes the whole communication-computation workflow of LASP. In this way, only one single AllGather collective communication is needed on intermediate memory states, whose sizes are independent of the sequence length, leading to significant improvements of both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention, with a sequence length of 2048K across 64 GPUs. The Code is released as a part of: https://github.com/OpenSparseLLMs/Linear-MoE.
中文: 本文提出LASP-2这一新型序列并行方法,通过重构通信计算流程仅需单次AllGather操作,显著提升了长序列线性注意力Transformer训练的通信与计算并行效率。
English: This paper introduces LASP-2, a novel sequence parallelism method that enhances communication and computation parallelism for training linear attention transformers with long sequences by minimizing communication overhead through a redesigned workflow requiring only one AllGather operation.

Authors:Fangwen Wu, Lechao Cheng, Shengeng Tang, Xiaofeng Zhu, Chaowei Fang, Dingwen Zhang, Meng Wang
Title: Navigating Semantic Drift in Task-Agnostic Class-Incremental Learning
Abstract:
Class-incremental learning (CIL) seeks to enable a model to sequentially learn new classes while retaining knowledge of previously learned ones. Balancing flexibility and stability remains a significant challenge, particularly when the task ID is unknown. To address this, our study reveals that the gap in feature distribution between novel and existing tasks is primarily driven by differences in mean and covariance moments. Building on this insight, we propose a novel semantic drift calibration method that incorporates mean shift compensation and covariance calibration. Specifically, we calculate each class's mean by averaging its sample embeddings and estimate task shifts using weighted embedding changes based on their proximity to the previous mean, effectively capturing mean shifts for all learned classes with each new task. We also apply Mahalanobis distance constraint for covariance calibration, aligning class-specific embedding covariances between old and current networks to mitigate the covariance shift. Additionally, we integrate a feature-level self-distillation approach to enhance generalization. Comprehensive experiments on commonly used datasets demonstrate the effectiveness of our approach. The source code is available at \href{https://github.com/fwu11/MACIL.git}{https://github.com/fwu11/MACIL.git}.
中文: 本研究针对类增量学习提出了一种语义漂移校准方法,通过均值漂移补偿和协方差校准来弥补特征分布差异,有效提升了任务间的稳定性和泛化能力。
English: This study introduces a semantic drift calibration method for class-incremental learning, addressing feature distribution gaps through mean shift compensation and covariance calibration to enhance stability and generalization across tasks.

Authors:Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, Volkan Cevher
Title: Training Deep Learning Models with Norm-Constrained LMOs
Abstract:
In this work, we study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball. We propose a new stochastic family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures, which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision. The code is available at https://github.com/LIONS-EPFL/scion .
中文: 本文提出了Scion随机优化算法,利用线性最小化预言机适应问题几何结构并适用于无约束问题,实验显示其能加速nanoGPT训练,具有内存高效性和超参数跨模型尺寸可迁移性。
English: This paper introduces Scion, a stochastic optimization algorithm that uses a linear minimization oracle to adapt to problem geometry and applies to unconstrained problems, demonstrating faster nanoGPT training with memory efficiency and hyperparameter transferability across model sizes.

Authors:Viacheslav Vasilev, Julia Agafonova, Nikolai Gerasimenko, Alexander Kapitanov, Polina Mikhailova, Evelina Mironova, Denis Dimitrov
Title: RusCode: Russian Cultural Code Benchmark for Text-to-Image Generation
Abstract:
Text-to-image generation models have gained popularity among users around the world. However, many of these models exhibit a strong bias toward English-speaking cultures, ignoring or misrepresenting the unique characteristics of other language groups, countries, and nationalities. The lack of cultural awareness can reduce the generation quality and lead to undesirable consequences such as unintentional insult, and the spread of prejudice. In contrast to the field of natural language processing, cultural awareness in computer vision has not been explored as extensively. In this paper, we strive to reduce this gap. We propose a RusCode benchmark for evaluating the quality of text-to-image generation containing elements of the Russian cultural code. To do this, we form a list of 19 categories that best represent the features of Russian visual culture. Our final dataset consists of 1250 text prompts in Russian and their translations into English. The prompts cover a wide range of topics, including complex concepts from art, popular culture, folk traditions, famous people's names, natural objects, scientific achievements, etc. We present the results of a human evaluation of the side-by-side comparison of Russian visual concepts representations using popular generative models.
中文:文本到图像生成模型常偏向英语文化,因此我们提出RusCode基准,通过涵盖俄罗斯文化特征的人类评估提示来提升其文化表现力。
English: Text-to-image models often exhibit cultural bias favoring English-speaking contexts, prompting the development of the RusCode benchmark to evaluate and improve their representation of Russian cultural elements through human-assessed prompts.

Authors:Duong Anh Kiet
Title: Hierarchical Document Parsing via Large Margin Feature Matching and Heuristics
Abstract:
We present our solution to the AAAI-25 VRD-IU challenge, achieving first place in the competition. Our approach integrates large margin loss for improved feature discrimination and employs heuristic rules to refine hierarchical relationships. By combining a deep learning-based matching strategy with greedy algorithms, we achieve a significant boost in accuracy while maintaining computational efficiency. Our method attains an accuracy of 0.98904 on the private leaderboard, demonstrating its effectiveness in document structure parsing. Source codes are publicly available at https://github.com/ffyyytt/VRUID-AAAI-DAKiet
我们的方案在AAAI-25 VRD-IU挑战赛中夺冠,通过结合大间隔损失提升特征区分度和启发式规则优化层级关系,以0.98904的准确率实现了高效计算。
Our solution won the AAAI-25 VRD-IU challenge by integrating large margin loss for better feature discrimination and heuristic rules to refine hierarchical relationships, achieving 0.98904 accuracy with efficient computation.

Authors:Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, Angela Yao
Title: EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
Abstract:
We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini 1.5 Pro) are around 33\% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic suggestions, we hope EgoTextVQA can serve as a solid testbed for research in egocentric scene-text QA assistance. Our dataset is released at: https://github.com/zhousheng97/EgoTextVQA.
中文: EgoTextVQA是一个包含1.5K视频和7K问题的新型基准,用于评估以自我为中心的视觉文本问答,目前最佳模型(Gemini 1.5 Pro)准确率仅33%,表明需要提升时间定位和多帧推理能力。
English: EgoTextVQA is a new benchmark with 1.5K videos and 7K questions for evaluating egocentric scene-text QA, where current models like Gemini 1.5 Pro achieve only 33% accuracy, highlighting the need for improved temporal grounding and multi-frame reasoning.

Authors:Rundong Liu, Andre Frade, Amal Vaidya, Maxime Labonne, Marcus Kaiser, Bismayan Chakrabarti, Jonathan Budd, Sean Moran
Title: On Iterative Evaluation and Enhancement of Code Quality Using GPT-4o
Abstract:
This paper introduces CodeQUEST, a novel framework leveraging Large Language Models (LLMs) to iteratively evaluate and enhance code quality across multiple dimensions, including readability, maintainability, efficiency, and security. The framework is divided into two main components: an Evaluator that assesses code quality across ten dimensions, providing both quantitative scores and qualitative summaries, and an Optimizer that iteratively improves the code based on the Evaluator's feedback. Our study demonstrates that CodeQUEST can effectively and robustly evaluate code quality, with its assessments aligning closely with established code quality metrics. Through a series of experiments using a curated dataset of Python and JavaScript examples, CodeQUEST demonstrated significant improvements in code quality, achieving a mean relative percentage improvement of 52.6%. The framework's evaluations were validated against a set of proxy metrics comprising of Pylint Score, Radon Maintainability Index, and Bandit output logs, showing a meaningful correlation. This highlights the potential of LLMs in automating code quality evaluation and improvement processes, presenting a significant advancement toward enhancing software development practices. The code implementation of the framework is available at: https://github.com/jpmorganchase/CodeQuest.
中文: CodeQUEST 是一个利用大语言模型迭代评估和提升代码质量的框架,通过实验证明其在多个维度上显著改善代码质量,并与现有指标高度相关。
English: CodeQUEST is a framework using LLMs to iteratively evaluate and enhance code quality across multiple dimensions, demonstrating significant improvements and strong correlation with established metrics through experiments.

Authors:Jingjie Zhang, Hanqun Cao, Zijun Gao, Xiaorui Wang, Chunbin Gu
Title: SAGEPhos: Sage Bio-Coupled and Augmented Fusion for Phosphorylation Site Detection
Abstract:
Phosphorylation site prediction based on kinase-substrate interaction plays a vital role in understanding cellular signaling pathways and disease mechanisms. Computational methods for this task can be categorized into kinase-family-focused and individual kinase-targeted approaches. Individual kinase-targeted methods have gained prominence for their ability to explore a broader protein space and provide more precise target information for kinase inhibitors. However, most existing individual kinase-based approaches focus solely on sequence inputs, neglecting crucial structural information. To address this limitation, we introduce SAGEPhos (Structure-aware kinAse-substrate bio-coupled and bio-auGmented nEtwork for Phosphorylation site prediction), a novel framework that modifies the semantic space of main protein inputs using auxiliary inputs at two distinct modality levels. At the inter-modality level, SAGEPhos introduces a Bio-Coupled Modal Fusion method, distilling essential kinase sequence information to refine task-oriented local substrate feature space, creating a shared semantic space that captures crucial kinase-substrate interaction patterns. Within the substrate's intra-modality domain, it focuses on Bio-Augmented Fusion, emphasizing 2D local sequence information while selectively incorporating 3D spatial information from predicted structures to complement the sequence space. Moreover, to address the lack of structural information in current datasets, we contribute a new, refined phosphorylation site prediction dataset, which incorporates crucial structural elements and will serve as a new benchmark for the field. Experimental results demonstrate that SAGEPhos significantly outperforms baseline methods. We release the SAGEPhos models and code at https://github.com/ZhangJJ26/SAGEPhos.
中文: SAGEPhos是一种新型结构感知框架,通过生物耦合和生物增强融合方法整合激酶序列与底物结构信息,显著提升了磷酸化位点预测性能,并提供了新的基准数据集。
English: SAGEPhos is a novel structure-aware framework that enhances phosphorylation site prediction by integrating kinase sequence and substrate structural information through bio-coupled and bio-augmented fusion methods, outperforming existing approaches and introducing a new benchmark dataset.

Authors:Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Eric Tang, Sumanth Hegde, Kourosh Hakhamaneshi, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
Title: LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
Abstract:
Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-of-thoughts (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements to elicit Long CoT remain poorly understood. In this work, we find that a Large Language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks, including 56.7% (+40.0%) on AIME 2024 and 57.0% (+8.1%) on LiveCodeBench, competitive to the proprietary o1-preview model's score of 44.6% and 59.1%. More importantly, we find that the structure of Long CoT is critical to the learning process, whereas the content of individual reasoning steps has minimal impact. Perturbations affecting content, such as training on incorrect samples or removing reasoning keywords, have little impact on performance. In contrast, structural modifications that disrupt logical consistency in the Long CoT, such as shuffling or deleting reasoning steps, significantly degrade accuracy. For example, a model trained on Long CoT samples with incorrect answers still achieves only 3.2% lower accuracy compared to training with fully correct samples. These insights deepen our understanding of how to elicit reasoning capabilities in LLMs and highlight key considerations for efficiently training the next generation of reasoning models. This is the academic paper of our previous released Sky-T1-32B-Preview model. Codes are available at https://github.com/NovaSky-AI/SkyThought.
中文: 大型推理模型通过数据高效的监督微调即可有效学习复杂推理,其中思维链的结构对性能至关重要,而内容变化影响甚微。
English: Large reasoning models can effectively learn complex reasoning through efficient supervised fine-tuning with minimal data, where the structure of the chain-of-thought is crucial for performance, while content variations have little impact.

Authors:Yuxu Lu, Ai Chen, Dong Yang, Ryan Wen Liu
Title: USRNet: Unified Scene Recovery Network for Enhancing Traffic Imaging under Multiple Adverse Weather Conditions
Abstract:
Advancements in computer vision technology have facilitated the extensive deployment of intelligent transportation systems and visual surveillance systems across various applications, including autonomous driving, public safety, and environmental monitoring. However, adverse weather conditions such as haze, rain, snow, and more complex mixed degradation can significantly degrade image quality. The degradation compromises the accuracy and reliability of these systems across various scenarios. To tackle the challenge of developing adaptable models for scene restoration, we introduce the unified scene recovery network (USRNet), capable of handling multiple types of image degradation. The USRNet features a sophisticated architecture consisting of a scene encoder, an attention-driven node independent learning mechanism (NILM), an edge decoder, and a scene restoration module. The scene encoder, powered by advanced residual blocks, extracts deep features from degraded images in a progressive manner, ensuring thorough encoding of degradation information. To enhance the USRNet's adaptability in diverse weather conditions, we introduce NILM, which enables the network to learn and respond to different scenarios with precision, thereby increasing its robustness. The edge decoder is designed to extract edge features with precision, which is essential for maintaining image sharpness. Experimental results demonstrate that USRNet surpasses existing methods in handling complex imaging degradations, thereby improving the accuracy and reliability of visual systems across diverse scenarios. The code resources for this work can be accessed in https://github.com/LouisYxLu/USRNet.
中文: USRNet模型通过创新的架构有效应对恶劣天气导致的多种图像退化问题,显著提升了视觉系统在不同场景下的可靠性。
English: The USRNet model effectively addresses multiple image degradations caused by adverse weather through its innovative architecture, enhancing the reliability of visual systems across diverse scenarios.

Authors:Zican Dong, Junyi Li, Jinhao Jiang, Mingyu Xu, Wayne Xin Zhao, Bingning Wang, Weipeng Chen
Title: LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation
Abstract:
Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, while the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation through minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden state of selected layers from the original model on short texts. Additionally, LongReD also introduces a short-to-long distillation, aligning the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance while maintaining comparable or even better capacity to handle long texts than baselines. Our code is available at https://github.com/RUCAIBox/LongReD.
中文摘要:LongReD方法通过恢复蒸馏技术解决大语言模型扩展上下文窗口时出现的分布漂移和灾难性遗忘问题,有效缓解了短文本任务性能下降,同时保持长文本处理能力。
English Summary: The LongReD method is introduced to counteract the performance decline of large language models on short-text tasks when their context windows are expanded, by addressing distribution drift and catastrophic forgetting through restoration distillation techniques.

Authors:Jusheng Zhang, Zimeng Huang, Yijia Fan, Ningyuan Liu, Mingyan Li, Zhuojie Yang, Jiawei Yao, Jian Wang, Keze Wang
Title: KABB: Knowledge-Aware Bayesian Bandits for Dynamic Expert Coordination in Multi-Agent Systems
Abstract:
As scaling large language models faces prohibitive costs, multi-agent systems emerge as a promising alternative, though challenged by static knowledge assumptions and coordination inefficiencies. We introduces Knowledge-Aware Bayesian Bandits (KABB), a novel framework that enhances multi-agent system coordination through semantic understanding and dynamic adaptation. The framework features three key innovations: a three-dimensional knowledge distance model for deep semantic understanding, a dual-adaptation mechanism for continuous expert optimization, and a knowledge-aware Thompson Sampling strategy for efficient expert selection. Extensive evaluation demonstrates KABB achieves an optimal cost-performance balance, maintaining high performance while keeping computational demands relatively low in multi-agent coordination.
中文: KABB框架通过语义知识建模与动态优化机制,在多智能体协调中实现了高性能与低计算成本的平衡。
English: KABB is a novel multi-agent coordination framework that leverages semantic knowledge modeling and dynamic adaptation to achieve high performance with low computational costs.

Authors:Zilu Dong, Xiangqing Shen, Rui Xia
Title: MEMIT-Merge: Addressing MEMIT's Key-Value Conflicts in Same-Subject Batch Editing for LLMs
Abstract:
As large language models continue to scale up, knowledge editing techniques that modify models' internal knowledge without full retraining have gained significant attention. MEMIT, a prominent batch editing algorithm, stands out for its capability to perform mass knowledge modifications. However, we uncover that MEMIT's editing efficacy significantly deteriorates when processing batches containing multiple edits sharing the same subject. Our analysis reveals this stems from MEMIT's key value modeling framework: identical keys (derived from the shared subject) are forced to represent different values (corresponding to different knowledge), resulting in update conflicts during editing. Addressing this issue, we propose MEMIT-Merge, an enhanced approach that merges value computation processes for facts sharing the same subject, effectively resolving the performance degradation in samesubject batch editing scenarios. Experimental results demonstrate that when MEMIT's edit success rate drops to around 50% at larger batch sizes, MEMIT-Merge maintains a success rate exceeding 90%, showcasing remarkable robustness to subject entity collisions. The code is available at https://github.com/NUSTM/ MEMIT-Merge.
中文: MEMIT算法在处理同主体批量知识编辑时,因相同键值对应不同知识导致更新冲突,性能显著下降至约50%成功率;而提出的MEMIT-Merge方法通过合并同主体事实的值计算过程,将成功率稳定保持在90%以上,有效解决了该问题。
English: MEMIT, a batch knowledge editing method for large language models, suffers performance degradation when handling multiple edits with the same subject due to conflicting key-value updates, but the proposed MEMIT-Merge enhancement resolves this by merging value computations, maintaining over 90% success rate versus MEMIT's 50% drop.

Authors:Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, Junxian He
Title: CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
Abstract:
Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses diverse reasoning patterns inherently embedded in contextually-grounded codes, through transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives -- like logic flow planning, state-space searching, decision tree traversal, and modular decomposition -- while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate CodeI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models are available at https://github.com/hkust-nlp/CodeIO.
中文摘要:CodeI/O方法通过将代码转化为自然语言的输入输出预测,使语言模型学习通用推理模式,从而在多种推理任务中实现性能提升。
English Summary: The CodeI/O method enhances reasoning in language models by transforming code into natural language input-output predictions, exposing universal reasoning patterns and improving performance across diverse tasks.

Authors:Xiaopeng Ye, Chen Xu, Zhongxiang Sun, Jun Xu, Gang Wang, Zhenhua Dong, Ji-Rong Wen
Title: CreAgent: Towards Long-Term Evaluation of Recommender System under Platform-Creator Information Asymmetry
Abstract:
Ensuring the long-term sustainability of recommender systems (RS) emerges as a crucial issue. Traditional offline evaluation methods for RS typically focus on immediate user feedback, such as clicks, but they often neglect the long-term impact of content creators. On real-world content platforms, creators can strategically produce and upload new items based on user feedback and preference trends. While previous studies have attempted to model creator behavior, they often overlook the role of information asymmetry. This asymmetry arises because creators primarily have access to feedback on the items they produce, while platforms possess data on the entire spectrum of user feedback. Current RS simulators, however, fail to account for this asymmetry, leading to inaccurate long-term evaluations. To address this gap, we propose CreAgent, a Large Language Model (LLM)-empowered creator simulation agent. By incorporating game theory's belief mechanism and the fast-and-slow thinking framework, CreAgent effectively simulates creator behavior under conditions of information asymmetry. Additionally, we enhance CreAgent's simulation ability by fine-tuning it using Proximal Policy Optimization (PPO). Our credibility validation experiments show that CreAgent aligns well with the behaviors between real-world platform and creator, thus improving the reliability of long-term RS evaluations. Moreover, through the simulation of RS involving CreAgents, we can explore how fairness- and diversity-aware RS algorithms contribute to better long-term performance for various stakeholders. CreAgent and the simulation platform are publicly available at https://github.com/shawnye2000/CreAgent.
中文摘要:CreAgent是一种基于大语言模型的创作者模拟代理,通过结合博弈论信念机制和快慢思维框架,有效解决了信息不对称下创作者行为的模拟问题,提升了推荐系统长期评估的可信度。
English Summary: CreAgent is an LLM-based simulation agent that addresses the limitations of current recommender system evaluations by modeling creator behavior under information asymmetry, enhancing long-term assessment reliability.

Authors:Chengkai Liu, Yangtian Zhang, Jianling Wang, Rex Ying, James Caverlee
Title: Flow Matching for Collaborative Filtering
Abstract:
Generative models have shown great promise in collaborative filtering by capturing the underlying distribution of user interests and preferences. However, existing approaches struggle with inaccurate posterior approximations and misalignment with the discrete nature of recommendation data, limiting their expressiveness and real-world performance. To address these limitations, we propose FlowCF, a novel flow-based recommendation system leveraging flow matching for collaborative filtering. We tailor flow matching to the unique challenges in recommendation through two key innovations: (1) a behavior-guided prior that aligns with user behavior patterns to handle the sparse and heterogeneous user-item interactions, and (2) a discrete flow framework to preserve the binary nature of implicit feedback while maintaining the benefits of flow matching, such as stable training and efficient inference. Extensive experiments demonstrate that FlowCF achieves state-of-the-art recommendation accuracy across various datasets with the fastest inference speed, making it a compelling approach for real-world recommender systems. The code is available at https://github.com/chengkai-liu/FlowCF.
中文:提出的FlowCF系统通过行为引导先验和离散流框架,克服了现有生成模型在协同过滤中的局限性,实现了最优推荐精度和最快推理速度,适用于实际推荐系统。
English: The proposed FlowCF system overcomes limitations of existing generative models in collaborative filtering by introducing a behavior-guided prior and discrete flow framework, achieving state-of-the-art accuracy and fastest inference speed for real-world recommendations.

Authors:Ruining Deng, Yihe Yang, David J. Pisapia, Benjamin Liechty, Junchao Zhu, Juming Xiong, Junlin Guo, Zhengyi Lu, Jiacheng Wang, Xing Yao, Runxuan Yu, Rendong Zhang, Gaurav Rudravaram, Mengmeng Yin, Pinaki Sarder, Haichun Yang, Yuankai Huo, Mert R. Sabuncu
Title: CASC-AI: Consensus-aware Self-corrective Learning for Noise Cell Segmentation
Abstract:
Multi-class cell segmentation in high-resolution gigapixel whole slide images (WSIs) is crucial for various clinical applications. However, training such models typically requires labor-intensive, pixel-wise annotations by domain experts. Recent efforts have democratized this process by involving lay annotators without medical expertise. However, conventional non-corrective approaches struggle to handle annotation noise adaptively because they lack mechanisms to mitigate false positives (FP) and false negatives (FN) at both the image-feature and pixel levels. In this paper, we propose a consensus-aware self-corrective AI agent that leverages the Consensus Matrix to guide its learning process. The Consensus Matrix defines regions where both the AI and annotators agree on cell and non-cell annotations, which are prioritized with stronger supervision. Conversely, areas of disagreement are adaptively weighted based on their feature similarity to high-confidence consensus regions, with more similar regions receiving greater attention. Additionally, contrastive learning is employed to separate features of noisy regions from those of reliable consensus regions by maximizing their dissimilarity. This paradigm enables the model to iteratively refine noisy labels, enhancing its robustness. Validated on one real-world lay-annotated cell dataset and two reasoning-guided simulated noisy datasets, our method demonstrates improved segmentation performance, effectively correcting FP and FN errors and showcasing its potential for training robust models on noisy datasets. The official implementation and cell annotations are publicly available at https://github.com/ddrrnn123/CASC-AI.
中文: 本文提出了一种基于共识矩阵的自校正AI代理,通过对比学习和自适应加权机制迭代修正非专业标注者产生的细胞标注噪声,在多个数据集上验证了其能有效改善分割性能并纠正误报和漏报错误。
English: This paper introduces a consensus-aware self-corrective AI agent that leverages a Consensus Matrix and contrastive learning to adaptively refine noisy cell annotations from non-expert annotators, demonstrating improved segmentation performance by effectively correcting false positives and negatives across multiple datasets.

Authors:Yelin Chen, Fanjin Zhang, Jie Tang
Title: Small Language Model Makes an Effective Long Text Extractor
Abstract:
Named Entity Recognition (NER) is a fundamental problem in natural language processing (NLP). However, the task of extracting longer entity spans (e.g., awards) from extended texts (e.g., homepages) is barely explored. Current NER methods predominantly fall into two categories: span-based methods and generation-based methods. Span-based methods require the enumeration of all possible token-pair spans, followed by classification on each span, resulting in substantial redundant computations and excessive GPU memory usage. In contrast, generation-based methods involve prompting or fine-tuning large language models (LLMs) to adapt to downstream NER tasks. However, these methods struggle with the accurate generation of longer spans and often incur significant time costs for effective fine-tuning. To address these challenges, this paper introduces a lightweight span-based NER method called SeNER, which incorporates a bidirectional arrow attention mechanism coupled with LogN-Scaling on the [CLS] token to embed long texts effectively, and comprises a novel bidirectional sliding-window plus-shaped attention (BiSPA) mechanism to reduce redundant candidate token-pair spans significantly and model interactions between token-pair spans simultaneously. Extensive experiments demonstrate that our method achieves state-of-the-art extraction accuracy on three long NER datasets and is capable of extracting entities from long texts in a GPU-memory-friendly manner. Code: https://github.com/THUDM/scholar-profiling/tree/main/sener
中文: 本文提出轻量级跨度命名实体识别方法SeNER,通过创新的注意力机制有效处理长文本中的实体跨度,在实现最先进抽取精度的同时保持GPU内存友好性。
English: This paper introduces SeNER, a lightweight span-based NER method that effectively handles long entity spans in extended texts through innovative attention mechanisms, achieving state-of-the-art accuracy while being GPU-memory-efficient.

Authors:Yuechen Xie, Jie Song, Mengqi Xue, Haofei Zhang, Xingen Wang, Bingde Hu, Genlang Chen, Mingli Song
Title: Dataset Ownership Verification in Contrastive Pre-trained Models
Abstract:
High-quality open-source datasets, which necessitate substantial efforts for curation, has become the primary catalyst for the swift progress of deep learning. Concurrently, protecting these datasets is paramount for the well-being of the data owner. Dataset ownership verification emerges as a crucial method in this domain, but existing approaches are often limited to supervised models and cannot be directly extended to increasingly popular unsupervised pre-trained models. In this work, we propose the first dataset ownership verification method tailored specifically for self-supervised pre-trained models by contrastive learning. Its primary objective is to ascertain whether a suspicious black-box backbone has been pre-trained on a specific unlabeled dataset, aiding dataset owners in upholding their rights. The proposed approach is motivated by our empirical insights that when models are trained with the target dataset, the unary and binary instance relationships within the embedding space exhibit significant variations compared to models trained without the target dataset. We validate the efficacy of this approach across multiple contrastive pre-trained models including SimCLR, BYOL, SimSiam, MOCO v3, and DINO. The results demonstrate that our method rejects the null hypothesis with a $p$-value markedly below $0.05$, surpassing all previous methodologies. Our code is available at https://github.com/xieyc99/DOV4CL.
中文: 本研究首次提出针对自监督预训练模型的数据集所有权验证方法,通过对比学习分析嵌入空间中的实例关系来准确判断模型是否使用特定数据集训练,并在多个模型中验证了其显著有效性。
English: This study introduces the first dataset ownership verification method for self-supervised pre-trained models using contrastive learning, effectively determining if a model was trained on a specific dataset by analyzing embedding space relationships, with validation across multiple models showing significant results.

Authors:Wei Wu, Qiuyi Li, Mingyang Li, Kun Fu, Fuli Feng, Jieping Ye, Hui Xiong, Zheng Wang
Title: GENERator: A Long-Context Generative Genomic Foundation Model
Abstract:
Advancements in DNA sequencing technologies have significantly improved our ability to decode genomic sequences. However, the prediction and interpretation of these sequences remain challenging due to the intricate nature of genetic material. Large language models (LLMs) have introduced new opportunities for biological sequence analysis. Recent developments in genomic language models have underscored the potential of LLMs in deciphering DNA sequences. Nonetheless, existing models often face limitations in robustness and application scope, primarily due to constraints in model structure and training data scale. To address these limitations, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of eukaryotic DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences that translate into proteins structurally analogous to known families. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles. These capabilities position the GENERator as a pivotal tool for genomic research and biotechnological advancement, enhancing our ability to interpret and predict complex biological systems and enabling precise genomic interventions. Implementation details and supplementary resources are available at https://github.com/GenerTeam/GENERator.
中文: GENERator是一个拥有12亿参数和98千碱基对上下文长度的生成式基因组基础模型,基于3860亿碱基对真核DNA训练,在生成蛋白质编码序列和优化具有特定活性增强子序列方面表现卓越,为基因组研究和生物技术提供了关键工具。
English: The GENERator is a generative genomic foundation model with 1.2B parameters and a 98k bp context length, trained on 386B bp of eukaryotic DNA, achieving state-of-the-art performance in generating protein-coding sequences and optimizing enhancer sequences for genomic research and biotechnology.

Authors:Xuefeng Liu, Songhao Jiang, Siyu Chen, Zhuoran Yang, Yuxin Chen, Ian Foster, Rick Stevens
Title: DrugImproverGPT: A Large Language Model for Drug Optimization with Fine-Tuning via Structured Policy Optimization
Abstract:
Finetuning a Large Language Model (LLM) is crucial for generating results towards specific objectives. This research delves into the realm of drug optimization and introduce a novel reinforcement learning algorithm to finetune a drug optimization LLM-based generative model, enhancing the original drug across target objectives, while retains the beneficial chemical properties of the original drug. This work is comprised of two primary components: (1) DrugImprover: A framework tailored for improving robustness and efficiency in drug optimization. It includes a LLM designed for drug optimization and a novel Structured Policy Optimization (SPO) algorithm, which is theoretically grounded. This algorithm offers a unique perspective for fine-tuning the LLM-based generative model by aligning the improvement of the generated molecule with the input molecule under desired objectives. (2) A dataset of 1 million compounds, each with OEDOCK docking scores on 5 human proteins associated with cancer cells and 24 binding sites from SARS-CoV-2 virus. We conduct a comprehensive evaluation of SPO and demonstrate its effectiveness in improving the original drug across target properties. Our code and dataset will be publicly available at: https://github.com/xuefeng-cs/DrugImproverGPT.
中文摘要:本研究提出了一种新颖的强化学习算法——结构化策略优化(SPO),用于微调药物优化大语言模型,在提升目标特性的同时保持原始药物的有益化学性质。
English Summary: This research introduces a reinforcement learning algorithm called Structured Policy Optimization (SPO) to fine-tune a drug optimization LLM, improving target properties while preserving beneficial chemical characteristics of original drugs.

Authors:Shaokui Wei, Shanchao Yang, Jiayin Liu, Hongyuan Zha
Title: Revisiting the Auxiliary Data in Backdoor Purification
Abstract:
Backdoor attacks occur when an attacker subtly manipulates machine learning models during the training phase, leading to unintended behaviors when specific triggers are present. To mitigate such emerging threats, a prevalent strategy is to cleanse the victim models by various backdoor purification techniques. Despite notable achievements, current state-of-the-art (SOTA) backdoor purification techniques usually rely on the availability of a small clean dataset, often referred to as auxiliary dataset. However, acquiring an ideal auxiliary dataset poses significant challenges in real-world applications. This study begins by assessing the SOTA backdoor purification techniques across different types of real-world auxiliary datasets. Our findings indicate that the purification effectiveness fluctuates significantly depending on the type of auxiliary dataset used. Specifically, a high-quality in-distribution auxiliary dataset is essential for effective purification, whereas datasets from varied or out-of-distribution sources significantly degrade the defensive performance. Based on this, we propose Guided Input Calibration (GIC), which aims to improve purification efficacy by employing a learnable transformation. Guided by the victim model itself, GIC aligns the characteristics of the auxiliary dataset with those of the original training set. Comprehensive experiments demonstrate that GIC can substantially enhance purification performance across diverse types of auxiliary datasets. The code and data will be available via https://github.com/shawkui/BackdoorBenchER.
中文: 后门攻击在训练阶段暗中操控机器学习模型,现有净化技术依赖辅助数据集但效果不稳,本研究提出引导输入校准(GIC),通过可学习变换使辅助数据与原始训练集特征对齐,显著提升了各类数据集的净化性能。
English: Backdoor attacks compromise machine learning models during training, and while current purification methods require clean auxiliary data, this study introduces Guided Input Calibration (GIC) to enhance effectiveness by aligning auxiliary data with the original training set, improving performance across diverse datasets.

Authors:Sen Peng, Mingyue Wang, Jianfei He, Jijia Yang, Xiaohua Jia
Title: CAT: Contrastive Adversarial Training for Evaluating the Robustness of Protective Perturbations in Latent Diffusion Models
Abstract:
Latent diffusion models have recently demonstrated superior capabilities in many downstream image synthesis tasks. However, customization of latent diffusion models using unauthorized data can severely compromise the privacy and intellectual property rights of data owners. Adversarial examples as protective perturbations have been developed to defend against unauthorized data usage by introducing imperceptible noise to customization samples, preventing diffusion models from effectively learning them. In this paper, we first reveal that the primary reason adversarial examples are effective as protective perturbations in latent diffusion models is the distortion of their latent representations, as demonstrated through qualitative and quantitative experiments. We then propose the Contrastive Adversarial Training (CAT) utilizing lightweight adapters as an adaptive attack against these protection methods, highlighting their lack of robustness. Extensive experiments demonstrate that our CAT method significantly reduces the effectiveness of protective perturbations in customization, urging the community to reconsider and improve the robustness of existing protective perturbations. The code is available at https://github.com/senp98/CAT.
Chinese: 潜在扩散模型易受对抗样本干扰,因其扭曲潜在表示,但提出的对比对抗训练(CAT)方法有效削弱了这些保护性扰动的防御效果,揭示了其鲁棒性不足。
English: Latent diffusion models are vulnerable to adversarial examples that distort latent representations, but the proposed Contrastive Adversarial Training (CAT) method effectively counteracts these protective perturbations, exposing their lack of robustness.

Authors:Elias Lumer, Pradeep Honaganahalli Basavaraju, Myles Mason, James A. Burke, Vamse Kumar Subbiah
Title: Graph RAG-Tool Fusion
Abstract:
Recent developments in retrieval-augmented generation (RAG) for selecting relevant tools from a tool knowledge base enable LLM agents to scale their complex tool calling capabilities to hundreds or thousands of external tools, APIs, or agents-as-tools. However, traditional RAG-based tool retrieval fails to capture structured dependencies between tools, limiting the retrieval accuracy of a retrieved tool's dependencies. For example, among a vector database of tools, a "get stock price" API requires a "stock ticker" parameter from a "get stock ticker" API, and both depend on OS-level internet connectivity tools. In this paper, we address this limitation by introducing Graph RAG-Tool Fusion, a novel plug-and-play approach that combines the strengths of vector-based retrieval with efficient graph traversal to capture all relevant tools (nodes) along with any nested dependencies (edges) within the predefined tool knowledge graph. We also present ToolLinkOS, a new tool selection benchmark of 573 fictional tools, spanning over 15 industries, each with an average of 6.3 tool dependencies. We demonstrate that Graph RAG-Tool Fusion achieves absolute improvements of 71.7% and 22.1% over naïve RAG on ToolLinkOS and ToolSandbox benchmarks, respectively (mAP@10). ToolLinkOS dataset is available at https://github.com/EliasLumer/Graph-RAG-Tool-Fusion-ToolLinkOS
中文: 本文提出Graph RAG-Tool Fusion方法,通过结合向量检索与图遍历技术来捕捉工具间依赖关系,在新型基准测试上相比传统RAG实现了显著性能提升。
English: This paper introduces Graph RAG-Tool Fusion, a plug-and-play method that enhances tool retrieval by combining vector-based search with graph traversal to capture tool dependencies, achieving significant improvements over traditional RAG on new benchmarks.

Authors:Xingpei Ma, Jiaran Cai, Yuansheng Guan, Shenneng Huang, Qiang Zhang, Shunsi Zhang
Title: Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion
Abstract:
Recent diffusion-based talking face generation models have demonstrated impressive potential in synthesizing videos that accurately match a speech audio clip with a given reference identity. However, existing approaches still encounter significant challenges due to uncontrollable factors, such as inaccurate lip-sync, inappropriate head posture and the lack of fine-grained control over facial expressions. In order to introduce more face-guided conditions beyond speech audio clips, a novel two-stage training framework Playmate is proposed to generate more lifelike facial expressions and talking faces. In the first stage, we introduce a decoupled implicit 3D representation along with a meticulously designed motion-decoupled module to facilitate more accurate attribute disentanglement and generate expressive talking videos directly from audio cues. Then, in the second stage, we introduce an emotion-control module to encode emotion control information into the latent space, enabling fine-grained control over emotions and thereby achieving the ability to generate talking videos with desired emotion. Extensive experiments demonstrate that Playmate not only outperforms existing state-of-the-art methods in terms of video quality, but also exhibits strong competitiveness in lip synchronization while offering improved flexibility in controlling emotion and head pose. The code will be available at https://github.com/Playmate111/Playmate.
Chinese: 提出的Playmate框架采用两阶段训练方法,通过解耦的3D表征和情感控制模块,能够生成具有精确唇部同步和精细情感控制的逼真说话人脸,在视频质量和控制灵活性方面优于现有方法。
English: The proposed Playmate framework introduces a two-stage training approach with decoupled 3D representation and emotion-control modules to generate lifelike talking faces with accurate lip-sync and fine-grained emotional control, outperforming existing methods in video quality and flexibility.

Authors:Ravi Shah, Atsushi Fukuda, Quan Huu Cap
Title: Color-Quality Invariance for Robust Medical Image Segmentation
Abstract:
Single-source domain generalization (SDG) in medical image segmentation remains a significant challenge, particularly for images with varying color distributions and qualities. Previous approaches often struggle when models trained on high-quality images fail to generalize to low-quality test images due to these color and quality shifts. In this work, we propose two novel techniques to enhance generalization: dynamic color image normalization (DCIN) module and color-quality generalization (CQG) loss. The DCIN dynamically normalizes the color of test images using two reference image selection strategies. Specifically, the DCIN utilizes a global reference image selection (GRIS), which finds a universal reference image, and a local reference image selection (LRIS), which selects a semantically similar reference image per test sample. Additionally, CQG loss enforces invariance to color and quality variations by ensuring consistent segmentation predictions across transformed image pairs. Experimental results show that our proposals significantly improve segmentation performance over the baseline on two target domain datasets, despite being trained solely on a single source domain. Notably, our model achieved up to a 32.3-point increase in Dice score compared to the baseline, consistently producing robust and usable results even under substantial domain shifts. Our work contributes to the development of more robust medical image segmentation models that generalize across unseen domains. The implementation code is available at https://github.com/RaviShah1/DCIN-CQG.
中文摘要:本研究提出了动态颜色图像归一化和颜色质量泛化损失两种新技术,显著提升了医学图像分割中单源域泛化能力,在面对颜色和质量变化的未知域时表现出优越性能。
English Summary: This study introduces two novel techniques, dynamic color image normalization and color-quality generalization loss, to enhance single-source domain generalization in medical image segmentation, significantly improving performance on unseen domains with varying color and quality.

Authors:Fan Liu, Wenshuo Chao, Naiqiang Tan, Hao Liu
Title: Bag of Tricks for Inference-time Computation of LLM Reasoning
Abstract:
With the advancement of large language models (LLMs), solving complex reasoning tasks has gained increasing attention. Inference-time computation methods (e.g., Best-of-N, beam search, et al.) are particularly valuable as they can enhance reasoning performance without modifying model parameters or requiring additional training. However, these techniques come with implementation challenges, and most existing methods remain at the proof-of-concept stage with limited practical adoption due to their computational complexity and varying effectiveness across different tasks. In this paper, we investigate and benchmark diverse inference-time computation strategies across reasoning tasks of varying complexity. Since most current methods rely on a proposer-verifier pipeline that first generates candidate solutions (e.g., reasoning solutions) and then selects the best one based on reward signals (e.g., RLHF rewards, process rewards), our research focuses on optimizing both candidate solution generation (e.g., instructing prompts, hyperparameters such as temperature and top-p) and reward mechanisms (e.g., self-evaluation, reward types). Through extensive experiments (more than 20,000 A100-80G GPU hours with over 1,000 experiments) across a variety of models (e.g., Llama, Qwen, and Mistral families) of various sizes, our ablation studies reveal that previously overlooked strategies can significantly enhance performance (e.g., tuning temperature can improve reasoning task performance by up to 5%). Furthermore, we establish a standardized benchmark for inference-time computation by systematically evaluating six representative methods across eight reasoning tasks. These findings provide a stronger foundation for future research. The code is available at https://github.com/usail-hkust/benchmark_inference_time_computation_LLM
大语言模型通过无需重新训练的推理时计算方法提升复杂推理能力,我们的大规模基准测试发现温度调节等优化策略最高可提升性能5%,为未来研究建立了标准化评估体系。
Large language models benefit from inference-time computation methods that enhance reasoning without retraining, and our extensive benchmarking reveals optimized strategies like temperature tuning can boost performance by up to 5%, establishing a standardized evaluation framework for future research.

Authors:ByungOk Han, Woo-han Yun, Beom-Su Seo, Jaehong Kim
Title: Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired
Abstract:
Guide dog robots offer promising solutions to enhance mobility and safety for visually impaired individuals, addressing the limitations of traditional guide dogs, particularly in perceptual intelligence and communication. With the emergence of Vision-Language Models (VLMs), robots are now capable of generating natural language descriptions of their surroundings, aiding in safer decision-making. However, existing VLMs often struggle to accurately interpret and convey spatial relationships, which is crucial for navigation in complex environments such as street crossings. We introduce the Space-Aware Instruction Tuning (SAIT) dataset and the Space-Aware Benchmark (SA-Bench) to address the limitations of current VLMs in understanding physical environments. Our automated data generation pipeline focuses on the virtual path to the destination in 3D space and the surroundings, enhancing environmental comprehension and enabling VLMs to provide more accurate guidance to visually impaired individuals. We also propose an evaluation protocol to assess VLM effectiveness in delivering walking guidance. Comparative experiments demonstrate that our space-aware instruction-tuned model outperforms state-of-the-art algorithms. We have fully open-sourced the SAIT dataset and SA-Bench, along with the related code, at https://github.com/byungokhan/Space-awareVLM
中文摘要:本文提出的SAIT数据集和SA-Bench旨在提升视觉语言模型的空间感知能力,使导盲犬机器人能为视障人士提供更安全的导航指引,该模型在实验中表现优异且所有资源均已开源。
English Summary: The SAIT dataset and SA-Bench are introduced to enhance Vision-Language Models' spatial understanding for safer guide dog robots, with the proposed model outperforming existing methods and all resources being open-sourced.

Authors:Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari
Title: Explaining 3D Computed Tomography Classifiers with Counterfactuals
Abstract:
Counterfactual explanations enhance the interpretability of deep learning models in medical imaging, yet adapting them to 3D CT scans poses challenges due to volumetric complexity and resource demands. We extend the Latent Shift counterfactual generation method from 2D applications to explain 3D computed tomography (CT) scans classifiers. We address the challenges associated with 3D classifiers, such as limited training samples and high memory demands, by implementing a slice-based autoencoder and gradient blocking except for specific chunks of slices. This method leverages a 2D encoder trained on CT slices, which are subsequently combined to maintain 3D context. We demonstrate this technique on two models for clinical phenotype prediction and lung segmentation. Our approach is both memory-efficient and effective for generating interpretable counterfactuals in high-resolution 3D medical imaging.
中文: 本研究通过采用基于切片的自动编码器和梯度阻断技术,将潜在偏移方法扩展至三维CT扫描的反事实解释生成,在保持临床可解释性的同时,有效解决了内存限制和体积复杂性的问题。
English: The study extends the Latent Shift method to generate counterfactual explanations for 3D CT scans by using a slice-based autoencoder and gradient blocking, effectively addressing memory constraints and volumetric complexity while maintaining interpretability in clinical applications.

Authors:Girish A. Koushik, Diptesh Kanojia, Helen Treharne
Title: Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content
Abstract:
Social media platforms enable the propagation of hateful content across different modalities such as textual, auditory, and visual, necessitating effective detection methods. While recent approaches have shown promise in handling individual modalities, their effectiveness across different modality combinations remains unexplored. This paper presents a systematic analysis of fusion-based approaches for multimodal hate detection, focusing on their performance across video and image-based content. Our comprehensive evaluation reveals significant modality-specific limitations: while simple embedding fusion achieves state-of-the-art performance on video content (HateMM dataset) with a 9.9% points F1-score improvement, it struggles with complex image-text relationships in memes (Hateful Memes dataset). Through detailed ablation studies and error analysis, we demonstrate how current fusion approaches fail to capture nuanced cross-modal interactions, particularly in cases involving benign confounders. Our findings provide crucial insights for developing more robust hate detection systems and highlight the need for modality-specific architectural considerations. The code is available at https://github.com/gak97/Video-vs-Meme-Hate.
Chinese: 本研究系统评估了基于融合的多模态仇恨内容检测方法,发现简单嵌入融合在视频内容上表现优异,但在处理表情包中复杂的图文关系时存在不足,因其难以捕捉细微的跨模态交互特征。
English: This study systematically evaluates fusion-based approaches for multimodal hate detection, revealing that while simple embedding fusion excels with video content, it struggles with complex image-text relationships in memes due to limitations in capturing nuanced cross-modal interactions.

Authors:Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia
Title: Cardiverse: Harnessing LLMs for Novel Card Game Prototyping
Abstract:
The prototyping of computer games, particularly card games, requires extensive human effort in creative ideation and gameplay evaluation. Recent advances in Large Language Models (LLMs) offer opportunities to automate and streamline these processes. However, it remains challenging for LLMs to design novel game mechanics beyond existing databases, generate consistent gameplay environments, and develop scalable gameplay AI for large-scale evaluations. This paper addresses these challenges by introducing a comprehensive automated card game prototyping framework. The approach highlights a graph-based indexing method for generating novel game variations, an LLM-driven system for consistent game code generation validated by gameplay records, and a gameplay AI constructing method that uses an ensemble of LLM-generated heuristic functions optimized through self-play. These contributions aim to accelerate card game prototyping, reduce human labor, and lower barriers to entry for game developers. For code repo visit this http URL https://github.com/danruili/Cardiverse
中文: 本文提出了一种自动化卡牌游戏原型框架,通过基于图的索引和大型语言模型驱动的系统,生成新颖游戏机制、确保一致的游戏体验并开发可扩展的AI,从而减少人力投入并降低开发门槛。
English: This paper introduces an automated card game prototyping framework that uses graph-based indexing and LLM-driven systems to generate novel game mechanics, ensure consistent gameplay, and develop scalable AI, thereby reducing human effort and barriers for developers.

Authors:Ze Sheng, Zhicheng Chen, Shuning Gu, Heqing Huang, Guofei Gu, Jeff Huang
Title: LLMs in Software Security: A Survey of Vulnerability Detection Techniques and Insights
Abstract:
Large Language Models (LLMs) are emerging as transformative tools for software vulnerability detection, addressing critical challenges in the security domain. Traditional methods, such as static and dynamic analysis, often falter due to inefficiencies, high false positive rates, and the growing complexity of modern software systems. By leveraging their ability to analyze code structures, identify patterns, and generate repair suggestions, LLMs, exemplified by models like GPT, BERT, and CodeBERT, present a novel and scalable approach to mitigating vulnerabilities. This paper provides a detailed survey of LLMs in vulnerability detection. It examines key aspects, including model architectures, application methods, target languages, fine-tuning strategies, datasets, and evaluation metrics. We also analyze the scope of current research problems, highlighting the strengths and weaknesses of existing approaches. Further, we address challenges such as cross-language vulnerability detection, multimodal data integration, and repository-level analysis. Based on these findings, we propose solutions for issues like dataset scalability, model interpretability, and applications in low-resource scenarios. Our contributions are threefold: (1) a systematic review of how LLMs are applied in vulnerability detection; (2) an analysis of shared patterns and differences across studies, with a unified framework for understanding the field; and (3) a summary of key challenges and future research directions. This work provides valuable insights for advancing LLM-based vulnerability detection. We also maintain and regularly update latest selected paper on https://github.com/OwenSanzas/LLM-For-Vulnerability-Detection
中文: 大语言模型通过分析代码结构和生成修复建议,为软件漏洞检测提供了变革性方法,克服了传统技术的局限性,但在可扩展性和可解释性方面仍面临挑战。
English: Large Language Models offer a transformative approach to software vulnerability detection by analyzing code structures and generating repair suggestions, addressing limitations of traditional methods while presenting challenges in scalability and interpretability.

Authors:Art Poon
Title: Building networks of shared research interests by embedding words into a representation space
Abstract:
Departments within a university are not only administrative units, but also an effort to gather investigators around common fields of academic study. A pervasive challenge is connecting members with shared research interests both within and between departments. Here I describe a workflow that adapts methods from natural language processing to generate a network connecting $n=79$ members of a university department, or multiple departments within a faculty ($n=278$), based on common topics in their research publications. After extracting and processing terms from $n=16,901$ abstracts in the PubMed database, the co-occurrence of terms is encoded in a sparse document-term matrix. Based on the angular distances between the presence-absence vectors for every pair of terms, I use the uniform manifold approximation and projection (UMAP) method to embed the terms into a representational space such that terms that tend to appear in the same documents are closer together. Each author's corpus defines a probability distribution over terms in this space. Using the Wasserstein distance to quantify the similarity between these distributions, I generate a distance matrix among authors that can be analyzed and visualized as a graph. I demonstrate that this nonparametric method produces clusters with distinct themes that are consistent with some academic divisions, while identifying untapped connections among members. A documented workflow comprising Python and R scripts is available under the MIT license at https://github.com/PoonLab/tragula.
中文: 本研究采用自然语言处理方法,通过分析论文摘要构建了大学院系成员的研究关联网络,既能识别与现有学术划分一致的聚类,又能发现潜在的跨领域合作机会。
English: This study introduces a computational workflow using natural language processing to map research connections among university faculty by analyzing publication abstracts, revealing both established clusters and novel interdisciplinary links.

Authors:Kwanghee Choi, Eunjung Yeo, Kalvin Chang, Shinji Watanabe, David Mortensen
Title: Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment
Abstract:
Allophony refers to the variation in the phonetic realization of a phoneme based on its phonetic environment. Modeling allophones is crucial for atypical pronunciation assessment, which involves distinguishing atypical from typical pronunciations. However, recent phoneme classifier-based approaches often simplify this by treating various realizations as a single phoneme, bypassing the complexity of modeling allophonic variation. Motivated by the acoustic modeling capabilities of frozen self-supervised speech model (S3M) features, we propose MixGoP, a novel approach that leverages Gaussian mixture models to model phoneme distributions with multiple subclusters. Our experiments show that MixGoP achieves state-of-the-art performance across four out of five datasets, including dysarthric and non-native speech. Our analysis further suggests that S3M features capture allophonic variation more effectively than MFCCs and Mel spectrograms, highlighting the benefits of integrating MixGoP with S3M features.
Chinese: MixGoP是一种利用高斯混合模型对音素分布进行多子簇建模的新方法,在多数数据集上实现了最优性能,并证明自监督语音模型特征比传统方法能更有效地捕捉音位变体差异。
English: MixGoP is a novel approach that uses Gaussian mixture models to model phoneme distributions with multiple subclusters, achieving state-of-the-art performance on most datasets and demonstrating that self-supervised speech model features capture allophonic variation more effectively than traditional methods.

Authors:Haoqi Wang, Tong Zhang, Mathieu Salzmann
Title: Demystifying Singular Defects in Large Language Models
Abstract:
Large transformer models are known to produce high-norm tokens. In vision transformers (ViTs), such tokens have been mathematically modeled through the singular vectors of the linear approximations of layers. However, in large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored, and their different properties from those of ViTs require a new analysis framework. In this paper, we provide both theoretical insights and empirical validation across a range of recent models, leading to the following observations: i) The layer-wise singular direction predicts the abrupt explosion of token norms in LLMs. ii) The negative eigenvalues of a layer explain its sudden decay. iii) The computational pathways leading to high-norm tokens differ between initial and noninitial tokens. iv) High-norm tokens are triggered by the right leading singular vector of the matrix approximating the corresponding modules. We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures. Our findings not only advance the understanding of singular defects in LLMs but also open new avenues for their application. We expect that this work will stimulate further research into the internal mechanisms of LLMs. Code is released at https://github.com/haoqiwang/singular_defect.
中文: 本研究从理论和实证角度揭示了大型语言模型中高范数令牌的成因机制,阐明了其与视觉Transformer的差异,并通过量化方案改进和模型签名设计展示了实际应用价值。
English: This study provides theoretical and empirical insights into the causes of high-norm tokens in large language models, revealing their distinct mechanisms from vision transformers and demonstrating practical applications in quantization improvement and model signature design.

Authors:Siddarth Venkatraman, Mohsin Hasan, Minsu Kim, Luca Scimeca, Marcin Sendera, Yoshua Bengio, Glen Berseth, Nikolay Malkin
Title: Outsourced diffusion sampling: Efficient posterior inference in latent spaces of generative models
Abstract:
Any well-behaved generative model over a variable $\mathbf{x}$ can be expressed as a deterministic transformation of an exogenous ('outsourced') Gaussian noise variable $\mathbf{z}$: $\mathbf{x}=f_θ(\mathbf{z})$. In such a model (\eg, a VAE, GAN, or continuous-time flow-based model), sampling of the target variable $\mathbf{x} \sim p_θ(\mathbf{x})$ is straightforward, but sampling from a posterior distribution of the form $p(\mathbf{x}\mid\mathbf{y}) \propto p_θ(\mathbf{x})r(\mathbf{x},\mathbf{y})$, where $r$ is a constraint function depending on an auxiliary variable $\mathbf{y}$, is generally intractable. We propose to amortize the cost of sampling from such posterior distributions with diffusion models that sample a distribution in the noise space ($\mathbf{z}$). These diffusion samplers are trained by reinforcement learning algorithms to enforce that the transformed samples $f_θ(\mathbf{z})$ are distributed according to the posterior in the data space ($\mathbf{x}$). For many models and constraints, the posterior in noise space is smoother than in data space, making it more suitable for amortized inference. Our method enables conditional sampling under unconditional GAN, (H)VAE, and flow-based priors, comparing favorably with other inference methods. We demonstrate the proposed outsourced diffusion sampling in several experiments with large pretrained prior models: conditional image generation, reinforcement learning with human feedback, and protein structure generation.
中文: 本研究提出了一种基于扩散的方法,通过强化学习在噪声空间中训练采样器,将平滑化的后验分布转换至数据空间,从而实现对复杂生成模型的高效条件采样。
English: The study introduces a diffusion-based method that uses reinforcement learning to train samplers in the noise space, enabling efficient conditional sampling from complex generative models by transforming smoothed posterior distributions into the data space.

Authors:Behzad Hejrati, Soumyanil Banerjee, Carri Glide-Hurst, Ming Dong
Title: Conditional diffusion model with spatial attention and latent embedding for medical image segmentation
Abstract:
Diffusion models have been used extensively for high quality image and video generation tasks. In this paper, we propose a novel conditional diffusion model with spatial attention and latent embedding (cDAL) for medical image segmentation. In cDAL, a convolutional neural network (CNN) based discriminator is used at every time-step of the diffusion process to distinguish between the generated labels and the real ones. A spatial attention map is computed based on the features learned by the discriminator to help cDAL generate more accurate segmentation of discriminative regions in an input image. Additionally, we incorporated a random latent embedding into each layer of our model to significantly reduce the number of training and sampling time-steps, thereby making it much faster than other diffusion models for image segmentation. We applied cDAL on 3 publicly available medical image segmentation datasets (MoNuSeg, Chest X-ray and Hippocampus) and observed significant qualitative and quantitative improvements with higher Dice scores and mIoU over the state-of-the-art algorithms. The source code is publicly available at https://github.com/Hejrati/cDAL/.
中文: 本文提出了一种结合空间注意力和潜在嵌入的条件扩散模型cDAL,通过提升判别区域分割精度和大幅减少训练步骤,在三个公开医学图像数据集上实现了超越现有最佳算法的分割性能。
English: This paper introduces cDAL, a conditional diffusion model with spatial attention and latent embedding that enhances medical image segmentation by improving accuracy and reducing computational time, achieving superior performance on three public datasets.

Authors:Arghadip Das, Arnab Raha, Shamik Kundu, Soumendu Kumar Ghosh, Deepak Mathaikutty, Vijay Raghunathan
Title: XAMBA: Enabling Efficient State Space Models on Resource-Constrained Neural Processing Units
Abstract:
State-Space Models (SSMs) have emerged as efficient alternatives to transformers for sequential data tasks, offering linear or near-linear scalability with sequence length, making them ideal for long-sequence applications in NLP, vision, and edge AI, including real-time transcription, translation, and contextual search. These applications require lightweight, high-performance models for deployment on resource-constrained devices like laptops and PCs. Designing specialized accelerators for every emerging neural network is costly and impractical; instead, optimizing models for existing NPUs in AI PCs provides a scalable solution. To this end, we propose XAMBA, the first framework to enable and optimize SSMs on commercial off-the-shelf (COTS) state-of-the-art (SOTA) NPUs. XAMBA follows a three-step methodology: (1) enabling SSMs on NPUs, (2) optimizing performance to meet KPI requirements, and (3) trading accuracy for additional performance gains. After enabling SSMs on NPUs, XAMBA mitigates key bottlenecks using CumBA and ReduBA, replacing sequential CumSum and ReduceSum operations with matrix-based computations, significantly improving execution speed and memory efficiency. Additionally, ActiBA enhances performance by approximating expensive activation functions (e.g., Swish, Softplus) using piecewise linear mappings, reducing latency with minimal accuracy loss. Evaluations on an Intel Core Ultra Series 2 AI PC show that XAMBA achieves up to 4.8X speed-up over the baseline. Our implementation is available at https://github.com/arghadippurdue/XAMBA.
Chinese: XAMBA是首个在商用NPU上实现并优化状态空间模型的框架,通过解决计算瓶颈和近似激活函数,在AI PC上实现了高达4.8倍的加速。
English: XAMBA is the first framework that enables and optimizes State-Space Models (SSMs) on commercial NPUs by addressing computational bottlenecks and approximating activation functions, achieving up to 4.8X speed-up on AI PCs.

Authors:Songtao Huang, Zhen Zhao, Can Li, Lei Bai
Title: TimeKAN: KAN-based Frequency Decomposition Learning Architecture for Long-term Time Series Forecasting
Abstract:
Real-world time series often have multiple frequency components that are intertwined with each other, making accurate time series forecasting challenging. Decomposing the mixed frequency components into multiple single frequency components is a natural choice. However, the information density of patterns varies across different frequencies, and employing a uniform modeling approach for different frequency components can lead to inaccurate characterization. To address this challenges, inspired by the flexibility of the recent Kolmogorov-Arnold Network (KAN), we propose a KAN-based Frequency Decomposition Learning architecture (TimeKAN) to address the complex forecasting challenges caused by multiple frequency mixtures. Specifically, TimeKAN mainly consists of three components: Cascaded Frequency Decomposition (CFD) blocks, Multi-order KAN Representation Learning (M-KAN) blocks and Frequency Mixing blocks. CFD blocks adopt a bottom-up cascading approach to obtain series representations for each frequency band. Benefiting from the high flexibility of KAN, we design a novel M-KAN block to learn and represent specific temporal patterns within each frequency band. Finally, Frequency Mixing blocks is used to recombine the frequency bands into the original format. Extensive experimental results across multiple real-world time series datasets demonstrate that TimeKAN achieves state-of-the-art performance as an extremely lightweight architecture. Code is available at https://github.com/huangst21/TimeKAN.
中文摘要:TimeKAN是一种基于Kolmogorov-Arnold网络的轻量级架构,通过分解多频时序成分并学习其差异化模式,在多个真实数据集上实现了最优的预测性能。
English Summary: TimeKAN is a lightweight forecasting architecture that uses Kolmogorov-Arnold Networks to decompose mixed-frequency time series components and model their distinct patterns, achieving state-of-the-art performance across multiple real-world datasets.

Authors:Sina Tayebati, Divake Kumar, Nastaran Darabi, Dinithi Jayasuriya, Ranganath Krishnan, Amit Ranjan Trivedi
Title: Learning Conformal Abstention Policies for Adaptive Risk Management in Large Language and Vision-Language Models
Abstract:
Large Language and Vision-Language Models (LLMs/VLMs) are increasingly used in safety-critical applications, yet their opaque decision-making complicates risk assessment and reliability. Uncertainty quantification (UQ) helps assess prediction confidence and enables abstention when uncertainty is high. Conformal prediction (CP), a leading UQ method, provides statistical guarantees but relies on static thresholds, which fail to adapt to task complexity and evolving data distributions, leading to suboptimal trade-offs in accuracy, coverage, and informativeness. To address this, we propose learnable conformal abstention, integrating reinforcement learning (RL) with CP to optimize abstention thresholds dynamically. By treating CP thresholds as adaptive actions, our approach balances multiple objectives, minimizing prediction set size while maintaining reliable coverage. Extensive evaluations across diverse LLM/VLM benchmarks show our method outperforms Least Ambiguous Classifiers (LAC) and Adaptive Prediction Sets (APS), improving accuracy by up to 3.2%, boosting AUROC for hallucination detection by 22.19%, enhancing uncertainty-guided selective generation (AUARC) by 21.17%, and reducing calibration error by 70%-85%. These improvements hold across multiple models and datasets while consistently meeting the 90% coverage target, establishing our approach as a more effective and flexible solution for reliable decision-making in safety-critical applications. The code is available at: {https://github.com/sinatayebati/vlm-uncertainty}.
Chinese: 本文提出可学习的不确定性弃权方法,通过强化学习动态优化共形预测的弃权阈值,显著提升了大语言与视觉语言模型的不确定性量化性能,在多个基准测试中实现了精度提升、幻觉检测改进和校准误差降低,同时保持可靠的覆盖范围。
English: This paper introduces learnable conformal abstention, a reinforcement learning-based method that dynamically optimizes abstention thresholds in conformal prediction to improve uncertainty quantification for large language and vision-language models, achieving significant performance gains across multiple benchmarks while maintaining reliable coverage.

Authors:Wen Zhou, Shuichiro Miwa, Yang Liu, Koji Okamoto
Title: BF-GAN: Development of an AI-driven Bubbly Flow Image Generation Model Using Generative Adversarial Networks
Abstract:
A generative AI architecture called bubbly flow generative adversarial networks (BF-GAN) is developed, designed to generate realistic and high-quality bubbly flow images through physically conditioned inputs, jg and jf. Initially, 52 sets of bubbly flow experiments under varying conditions are conducted to collect 140,000 bubbly flow images with physical labels of jg and jf for training data. A multi-scale loss function is then developed, incorporating mismatch loss and pixel loss to enhance the generative performance of BF-GAN further. Regarding evaluative metrics of generative AI, the BF-GAN has surpassed conventional GAN. Physically, key parameters of bubbly flow generated by BF-GAN are extracted and compared with measurement values and empirical correlations, validating BF-GAN's generative performance. The comparative analysis demonstrate that the BF-GAN can generate realistic and high-quality bubbly flow images with any given jg and jf within the research scope. BF-GAN offers a generative AI solution for two-phase flow research, substantially lowering the time and cost required to obtain high-quality data. In addition, it can function as a benchmark dataset generator for bubbly flow detection and segmentation algorithms, enhancing overall productivity in this research domain. The BF-GAN model is available online (https://github.com/zhouzhouwen/BF-GAN).
中文: BF-GAN是一种生成对抗网络架构,通过物理条件输入生成高质量气泡流图像,其性能超越传统GAN并验证了关键参数,显著降低了研究时间和成本。
English: The BF-GAN is a generative AI architecture that produces realistic bubbly flow images using physical inputs, outperforming traditional GANs and validating key parameters against experimental data while reducing research time and costs.

Authors:Finnian Westenfelder, Erik Hemberg, Miguel Tulla, Stephen Moskal, Una-May O'Reilly, Silviu Chiricescu
Title: LLM-Supported Natural Language to Bash Translation
Abstract:
The Bourne-Again Shell (Bash) command-line interface for Linux systems has complex syntax and requires extensive specialized knowledge. Using the natural language to Bash command (NL2SH) translation capabilities of large language models (LLMs) for command composition circumvents these issues. However, the NL2SH performance of LLMs is difficult to assess due to inaccurate test data and unreliable heuristics for determining the functional equivalence of Bash commands. We present a manually verified test dataset of 600 instruction-command pairs and a training dataset of 40,939 pairs, increasing the size of previous datasets by 441% and 135%, respectively. Further, we present a novel functional equivalence heuristic that combines command execution with LLM evaluation of command outputs. Our heuristic can determine the functional equivalence of two Bash commands with 95% confidence, a 16% increase over previous heuristics. Evaluation of popular LLMs using our test dataset and heuristic demonstrates that parsing, in-context learning, in-weight learning, and constrained decoding can improve NL2SH accuracy by up to 32%. Our findings emphasize the importance of dataset quality, execution-based evaluation and translation method for advancing NL2SH translation. Our code is available at https://github.com/westenfelder/NL2SH
Large language models can translate natural language to Bash commands, but their performance is hard to evaluate due to poor test data and unreliable equivalence checks; this study introduces verified datasets and a new evaluation method that improves assessment confidence by 16% and translation accuracy by up to 32%.
English Summary:

Authors:Jinyu Xiang, Jiayi Zhang, Zhaoyang Yu, Xinbing Liang, Fengwei Teng, Jinhao Tu, Fashen Ren, Xiangru Tang, Sirui Hong, Chenglin Wu, Yuyu Luo
Title: Self-Supervised Prompt Optimization
Abstract:
Well-designed prompts are crucial for enhancing Large language models' (LLMs) reasoning capabilities while aligning their outputs with task requirements across diverse domains. However, manually designed prompts require expertise and iterative experimentation. While existing prompt optimization methods aim to automate this process, they rely heavily on external references such as ground truth or by humans, limiting their applicability in real-world scenarios where such data is unavailable or costly to obtain. To address this, we propose Self-Supervised Prompt Optimization (SPO), a cost-efficient framework that discovers effective prompts for both closed and open-ended tasks without requiring external reference. Motivated by the observations that prompt quality manifests directly in LLM outputs and LLMs can effectively assess adherence to task requirements, we derive evaluation and optimization signals purely from output comparisons. Specifically, SPO selects superior prompts through pairwise output comparisons evaluated by an LLM evaluator, followed by an LLM optimizer that aligns outputs with task requirements. Extensive experiments demonstrate that SPO outperforms state-of-the-art prompt optimization methods, achieving comparable or superior results with significantly lower costs (e.g., 1.1% to 5.6% of existing methods) and fewer samples (e.g., three samples). The code is available at https://github.com/FoundationAgents/SPO.
Chinese Summary: 本文提出自监督提示优化框架(SPO),通过利用纯输出比较而无需外部参考,自主发现适用于封闭式和开放式任务的有效提示,以极低成本实现了优于现有方法的性能。
English Summary: The paper introduces Self-Supervised Prompt Optimization (SPO), a cost-effective framework that autonomously discovers effective prompts for both closed and open-ended tasks by leveraging pairwise output comparisons without requiring external references, achieving superior performance at significantly reduced costs.

Authors:Muhammed Öz, Nicholas Kiefer, Charlotte Debus, Jasmin Hörter, Achim Streit, Markus Götz
Title: Model Fusion via Neuron Transplantation
Abstract:
Ensemble learning is a widespread technique to improve the prediction performance of neural networks. However, it comes at the price of increased memory and inference time. In this work we propose a novel model fusion technique called \emph{Neuron Transplantation (NT)} in which we fuse an ensemble of models by transplanting important neurons from all ensemble members into the vacant space obtained by pruning insignificant neurons. An initial loss in performance post-transplantation can be quickly recovered via fine-tuning, consistently outperforming individual ensemble members of the same model capacity and architecture. Furthermore, NT enables all the ensemble members to be jointly pruned and jointly trained in a combined model. Comparing it to alignment-based averaging (like Optimal-Transport-fusion), it requires less fine-tuning than the corresponding OT-fused model, the fusion itself is faster and requires less memory, while the resulting model performance is comparable or better. The code is available under the following link: https://github.com/masterbaer/neuron-transplantation.
Chinese: 本研究提出了一种名为“神经元移植”的模型融合技术,通过将集成模型中重要神经元移植到修剪后的空缺位置,在减少内存占用和加速推理的同时,获得了与传统方法相当或更优的性能。
English: The study introduces Neuron Transplantation, a model fusion technique that integrates key neurons from ensemble members into a pruned model, achieving comparable or superior performance with less memory and faster inference than traditional methods.

Authors:Xu Zhang, Kaidi Xu, Ziqing Hu, Ren Wang
Title: Optimizing Robustness and Accuracy in Mixture of Experts: A Dual-Model Approach
Abstract:
Mixture of Experts (MoE) have shown remarkable success in leveraging specialized expert networks for complex machine learning tasks. However, their susceptibility to adversarial attacks presents a critical challenge for deployment in robust applications. This paper addresses the critical question of how to incorporate robustness into MoEs while maintaining high natural accuracy. We begin by analyzing the vulnerability of MoE components, finding that expert networks are notably more susceptible to adversarial attacks than the router. Based on this insight, we propose a targeted robust training technique that integrates a novel loss function to enhance the adversarial robustness of MoE, requiring only the robustification of one additional expert without compromising training or inference efficiency. Building on this, we introduce a dual-model strategy that linearly combines a standard MoE model with our robustified MoE model using a smoothing parameter. This approach allows for flexible control over the robustness-accuracy trade-off. We further provide theoretical foundations by deriving certified robustness bounds for both the single MoE and the dual-model. To push the boundaries of robustness and accuracy, we propose a novel joint training strategy JTDMoE for the dual-model. This joint training enhances both robustness and accuracy beyond what is achievable with separate models. Experimental results on CIFAR-10 and TinyImageNet datasets using ResNet18 and Vision Transformer (ViT) architectures demonstrate the effectiveness of our proposed methods. The code is publicly available at https://github.com/TIML-Group/Robust-MoE-Dual-Model.
Chinese: 本文针对专家混合模型提出了一种鲁棒训练技术和双模型策略,在保持高精度的同时增强了对抗鲁棒性,并在CIFAR-10和TinyImageNet数据集上进行了实验验证。
English: This paper introduces a robust training technique and a dual-model strategy for Mixture of Experts (MoE) to enhance adversarial robustness while maintaining high accuracy, with experimental validation on CIFAR-10 and TinyImageNet datasets.

Authors:Xingye Chen, Wei Feng, Zhenbang Du, Weizhen Wang, Yanyin Chen, Haohan Wang, Linkai Liu, Yaoyu Li, Jinyuan Zhao, Yu Li, Zheng Zhang, Jingjing Lv, Junjie Shen, Zhangang Lin, Jingping Shao, Yuanjie Shao, Xinge You, Changxin Gao, Nong Sang
Title: CTR-Driven Advertising Image Generation with Multimodal Large Language Models
Abstract:
In web data, advertising images are crucial for capturing user attention and improving advertising effectiveness. Most existing methods generate background for products primarily focus on the aesthetic quality, which may fail to achieve satisfactory online performance. To address this limitation, we explore the use of Multimodal Large Language Models (MLLMs) for generating advertising images by optimizing for Click-Through Rate (CTR) as the primary objective. Firstly, we build targeted pre-training tasks, and leverage a large-scale e-commerce multimodal dataset to equip MLLMs with initial capabilities for advertising image generation tasks. To further improve the CTR of generated images, we propose a novel reward model to fine-tune pre-trained MLLMs through Reinforcement Learning (RL), which can jointly utilize multimodal features and accurately reflect user click preferences. Meanwhile, a product-centric preference optimization strategy is developed to ensure that the generated background content aligns with the product characteristics after fine-tuning, enhancing the overall relevance and effectiveness of the advertising images. Extensive experiments have demonstrated that our method achieves state-of-the-art performance in both online and offline metrics. Our code and pre-trained models are publicly available at: https://github.com/Chenguoz/CAIG.
中文: 本研究提出了一种利用多模态大语言模型生成广告图像的新方法,通过针对性预训练、结合奖励模型的强化学习以及以产品为中心的优化策略,显著提升了点击率,在线上线下指标中均实现了最优性能。
English: This study introduces a novel approach using Multimodal Large Language Models (MLLMs) to generate advertising images optimized for Click-Through Rate (CTR), employing targeted pre-training, reinforcement learning with a reward model, and product-centric optimization to achieve state-of-the-art performance in both online and offline metrics.

Authors:Peng Huang, Shu Hu, Bo Peng, Xun Gong, Penghang Yin, Hongtu Zhu, Xi Wu, Xin Wang
Title: Diffusion-empowered AutoPrompt MedSAM
Abstract:
MedSAM, a medical foundation model derived from the SAM architecture, has demonstrated notable success across diverse medical domains. However, its clinical application faces two major challenges: the dependency on labor-intensive manual prompt generation, which imposes a significant burden on clinicians, and the absence of semantic labeling in the generated segmentation masks for organs or lesions, limiting its practicality for non-expert users. To address these limitations, we propose AutoMedSAM, an end-to-end framework derived from SAM, designed to enhance usability and segmentation performance. AutoMedSAM retains MedSAM's image encoder and mask decoder structure while introducing a novel diffusion-based class prompt encoder. The diffusion-based encoder employs a dual-decoder structure to collaboratively generate prompt embeddings guided by sparse and dense prompt definitions. These embeddings enhance the model's ability to understand and process clinical imagery autonomously. With this encoder, AutoMedSAM leverages class prompts to embed semantic information into the model's predictions, transforming MedSAM's semi-automated pipeline into a fully automated workflow. Furthermore, AutoMedSAM employs an uncertainty-aware joint optimization strategy during training to effectively inherit MedSAM's pre-trained knowledge while improving generalization by integrating multiple loss functions. Experimental results across diverse datasets demonstrate that AutoMedSAM achieves superior performance while broadening its applicability to both clinical settings and non-expert users. Code is available at https://github.com/HP-ML/AutoPromptMedSAM.git.
中文摘要:AutoMedSAM是一种自动化框架,通过消除手动提示并为分割掩码添加语义标签,增强了MedSAM的性能和可用性,使其更适用于临床场景和非专业用户。
English Summary: AutoMedSAM is an automated framework that enhances MedSAM by eliminating manual prompts and adding semantic labels to segmentation masks, improving both performance and usability for clinical and non-expert applications.

Authors:Hui Shen, Jingxuan Zhang, Boning Xiong, Rui Hu, Shoufa Chen, Zhongwei Wan, Xin Wang, Yu Zhang, Zixuan Gong, Guangyin Bao, Chaofan Tao, Yongfeng Huang, Ye Yuan, Mi Zhang
Title: Efficient Diffusion Models: A Survey
Abstract:
Diffusion models have emerged as powerful generative models capable of producing high-quality contents such as images, videos, and audio, demonstrating their potential to revolutionize digital content creation. However, these capabilities come at the cost of their significant computational resources and lengthy generation time, underscoring the critical need to develop efficient techniques for practical deployment. In this survey, we provide a systematic and comprehensive review of research on efficient diffusion models. We organize the literature in a taxonomy consisting of three main categories, covering distinct yet interconnected efficient diffusion model topics from algorithm-level, system-level, and framework perspective, respectively. We have also created a GitHub repository where we organize the papers featured in this survey at https://github.com/AIoT-MLSys-Lab/Efficient-Diffusion-Model-Survey. We hope our survey can serve as a valuable resource to help researchers and practitioners gain a systematic understanding of efficient diffusion model research and inspire them to contribute to this important and exciting field.
Chinese: 扩散模型作为能生成高质量内容的强大工具,虽计算资源需求高且生成耗时,但本综述系统梳理了从算法、系统和框架角度提升其效率的研究,旨在推动实际应用并为该领域提供参考。
English: Diffusion models are powerful generative tools for creating high-quality content but require significant computational resources, prompting a systematic survey to review efficient techniques from algorithm, system, and framework perspectives to aid practical deployment.

Authors:Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, Xinlong Wang
Title: EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Abstract:
Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential for unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) Properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities. (ii) A well-designed training strategy enables effective optimization for encoder-free VLMs. Through extensive evaluation, our EVEv2.0 represents a thorough study for developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability. Code is publicly available at: https://github.com/baaivision/EVE.
中文摘要:无编码器的视觉语言模型EVEv2.0通过分层模态关联和优化训练策略,在数据效率和视觉推理能力上表现出色,正快速缩小与基于编码器模型的性能差距。
English Summary: Encoder-free vision-language models like EVEv2.0 are closing performance gaps with encoder-based models through hierarchical modality association and optimized training strategies, demonstrating superior data efficiency and vision-reasoning capabilities.

Authors:Tianlang Chen, Charilaos Kanatsoulis, Jure Leskovec
Title: RelGNN: Composite Message Passing for Relational Deep Learning
Abstract:
Predictive tasks on relational databases are critical in real-world applications spanning e-commerce, healthcare, and social media. To address these tasks effectively, Relational Deep Learning (RDL) encodes relational data as graphs, enabling Graph Neural Networks (GNNs) to exploit relational structures for improved predictions. However, existing RDL methods often overlook the intrinsic structural properties of the graphs built from relational databases, leading to modeling inefficiencies, particularly in handling many-to-many relationships. Here we introduce RelGNN, a novel GNN framework specifically designed to leverage the unique structural characteristics of the graphs built from relational databases. At the core of our approach is the introduction of atomic routes, which are simple paths that enable direct single-hop interactions between the source and destination nodes. Building upon these atomic routes, RelGNN designs new composite message passing and graph attention mechanisms that reduce redundancy, highlight key signals, and enhance predictive accuracy. RelGNN is evaluated on 30 diverse real-world tasks from Relbench (Fey et al., 2024), and achieves state-of-the-art performance on the vast majority of tasks, with improvements of up to 25%. Code is available at https://github.com/snap-stanford/RelGNN.
中文: RelGNN通过引入原子路径和复合消息传递机制,优化了图神经网络在关系数据库中的结构利用,在多数实际任务中实现了最高性能,提升幅度高达25%。
English: RelGNN introduces atomic routes and composite message passing to enhance GNNs for relational databases, achieving state-of-the-art performance with up to 25% improvement on real-world tasks.

Authors:Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, Kai Chen
Title: Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
Abstract:
Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. However, the complete technical details remain unrevealed, and the techniques that are believed certainly to be adopted are only reinforcement learning (RL) and the long chain of thoughts. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through \textbf{O}utcome \textbf{RE}w\textbf{A}rd-based reinforcement \textbf{L}earning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure the gradient consistency between positive and negative samples. To alleviate the long-existing difficulties brought by sparse rewards in RL, which are even exacerbated by the partial correctness of the long chain of thought for reasoning tasks, we further apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model can obtain 94.0 pass@1 accuracy on MATH-500 through RL, being on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation with 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of initial policy models and training queries for RL. Code, models, and data will be released to benefit future research\footnote{https://github.com/InternLM/OREAL}.
中文:本文提出了OREAL这一新型强化学习框架,利用二元结果奖励显著提升语言模型的数学推理能力,使小规模模型首次达到与大型模型相媲美的准确率。
English: This paper introduces OREAL, a novel reinforcement learning framework that uses binary outcome rewards to significantly enhance mathematical reasoning in language models, achieving state-of-the-art accuracy with smaller model sizes.

Authors:Yue Zhu, Haiwen Diao, Shang Gao, Long Chen, Huchuan Lu
Title: KARST: Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission for Visual Classification
Abstract:
Fine-tuning pre-trained vision models for specific tasks is a common practice in computer vision. However, this process becomes more expensive as models grow larger. Recently, parameter-efficient fine-tuning (PEFT) methods have emerged as a popular solution to improve training efficiency and reduce storage needs by tuning additional low-rank modules within pre-trained backbones. Despite their advantages, they struggle with limited representation capabilities and misalignment with pre-trained intermediate features. To address these issues, we introduce an innovative Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission (KARST) for various recognition tasks. Specifically, its multi-kernel design extends Kronecker projections horizontally and separates adaptation matrices into multiple complementary spaces, reducing parameter dependency and creating more compact subspaces. Besides, it incorporates extra learnable re-scaling factors to better align with pre-trained feature distributions, allowing for more flexible and balanced feature aggregation. Extensive experiments validate that our KARST outperforms other PEFT counterparts with a negligible inference cost due to its re-parameterization characteristics. Code is publicly available at: https://github.com/Lucenova/KARST.
Chinese: 提出的多核Kronecker适应与重缩放传输(KARST)方法通过扩展投影空间和特征对齐,以极低的推理成本实现了优于同类参数高效微调方法的性能。
English: The proposed Multi-Kernel Kronecker Adaptation with Re-Scaling Transmission (KARST) enhances parameter-efficient fine-tuning by expanding projection spaces and aligning features, achieving superior performance with minimal inference cost.

Authors:Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang
Title: ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
Abstract:
We present that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform the mathematical reasoning capabilities of powerful LLMs like OpenAI o1-preview and DeepSeek V3. We train our ReasonFlux-32B model with only 8 GPUs and introduces three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or relevant reasoning problems; (ii) performing hierarchical reinforcement learning on a sequence of thought templates instead of long CoTs, optimizing a base LLM to plan out an optimal template trajectory for gradually handling complex problems; (iii) a brand new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. With a template trajectory containing more explainable reasoning structures than DeepSeek-R1 and o3-mini, our ReasonFlux-32B significantly advances math reasoning capabilities to state-of-the-art levels. Notably, on the MATH benchmark, it achieves an accuracy of 91.2% and surpasses o1-preview by 6.7%. On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code: https://github.com/Gen-Verse/ReasonFlux
中文: ReasonFlux-32B模型通过分层推理和可扩展思维模板,在MATH基准测试中达到91.2%准确率,在AIME中解题率达56.7%,显著超越了OpenAI o1-preview和DeepSeek V3等先进模型。
English: The ReasonFlux-32B model introduces hierarchical reasoning with scalable thought templates, achieving state-of-the-art math performance by surpassing leading models like OpenAI o1-preview and DeepSeek V3 on benchmarks including MATH (91.2% accuracy) and AIME (56.7% problem-solving rate).

Authors:Yuqi Lin, Hengjia Li, Wenqi Shao, Zheng Yang, Jun Zhao, Xiaofei He, Ping Luo, Kaipeng Zhang
Title: SAMRefiner: Taming Segment Anything Model for Universal Mask Refinement
Abstract:
In this paper, we explore a principal way to enhance the quality of widely pre-existing coarse masks, enabling them to serve as reliable training data for segmentation models to reduce the annotation cost. In contrast to prior refinement techniques that are tailored to specific models or tasks in a close-world manner, we propose SAMRefiner, a universal and efficient approach by adapting SAM to the mask refinement task. The core technique of our model is the noise-tolerant prompting scheme. Specifically, we introduce a multi-prompt excavation strategy to mine diverse input prompts for SAM (i.e., distance-guided points, context-aware elastic bounding boxes, and Gaussian-style masks) from initial coarse masks. These prompts can collaborate with each other to mitigate the effect of defects in coarse masks. In particular, considering the difficulty of SAM to handle the multi-object case in semantic segmentation, we introduce a split-then-merge (STM) pipeline. Additionally, we extend our method to SAMRefiner++ by introducing an additional IoU adaption step to further boost the performance of the generic SAMRefiner on the target dataset. This step is self-boosted and requires no additional annotation. The proposed framework is versatile and can flexibly cooperate with existing segmentation methods. We evaluate our mask framework on a wide range of benchmarks under different settings, demonstrating better accuracy and efficiency. SAMRefiner holds significant potential to expedite the evolution of refinement tools. Our code is available at https://github.com/linyq2117/SAMRefiner.
本文提出SAMRefiner,一种通用高效的方法,通过适配SAM模型进行掩码优化,利用抗噪提示策略将粗糙掩码转化为可靠的分割训练数据,从而降低标注成本。
This paper introduces SAMRefiner, a universal and efficient method that adapts the Segment Anything Model (SAM) to refine coarse masks into reliable training data for segmentation models, reducing annotation costs through noise-tolerant prompting and a split-then-merge pipeline.

Authors:Daouda Sow, Herbert Woisetschläger, Saikiran Bulusu, Shiqiang Wang, Hans-Arno Jacobsen, Yingbin Liang
Title: Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining
Abstract:
Pretraining large language models (LLMs) on vast and heterogeneous datasets is crucial for achieving state-of-the-art performance across diverse downstream tasks. However, current training paradigms treat all samples equally, overlooking the importance or relevance of individual samples throughout the training process. Existing reweighting strategies, which primarily focus on group-level data importance, fail to leverage fine-grained instance-level information and do not adapt dynamically to individual sample importance as training progresses. In this paper, we introduce novel algorithms for dynamic, instance-level data reweighting aimed at improving both the efficiency and effectiveness of LLM pretraining. Our methods adjust the weight of each training sample based on its loss value in an online fashion, allowing the model to dynamically focus on more informative or important samples at the current training stage. In particular, our framework allows us to systematically devise reweighting strategies deprioritizing redundant or uninformative data, which we find tend to work best. Furthermore, we develop a new theoretical framework for analyzing the impact of loss-based reweighting on the convergence of gradient-based optimization, providing the first formal characterization of how these strategies affect convergence bounds. We empirically validate our approach across a spectrum of tasks, from pretraining 7B and 1.4B parameter LLMs to smaller-scale language models and linear regression problems, demonstrating that our loss-based reweighting approach can lead to faster convergence and significantly improved performance.
中文: 本文提出基于损失值的动态实例级数据重加权算法,通过在线调整样本权重使模型聚焦于关键训练数据,理论分析和多任务实验表明该方法能加速收敛并显著提升性能。
English: This paper introduces dynamic, instance-level data reweighting algorithms that adjust sample weights based on loss values to enhance LLM pretraining efficiency and effectiveness, supported by theoretical analysis and empirical validation across various tasks.

Authors:Xingjian Diao, Chunhui Zhang, Tingxuan Wu, Ming Cheng, Zhongyu Ouyang, Weiyi Wu, Jiang Gui
Title: Learning Musical Representations for Music Performance Question Answering
Abstract:
Music performances are representative scenarios for audio-visual modeling. Unlike common scenarios with sparse audio, music performances continuously involve dense audio signals throughout. While existing multimodal learning methods on the audio-video QA demonstrate impressive capabilities in general scenarios, they are incapable of dealing with fundamental problems within the music performances: they underexplore the interaction between the multimodal signals in performance and fail to consider the distinctive characteristics of instruments and music. Therefore, existing methods tend to answer questions regarding musical performances inaccurately. To bridge the above research gaps, (i) given the intricate multimodal interconnectivity inherent to music data, our primary backbone is designed to incorporate multimodal interactions within the context of music; (ii) to enable the model to learn music characteristics, we annotate and release rhythmic and music sources in the current music datasets; (iii) for time-aware audio-visual modeling, we align the model's music predictions with the temporal dimension. Our experiments show state-of-the-art effects on the Music AVQA datasets. Our code is available at https://github.com/xid32/Amuse.
中文摘要:本研究针对现有多模态学习方法在音乐表演分析中的不足,开发了一个增强视听交互建模、融合音乐特性并确保时间对齐的框架,在音乐视听问答数据集上取得了领先效果。
English Summary: This study addresses the limitations of existing multimodal learning methods in music performance analysis by developing a framework that enhances audio-visual interaction modeling, incorporates musical characteristics, and ensures temporal alignment, achieving state-of-the-art results on Music AVQA datasets.

Authors:Yifan Hu, Peiyuan Liu, Yuante Li, Dawei Cheng, Naiqi Li, Tao Dai, Jigang Bao, Shu-Tao Xia
Title: FinMamba: Market-Aware Graph Enhanced Multi-Level Mamba for Stock Movement Prediction
Abstract:
Recently, combining stock features with inter-stock correlations has become a common and effective approach for stock movement prediction. However, financial data presents significant challenges due to its low signal-to-noise ratio and the dynamic complexity of the market, which give rise to two key limitations in existing methods. First, the relationships between stocks are highly influenced by multifaceted factors including macroeconomic market dynamics, and current models fail to adaptively capture these evolving interactions under specific market conditions. Second, for the accuracy and timeliness required by real-world trading, existing financial data mining methods struggle to extract beneficial pattern-oriented dependencies from long historical data while maintaining high efficiency and low memory consumption. To address the limitations, we propose FinMamba, a Mamba-GNN-based framework for market-aware and multi-level hybrid stock movement prediction. Specifically, we devise a dynamic graph to learn the changing representations of inter-stock relationships by integrating a pruning module that adapts to market trends. Afterward, with a selective mechanism, the multi-level Mamba discards irrelevant information and resets states to skillfully recall historical patterns across multiple time scales with linear time costs, which are then jointly optimized for reliable prediction. Extensive experiments on U.S. and Chinese stock markets demonstrate the effectiveness of our proposed FinMamba, achieving state-of-the-art prediction accuracy and trading profitability, while maintaining low computational complexity. The code is available at https://github.com/TROUBADOUR000/FinMamba.
中文摘要:FinMamba框架通过市场感知的动态图自适应学习股票间动态关系,并利用选择性Mamba机制以线性计算成本高效提取多时间尺度历史模式,有效解决了现有股票预测方法的局限性。
English Summary: The FinMamba framework addresses limitations in stock prediction by adaptively capturing dynamic inter-stock relationships through market-aware graphs and efficiently extracting multi-scale historical patterns using selective Mamba mechanisms with linear computational costs.

Authors:Bessie Dominguez-Dager, Felix Escalona, Francisco Gomez-Donoso, Miguel Cazorla
Title: CHIRLA: Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis
Abstract:
Person re-identification (Re-ID) is a key challenge in computer vision, requiring the matching of individuals across cameras, locations, and time. While most research focuses on short-term scenarios with minimal appearance changes, real-world applications demand robust systems that handle long-term variations caused by clothing and physical changes. We present CHIRLA, Comprehensive High-resolution Identification and Re-identification for Large-scale Analysis, a novel dataset designed for video-based long-term person Re-ID. CHIRLA was recorded over seven months in four connected indoor environments using seven strategically placed cameras, capturing realistic movements with substantial clothing and appearance variability. The dataset includes 22 individuals, more than five hours of video, and about 1M bounding boxes with identity annotations obtained through semi-automatic labeling. We also define benchmark protocols for person tracking and Re-ID, covering diverse and challenging scenarios such as occlusion, reappearance, and multi-camera conditions. By introducing this comprehensive benchmark, we aim to facilitate the development and evaluation of Re-ID algorithms that can reliably perform in challenging, long-term real-world scenarios. The benchmark code is publicly available at: https://github.com/bdager/CHIRLA.
中文摘要:CHIRLA数据集通过提供七个月的多摄像头视频数据,包含显著的服装变化,解决了长期行人重识别难题,并建立了基准测试以提升实际应用中的Re-ID算法性能。
English Summary: The CHIRLA dataset addresses long-term person re-identification challenges by providing seven months of multi-camera video data with significant clothing variations, establishing benchmarks to improve Re-ID algorithms for real-world applications.

Authors:Xingrun Xing, Zheng Liu, Shitao Xiao, Boyan Gao, Yiming Liang, Wanpeng Zhang, Haokun Lin, Guoqi Li, Jiajun Zhang
Title: EfficientLLM: Scalable Pruning-Aware Pretraining for Architecture-Agnostic Edge Language Models
Abstract:
Modern large language models (LLMs) driven by scaling laws, achieve intelligence emergency in large model sizes. Recently, the increasing concerns about cloud costs, latency, and privacy make it an urgent requirement to develop compact edge language models. Distinguished from direct pretraining that bounded by the scaling law, this work proposes the pruning-aware pretraining, focusing on retaining performance of much larger optimized models. It features following characteristics: 1) Data-scalable: we introduce minimal parameter groups in LLM and continuously optimize structural pruning, extending post-training pruning methods like LLM-Pruner and SparseGPT into the pretraining phase. 2) Architecture-agnostic: the LLM architecture is auto-designed using saliency-driven pruning, which is the first time to exceed SoTA human-designed LLMs in modern pretraining. We reveal that it achieves top-quality edge language models, termed EfficientLLM, by scaling up LLM compression and extending its boundary. EfficientLLM significantly outperforms SoTA baselines with $100M \sim 1B$ parameters, such as MobileLLM, SmolLM, Qwen2.5-0.5B, OLMo-1B, Llama3.2-1B in common sense benchmarks. As the first attempt, EfficientLLM bridges the performance gap between traditional LLM compression and direct pretraining methods, and we will fully open source at https://github.com/Xingrun-Xing2/EfficientLLM.
Chinese: 本研究提出剪枝感知预训练方法,开发出高效边缘语言模型EfficientLLM,通过在预训练阶段融入结构化剪枝技术,显著超越现有最优基准模型,成功弥合了传统压缩方法与直接预训练之间的性能鸿沟。
English: This work introduces pruning-aware pretraining to develop EfficientLLM, a compact edge language model that surpasses state-of-the-art baselines by integrating structural pruning during pretraining, bridging the performance gap between traditional compression and direct pretraining methods.

Authors:Shihuan He, Zhihui Lai, Ruxin Wang, Heng Kong
Title: Prototype Contrastive Consistency Learning for Semi-Supervised Medical Image Segmentation
Abstract:
Medical image segmentation is a crucial task in medical image analysis, but it can be very challenging especially when there are less labeled data but with large unlabeled data. Contrastive learning has proven to be effective for medical image segmentation in semi-supervised learning by constructing contrastive samples from partial pixels. However, although previous contrastive learning methods can mine semantic information from partial pixels within images, they ignore the whole context information of unlabeled images, which is very important to precise segmentation. In order to solve this problem, we propose a novel prototype contrastive learning method called Prototype Contrastive Consistency Segmentation (PCCS) for semi-supervised medical image segmentation. The core idea is to enforce the prototypes of the same semantic class to be closer and push the prototypes in different semantic classes far away from each other. Specifically, we construct a signed distance map and an uncertainty map from unlabeled images. The signed distance map is used to construct prototypes for contrastive learning, and then we estimate the prototype uncertainty from the uncertainty map as trade-off among prototypes. In order to obtain better prototypes, based on the student-teacher architecture, a new mechanism named prototype updating prototype is designed to assist in updating the prototypes for contrastive learning. In addition, we propose an uncertainty-consistency loss to mine more reliable information from unlabeled data. Extensive experiments on medical image segmentation demonstrate that PCCS achieves better segmentation performance than the state-of-the-art methods. The code is available at https://github.com/comphsh/PCCS.
中文: 提出的原型对比一致性分割(PCCS)方法通过结合原型对比学习和不确定性一致性损失,在半监督医学图像分割中实现了优于现有技术的分割性能。
English: The proposed Prototype Contrastive Consistency Segmentation (PCCS) method enhances semi-supervised medical image segmentation by leveraging prototype contrastive learning with uncertainty-consistency loss, achieving superior performance over existing approaches.

Authors:Qingshui Gu, Shu Li, Tianyu Zheng, Zhaoxiang Zhang
Title: Steel-LLM:From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM
Abstract:
Steel-LLM is a Chinese-centric language model developed from scratch with the goal of creating a high-quality, open-source model despite limited computational resources. Launched in March 2024, the project aimed to train a 1-billion-parameter model on a large-scale dataset, prioritizing transparency and the sharing of practical insights to assist others in the community. The training process primarily focused on Chinese data, with a small proportion of English data included, addressing gaps in existing open-source LLMs by providing a more detailed and practical account of the model-building journey. Steel-LLM has demonstrated competitive performance on benchmarks such as CEVAL and CMMLU, outperforming early models from larger institutions. This paper provides a comprehensive summary of the project's key contributions, including data collection, model design, training methodologies, and the challenges encountered along the way, offering a valuable resource for researchers and practitioners looking to develop their own LLMs. The model checkpoints and training script are available at https://github.com/zhanshijinwat/Steel-LLM.
中文: Steel-LLM是2024年3月发布的中文优先开源语言模型,基于十亿参数规模并以中文数据为核心进行训练,在多项基准测试中表现优异,同时完整公开了模型开发过程与实践经验。
English: Steel-LLM is a Chinese-centric open-source language model developed from scratch in March 2024, featuring 1 billion parameters trained primarily on Chinese data with competitive benchmark performance and full transparency in its development process.

Authors:Jiachen Li, Xiaojin Gong
Title: Unleashing the Potential of Pre-Trained Diffusion Models for Generalizable Person Re-Identification
Abstract:
Domain-generalizable re-identification (DG Re-ID) aims to train a model on one or more source domains and evaluate its performance on unseen target domains, a task that has attracted growing attention due to its practical relevance. While numerous methods have been proposed, most rely on discriminative or contrastive learning frameworks to learn generalizable feature representations. However, these approaches often fail to mitigate shortcut learning, leading to suboptimal performance. In this work, we propose a novel method called diffusion model-assisted representation learning with a correlation-aware conditioning scheme (DCAC) to enhance DG Re-ID. Our method integrates a discriminative and contrastive Re-ID model with a pre-trained diffusion model through a correlation-aware conditioning scheme. By incorporating ID classification probabilities generated from the Re-ID model with a set of learnable ID-wise prompts, the conditioning scheme injects dark knowledge that captures ID correlations to guide the diffusion process. Simultaneously, feedback from the diffusion model is back-propagated through the conditioning scheme to the Re-ID model, effectively improving the generalization capability of Re-ID features. Extensive experiments on both single-source and multi-source DG Re-ID tasks demonstrate that our method achieves state-of-the-art performance. Comprehensive ablation studies further validate the effectiveness of the proposed approach, providing insights into its robustness. Codes will be available at https://github.com/RikoLi/DCAC.
Chinese: 本文提出DCAC方法,通过相关感知条件机制将判别式与对比式重识别模型同预训练扩散模型结合,有效缓解捷径学习问题并提升特征泛化能力,从而在领域泛化重识别任务中实现最优性能。
English: This paper introduces DCAC, a novel method that integrates discriminative and contrastive Re-ID models with a pre-trained diffusion model using a correlation-aware conditioning scheme to enhance domain-generalizable re-identification by mitigating shortcut learning and improving feature generalization.

Authors:Kamil Garifullin, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov
Title: MaterialFusion: High-Quality, Zero-Shot, and Controllable Material Transfer with Diffusion Models
Abstract:
Manipulating the material appearance of objects in images is critical for applications like augmented reality, virtual prototyping, and digital content creation. We present MaterialFusion, a novel framework for high-quality material transfer that allows users to adjust the degree of material application, achieving an optimal balance between new material properties and the object's original features. MaterialFusion seamlessly integrates the modified object into the scene by maintaining background consistency and mitigating boundary artifacts. To thoroughly evaluate our approach, we have compiled a dataset of real-world material transfer examples and conducted complex comparative analyses. Through comprehensive quantitative evaluations and user studies, we demonstrate that MaterialFusion significantly outperforms existing methods in terms of quality, user control, and background preservation. Code is available at https://github.com/ControlGenAI/MaterialFusion.
Chinese: MaterialFusion是一种高质量材质转换的新框架,允许用户调节材质应用程度,在保持物体特征和背景一致性的同时,显著优于现有方法的质量和可控性。
English: MaterialFusion is a novel framework for high-quality material transfer that enables adjustable material application while preserving object features and background consistency, outperforming existing methods in quality and user control.

Authors:Zhi Zhou, Kun-Yang Yu, Shi-Yu Tian, Xiao-Wen Yang, Jiang-Xin Shi, Pengxiao Song, Yi-Xuan Jin, Lan-Zhe Guo, Yu-Feng Li
Title: LawGPT: Knowledge-Guided Data Generation and Its Application to Legal LLM
Abstract:
Large language models (LLMs), both proprietary and open-source, have demonstrated remarkable capabilities across various natural language processing tasks. However, they face significant limitations in legal reasoning tasks. Proprietary models introduce data privacy risks and high inference costs, while open-source models underperform due to insufficient legal domain training data. To address these limitations, we study data generation for legal reasoning to improve the legal reasoning performance of open-source LLMs with the help of proprietary LLMs. This is challenging due to the lack of legal knowledge in proprietary LLMs and the difficulty in verifying the generated data. We propose KgDG, a knowledge-guided data generation framework for legal reasoning. Our framework enables leveraging legal knowledge to enhance generation diversity and introduces a refinement and verification process to ensure the quality of generated data. Moreover, we expand the generated dataset to further enhance the LLM reasoning capabilities. Using KgDG, we create a synthetic legal reasoning dataset containing 50K high-quality examples. Our trained model LawGPT outperforms existing legal-specific LLMs and achieves performance comparable to proprietary LLMs, demonstrating the effectiveness of KgDG and LawGPT. Our code and resources is publicly available at https://github.com/LAMDASZ-ML/Knowledge-Guide-Data-Generation .
中文摘要:本研究提出知识引导数据生成框架KgDG,通过生成高质量法律推理数据集提升开源大语言模型的性能,其训练的LawGPT模型在保持数据隐私和成本优势的同时,达到了与商业模型相当的法律推理能力。
English Summary: This study introduces KgDG, a knowledge-guided data generation framework that creates high-quality legal reasoning datasets to enhance open-source LLMs' performance, with the resulting LawGPT model matching proprietary LLMs' capabilities while addressing privacy and cost concerns.

Authors:Chengwen Qi, Ren Ma, Bowen Li, He Du, Binyuan Hui, Jinwang Wu, Yuanjun Laili, Conghui He
Title: Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation
Abstract:
First-order logic (FOL) reasoning, which involves sequential deduction, is pivotal for intelligent systems and serves as a valuable task for evaluating reasoning capabilities, particularly in chain-of-thought (CoT) contexts. Existing benchmarks often rely on extensive human annotation or handcrafted templates, making it difficult to achieve the necessary complexity, scalability, and diversity for robust evaluation. To address these limitations, we propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers, enabling the creation of a scalable, diverse, and high-quality FOL reasoning dataset, ProverQA. ProverQA is also distinguished by its inclusion of accessible and logically coherent intermediate reasoning steps for each problem. Our evaluation shows that state-of-the-art LLMs struggle to solve ProverQA problems, even with CoT prompting, highlighting the dataset's challenging nature. We also finetune Llama3.1-8B-Instruct on a separate training set generated by our framework. The finetuned model demonstrates consistent improvements on both in-distribution and out-of-distribution test sets, suggesting the value of our proposed data generation framework. Code available at: https://github.com/opendatalab/ProverGen
Chinese: ProverGen框架创新性地结合了大语言模型与符号证明器,构建出包含逻辑连贯中间推理步骤的ProverQA数据集,即使采用思维链提示,当前最先进的LLM仍难以解决其问题,凸显了该数据集的挑战性。
English: ProverGen is a novel framework that combines Large Language Models with symbolic provers to create ProverQA, a challenging FOL reasoning dataset with coherent intermediate steps, which state-of-the-art LLMs struggle to solve even with chain-of-thought prompting.

Authors:Yibo Wang, Congying Xia, Wenting Zhao, Jiangshu Du, Chunyu Miao, Zhongfen Deng, Philip S. Yu, Chen Xing
Title: ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms
Abstract:
Unit test generation has become a promising and important use case of LLMs. However, existing evaluation benchmarks for assessing LLM unit test generation capabilities focus on function- or class-level code rather than more practical and challenging project-level codebases. To address such limitation, we propose ProjectTest, a project-level benchmark for unit test generation covering Python, Java, and JavaScript. ProjectTest features 20 moderate-sized and high-quality projects per language. We evaluate nine frontier LLMs on ProjectTest and the results show that all frontier LLMs tested exhibit moderate performance on ProjectTest on Python and Java, highlighting the difficulty of ProjectTest. We also conduct a thorough error analysis, which shows that even frontier LLMs, such as Claude-3.5-Sonnet, have significant basic yet critical errors, including compilation and cascade errors. Motivated by this observation, we further evaluate all frontier LLMs under manual error-fixing and self-error-fixing scenarios to assess their potential when equipped with error-fixing mechanisms. Our code and dataset is available at \href{https://github.com/YiboWANG214/ProjectTest}{ProjectTest}.
中文: ProjectTest提出了一个针对Python、Java和JavaScript的项目级单元测试生成基准,发现即使先进的大语言模型也表现中等且存在关键错误,同时通过纠错机制探索了其改进潜力。
English: ProjectTest introduces a project-level benchmark for unit test generation across Python, Java, and JavaScript, revealing that even advanced LLMs struggle with moderate performance and critical errors, while also exploring their potential through error-fixing mechanisms.

Authors:Haokai Zhao, Haowei Lou, Lina Yao, Wei Peng, Ehsan Adeli, Kilian M Pohl, Yu Zhang
Title: Diffusion Models for Computational Neuroimaging: A Survey
Abstract:
Computational neuroimaging involves analyzing brain images or signals to provide mechanistic insights and predictive tools for human cognition and behavior. While diffusion models have shown stability and high-quality generation in natural images, there is increasing interest in adapting them to analyze brain data for various neurological tasks such as data enhancement, disease diagnosis and brain decoding. This survey provides an overview of recent efforts to integrate diffusion models into computational neuroimaging. We begin by introducing the common neuroimaging data modalities, follow with the diffusion formulations and conditioning mechanisms. Then we discuss how the variations of the denoising starting point, condition input and generation target of diffusion models are developed and enhance specific neuroimaging tasks. For a comprehensive overview of the ongoing research, we provide a publicly available repository at https://github.com/JoeZhao527/dm4neuro.
Chinese: 本综述探讨了如何将扩散模型整合到计算神经影像学中,通过调整其公式和条件机制,以改进数据增强、疾病诊断和脑解码等任务。
English: This survey explores the integration of diffusion models into computational neuroimaging to enhance tasks like data improvement, disease diagnosis, and brain decoding by adapting their formulations and conditioning mechanisms.

Authors:Soobin Um, Beomsu Kim, Jong Chul Ye
Title: Boost-and-Skip: A Simple Guidance-Free Diffusion for Minority Generation
Abstract:
Minority samples are underrepresented instances located in low-density regions of a data manifold, and are valuable in many generative AI applications, such as data augmentation, creative content generation, etc. Unfortunately, existing diffusion-based minority generators often rely on computationally expensive guidance dedicated for minority generation. To address this, here we present a simple yet powerful guidance-free approach called Boost-and-Skip for generating minority samples using diffusion models. The key advantage of our framework requires only two minimal changes to standard generative processes: (i) variance-boosted initialization and (ii) timestep skipping. We highlight that these seemingly-trivial modifications are supported by solid theoretical and empirical evidence, thereby effectively promoting emergence of underrepresented minority features. Our comprehensive experiments demonstrate that Boost-and-Skip greatly enhances the capability of generating minority samples, even rivaling guidance-based state-of-the-art approaches while requiring significantly fewer computations. Code is available at https://github.com/soobin-um/BnS.
中文摘要:本文提出的Boost-and-Skip方法通过方差增强初始化和时间步跳跃这两个简单修改,无需引导机制即可有效生成数据流形中的少数样本,在保持与先进引导方法相当性能的同时大幅降低了计算开销。
English Summary: The paper introduces Boost-and-Skip, a computationally efficient guidance-free method that enhances minority sample generation in diffusion models through variance-boosted initialization and timestep skipping, achieving performance comparable to state-of-the-art guidance-based approaches with significantly reduced computational costs.

Authors:Hongyu Qu, Jianan Wei, Xiangbo Shu, Wenguan Wang
Title: Learning Clustering-based Prototypes for Compositional Zero-shot Learning
Abstract:
Learning primitive (i.e., attribute and object) concepts from seen compositions is the primary challenge of Compositional Zero-Shot Learning (CZSL). Existing CZSL solutions typically rely on oversimplified data assumptions, e.g., modeling each primitive with a single centroid primitive representation, ignoring the natural diversities of the attribute (resp. object) when coupled with different objects (resp. attribute). In this work, we develop ClusPro, a robust clustering-based prototype mining framework for CZSL that defines the conceptual boundaries of primitives through a set of diversified prototypes. Specifically, ClusPro conducts within-primitive clustering on the embedding space for automatically discovering and dynamically updating prototypes. These representative prototypes are subsequently used to repaint a well-structured and independent primitive embedding space, ensuring intra-primitive separation and inter-primitive decorrelation through prototype-based contrastive learning and decorrelation learning. Moreover, ClusPro efficiently performs prototype clustering in a non-parametric fashion without the introduction of additional learnable parameters or computational budget during testing. Experiments on three benchmarks demonstrate ClusPro outperforms various top-leading CZSL solutions under both closed-world and open-world settings.
Chinese: ClusPro是一种基于聚类的原型挖掘框架,通过多样化原型定义原始概念并重构嵌入空间,在组合零样本学习中实现了优于现有方法的性能。
English: ClusPro is a robust clustering-based prototype mining framework for Compositional Zero-Shot Learning that discovers diversified prototypes to define primitive concepts and restructures the embedding space for improved performance.

Authors:Filip Ekström Kelvinius, Oskar B. Andersson, Abhijith S. Parackal, Dong Qian, Rickard Armiento, Fredrik Lindsten
Title: WyckoffDiff -- A Generative Diffusion Model for Crystal Symmetry
Abstract:
Crystalline materials often exhibit a high level of symmetry. However, most generative models do not account for symmetry, but rather model each atom without any constraints on its position or element. We propose a generative model, Wyckoff Diffusion (WyckoffDiff), which generates symmetry-based descriptions of crystals. This is enabled by considering a crystal structure representation that encodes all symmetry, and we design a novel neural network architecture which enables using this representation inside a discrete generative model framework. In addition to respecting symmetry by construction, the discrete nature of our model enables fast generation. We additionally present a new metric, Fréchet Wrenformer Distance, which captures the symmetry aspects of the materials generated, and we benchmark WyckoffDiff against recently proposed generative models for crystal generation. As a proof-of-concept study, we use WyckoffDiff to find new materials below the convex hull of thermodynamical stability.
Chinese: WyckoffDiff提出了一种基于对称性的晶体生成模型,通过离散框架确保结构对称性并实现快速生成,同时引入新的对称性评估指标,并验证了其在发现热力学稳定新材料方面的潜力。
English: WyckoffDiff introduces a symmetry-aware generative model for crystals that uses a discrete framework to ensure structural symmetry and enable rapid generation, while also proposing a new metric for evaluating symmetry in generated materials and demonstrating its potential in discovering thermodynamically stable compounds.

Authors:Vlad Hosu, Lorenzo Agnolucci, Daisuke Iso, Dietmar Saupe
Title: Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution
Abstract:
Image Quality Assessment (IQA) measures and predicts perceived image quality by human observers. Although recent studies have highlighted the critical influence that variations in the scale of an image have on its perceived quality, this relationship has not been systematically quantified. To bridge this gap, we introduce the Image Intrinsic Scale (IIS), defined as the largest scale where an image exhibits its highest perceived quality. We also present the Image Intrinsic Scale Assessment (IISA) task, which involves subjectively measuring and predicting the IIS based on human judgments. We develop a subjective annotation methodology and create the IISA-DB dataset, comprising 785 image-IIS pairs annotated by experts in a rigorously controlled crowdsourcing study. Furthermore, we propose WIISA (Weak-labeling for Image Intrinsic Scale Assessment), a strategy that leverages how the IIS of an image varies with downscaling to generate weak labels. Experiments show that applying WIISA during the training of several IQA methods adapted for IISA consistently improves the performance compared to using only ground-truth labels. The code, dataset, and pre-trained models are available at https://github.com/SonyResearch/IISA.
Chinese: 本研究提出了图像内在尺度概念及其评估任务,以系统量化图像尺度对感知质量的影响,并设计了一种弱标注策略,在适配尺度预测时显著提升了图像质量评估方法的性能。
English: This study introduces the Image Intrinsic Scale (IIS) and a corresponding assessment task to systematically quantify how image scale affects perceived quality, proposing a weak-labeling strategy that enhances IQA method performance when adapted for scale prediction.

Authors:Weijia Mao, Zhenheng Yang, Mike Zheng Shou
Title: UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths
Abstract:
Unified multimodal transformers, which handle both generation and understanding tasks within a shared parameter space, have received increasing attention in recent research. Although various unified transformers have been proposed, training these models is costly due to redundant tokens and heavy attention computation. In the past, studies on large language models have demonstrated that token pruning methods, such as Mixture of Depths (MoD), can significantly improve computational efficiency. MoD employs a router to select the most important ones for processing within a transformer layer. However, directly applying MoD-based token pruning to unified transformers will result in suboptimal performance because different tasks exhibit varying levels of token redundancy. In our work, we analyze the unified transformers by (1) examining attention weight patterns, (2) evaluating the layer importance and token redundancy, and (3) analyzing task interactions. Our findings reveal that token redundancy is primarily influenced by different tasks and layers. Building on these findings, we introduce UniMoD, a task-aware token pruning method that employs a separate router for each task to determine which tokens should be pruned. We apply our method to Show-o and Emu3, reducing training FLOPs by approximately 15% in Show-o and 40% in Emu3, while maintaining or improving performance on several benchmarks. Code will be released at https://github.com/showlab/UniMoD.
中文: UniMoD提出了一种任务感知的令牌剪枝方法,能在统一多模态Transformer中降低高达40%的计算成本,同时保持或提升多项任务的性能表现。
English: UniMoD introduces a task-aware token pruning method that reduces computational costs by up to 40% in unified multimodal transformers while preserving or enhancing performance across various tasks.

Authors:Sankalp Nagaonkar, Augustya Sharma, Ashish Choithani, Ashutosh Trivedi
Title: Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments
Abstract:
This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements. Three state of the art VLMs - Claude-3, Gemini-1.5, and GPT-4o are benchmarked against traditional OCR systems such as EasyOCR and RapidOCR. Evaluation metrics include Word Error Rate (WER), Character Error Rate (CER), and Accuracy. Our results highlight the strengths and limitations of VLMs in video-based OCR tasks, demonstrating their potential to outperform conventional OCR models in many scenarios. However, challenges such as hallucinations, content security policies, and sensitivity to occluded or stylized text remain. The dataset and benchmarking framework are publicly available to foster further research.
中文摘要:本文提出一个用于评估视觉语言模型在视频光学字符识别任务中表现的开源基准,结果表明尽管视觉语言模型在许多场景下优于传统OCR系统,但仍面临幻觉内容和对复杂文本敏感等挑战。
English Summary: This paper presents an open-source benchmark for evaluating Vision-Language Models on video OCR tasks, revealing that while VLMs can outperform traditional OCR systems in many scenarios, they still face challenges like hallucinations and sensitivity to complex text.

Authors:Shuhao Liao, Weihang Xia, Yuhong Cao, Weiheng Dai, Chengyang He, Wenjun Wu, Guillaume Sartoretti
Title: SIGMA: Sheaf-Informed Geometric Multi-Agent Pathfinding
Abstract:
The Multi-Agent Path Finding (MAPF) problem aims to determine the shortest and collision-free paths for multiple agents in a known, potentially obstacle-ridden environment. It is the core challenge for robotic deployments in large-scale logistics and transportation. Decentralized learning-based approaches have shown great potential for addressing the MAPF problems, offering more reactive and scalable solutions. However, existing learning-based MAPF methods usually rely on agents making decisions based on a limited field of view (FOV), resulting in short-sighted policies and inefficient cooperation in complex scenarios. There, a critical challenge is to achieve consensus on potential movements between agents based on limited observations and communications. To tackle this challenge, we introduce a new framework that applies sheaf theory to decentralized deep reinforcement learning, enabling agents to learn geometric cross-dependencies between each other through local consensus and utilize them for tightly cooperative decision-making. In particular, sheaf theory provides a mathematical proof of conditions for achieving global consensus through local observation. Inspired by this, we incorporate a neural network to approximately model the consensus in latent space based on sheaf theory and train it through self-supervised learning. During the task, in addition to normal features for MAPF as in previous works, each agent distributedly reasons about a learned consensus feature, leading to efficient cooperation on pathfinding and collision avoidance. As a result, our proposed method demonstrates significant improvements over state-of-the-art learning-based MAPF planners, especially in relatively large and complex scenarios, demonstrating its superiority over baselines in various simulations and real-world robot experiments. The code is available at https://github.com/marmotlab/SIGMA
中文: 本研究提出了一种将层理论融入分散式深度强化学习的新框架,通过使智能体达成局部共识来实现高效协作与避障,显著提升了多智能体路径规划在复杂场景中的性能表现。
English: This study introduces a novel framework that integrates sheaf theory with decentralized deep reinforcement learning to enhance multi-agent pathfinding by enabling agents to achieve local consensus for efficient cooperation and collision avoidance, demonstrating superior performance in complex scenarios.

Authors:Lingao Xiao, Songhua Liu, Yang He, Xinchao Wang
Title: Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images
Abstract:
Dataset distillation and dataset pruning are two prominent techniques for compressing datasets to improve computational and storage efficiency. Despite their overlapping objectives, these approaches are rarely compared directly. Even within each field, the evaluation protocols are inconsistent across various methods, which complicates fair comparisons and hinders reproducibility. Considering these limitations, we introduce in this paper a benchmark that equitably evaluates methodologies across both distillation and pruning literatures. Notably, our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, which heavily rely on soft labels from pre-trained models, even randomly selected subsets can achieve surprisingly competitive performance. This finding suggests that an overemphasis on soft labels may be diverting attention from the intrinsic value of the image data, while also imposing additional burdens in terms of generation, storage, and application. To address these issues, we propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively, relies solely on hard labels for evaluation, and achieves state-of-the-art performance in this setup. By shifting the emphasis back to the images, our benchmark and PCA framework pave the way for more balanced and accessible techniques in dataset compression research. Our code is available at: https://github.com/ArmandXiao/Rethinking-Dataset-Compression
中文: 本文提出了一个公平评估数据集蒸馏与剪枝方法的基准,发现随机子集在主流设定中可与依赖软标签的复杂方法相媲美,并提出了仅使用硬标签的新框架PCA,通过聚焦图像数据实现了最优性能。
English: This paper introduces a benchmark that fairly compares dataset distillation and pruning methods, revealing that random subsets can match the performance of complex soft-label approaches, and proposes a new hard-label-only framework called PCA that achieves state-of-the-art results by focusing on image data.

Authors:Huaqiu Li, Wang Zhang, Xiaowan Hu, Tao Jiang, Zikang Chen, Haoqian Wang
Title: Prompt-SID: Learning Structural Representation Prompt via Latent Diffusion for Single-Image Denoising
Abstract:
Many studies have concentrated on constructing supervised models utilizing paired datasets for image denoising, which proves to be expensive and time-consuming. Current self-supervised and unsupervised approaches typically rely on blind-spot networks or sub-image pairs sampling, resulting in pixel information loss and destruction of detailed structural information, thereby significantly constraining the efficacy of such methods. In this paper, we introduce Prompt-SID, a prompt-learning-based single image denoising framework that emphasizes preserving of structural details. This approach is trained in a self-supervised manner using downsampled image pairs. It captures original-scale image information through structural encoding and integrates this prompt into the denoiser. To achieve this, we propose a structural representation generation model based on the latent diffusion process and design a structural attention module within the transformer-based denoiser architecture to decode the prompt. Additionally, we introduce a scale replay training mechanism, which effectively mitigates the scale gap from images of different resolutions. We conduct comprehensive experiments on synthetic, real-world, and fluorescence imaging datasets, showcasing the remarkable effectiveness of Prompt-SID. Our code will be released at https://github.com/huaqlili/Prompt-SID.
中文: 本文提出Prompt-SID自监督图像去噪框架,通过提示学习保留结构细节,克服了现有方法的局限,在多个数据集上展现出卓越性能。
English: This paper introduces Prompt-SID, a self-supervised image denoising framework that uses prompt learning to preserve structural details, overcoming limitations of existing methods and demonstrating strong performance across multiple datasets.

Authors:Qian Chen, Xingjian Dong, Kui Hu, Kangkang Chen, Zhike Peng, Guang Meng
Title: CS-SHAP: Extending SHAP to Cyclic-Spectral Domain for Better Interpretability of Intelligent Fault Diagnosis
Abstract:
Neural networks (NNs), with their powerful nonlinear mapping and end-to-end capabilities, are widely applied in mechanical intelligent fault diagnosis (IFD). However, as typical black-box models, they pose challenges in understanding their decision basis and logic, limiting their deployment in high-reliability scenarios. Hence, various methods have been proposed to enhance the interpretability of IFD. Among these, post-hoc approaches can provide explanations without changing model architecture, preserving its flexibility and scalability. However, existing post-hoc methods often suffer from limitations in explanation forms. They either require preprocessing that disrupts the end-to-end nature or overlook fault mechanisms, leading to suboptimal explanations. To address these issues, we derived the cyclic-spectral (CS) transform and proposed the CS-SHAP by extending Shapley additive explanations (SHAP) to the CS domain. CS-SHAP can evaluate contributions from both carrier and modulation frequencies, aligning more closely with fault mechanisms and delivering clearer and more accurate explanations. Three datasets are utilized to validate the superior interpretability of CS-SHAP, ensuring its correctness, reproducibility, and practical performance. With open-source code and outstanding interpretability, CS-SHAP has the potential to be widely adopted and become the post-hoc interpretability benchmark in IFD, even in other classification tasks. The code is available on https://github.com/ChenQian0618/CS-SHAP.
中文: 针对神经网络在机械智能故障诊断中可解释性不足的问题,本研究提出CS-SHAP方法,通过将SHAP扩展至循环谱域实现与故障机理契合的清晰解释,经三个数据集验证具备卓越性能。
English: Neural networks are widely used in mechanical intelligent fault diagnosis but face interpretability challenges, which the proposed CS-SHAP method addresses by extending SHAP to the cyclic-spectral domain for clearer, mechanism-aligned explanations validated across three datasets.

Authors:Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang
Title: Systematic Outliers in Large Language Models
Abstract:
Outliers have been widely observed in Large Language Models (LLMs), significantly impacting model performance and posing challenges for model compression. Understanding the functionality and formation mechanisms of these outliers is critically important. Existing works, however, largely focus on reducing the impact of outliers from an algorithmic perspective, lacking an in-depth investigation into their causes and roles. In this work, we provide a detailed analysis of the formation process, underlying causes, and functions of outliers in LLMs. We define and categorize three types of outliers-activation outliers, weight outliers, and attention outliers-and analyze their distributions across different dimensions, uncovering inherent connections between their occurrences and their ultimate influence on the attention mechanism. Based on these observations, we hypothesize and explore the mechanisms by which these outliers arise and function, demonstrating through theoretical derivations and experiments that they emerge due to the self-attention mechanism's softmax operation. These outliers act as implicit context-aware scaling factors within the attention mechanism. As these outliers stem from systematic influences, we term them systematic outliers. Our study not only enhances the understanding of Transformer-based LLMs but also shows that structurally eliminating outliers can accelerate convergence and improve model compression. The code is avilable at https://github.com/an-yongqi/systematic-outliers.
中文摘要:本研究分析了大型语言模型中异常值的形成机制、成因及功能,揭示其源于自注意力机制的softmax操作并作为隐式缩放因子发挥作用,实验表明结构性地消除这些异常值可加速模型收敛并提升压缩效果。
English Summary: This study analyzes the formation, causes, and functions of outliers in Large Language Models, revealing they emerge from the softmax operation in self-attention and serve as implicit scaling factors, with their structural elimination shown to accelerate convergence and improve model compression.

Authors:Aobotao Dai, Xinyu Ma, Lei Chen, Songze Li, Lin Wang
Title: When Data Manipulation Meets Attack Goals: An In-depth Survey of Attacks for VLMs
Abstract:
Vision-Language Models (VLMs) have gained considerable prominence in recent years due to their remarkable capability to effectively integrate and process both textual and visual information. This integration has significantly enhanced performance across a diverse spectrum of applications, such as scene perception and robotics. However, the deployment of VLMs has also given rise to critical safety and security concerns, necessitating extensive research to assess the potential vulnerabilities these VLM systems may harbor. In this work, we present an in-depth survey of the attack strategies tailored for VLMs. We categorize these attacks based on their underlying objectives - namely jailbreak, camouflage, and exploitation - while also detailing the various methodologies employed for data manipulation of VLMs. Meanwhile, we outline corresponding defense mechanisms that have been proposed to mitigate these vulnerabilities. By discerning key connections and distinctions among the diverse types of attacks, we propose a compelling taxonomy for VLM attacks. Moreover, we summarize the evaluation metrics that comprehensively describe the characteristics and impact of different attacks on VLMs. Finally, we conclude with a discussion of promising future research directions that could further enhance the robustness and safety of VLMs, emphasizing the importance of ongoing exploration in this critical area of study. To facilitate community engagement, we maintain an up-to-date project page, accessible at: https://github.com/AobtDai/VLM_Attack_Paper_List.
中文: 视觉语言模型(VLMs)在提升多模态应用性能的同时,面临越狱、伪装等安全威胁,促使研究防御策略并探索未来防护方向。
English: Vision-Language Models (VLMs) enhance multimodal applications but face security threats like jailbreak and camouflage attacks, prompting research into defense mechanisms and future safeguards.

Authors:Yiru Jiao, Sander van Cranenburgh, Simeon Calvert, Hans van Lint
Title: Structure-preserving contrastive learning for spatial time series
Abstract:
The effectiveness of neural network models largely relies on learning meaningful latent patterns from data, where self-supervised learning of informative representations can enhance model performance and generalisability. However, self-supervised representation learning for spatially characterised time series, which are ubiquitous in transportation domain, poses unique challenges due to the necessity of maintaining fine-grained spatio-temporal similarities in the latent space. In this study, we introduce two structure-preserving regularisers for the contrastive learning of spatial time series: one regulariser preserves the topology of similarities between instances, and the other preserves the graph geometry of similarities across spatial and temporal dimensions. To balance the contrastive learning objective and the need for structure preservation, we propose a dynamic weighting mechanism that adaptively manages this trade-off and stabilises training. We validate the proposed method through extensive experiments, including multivariate time series classification to demonstrate its general applicability, as well as macroscopic and microscopic traffic prediction to highlight its particular usefulness in encoding traffic interactions. Across all tasks, our method preserves the similarity structures more effectively and improves state-of-the-art task performances. This method can be integrated with an arbitrary neural network model and is particularly beneficial for time series data with spatial or geographical features. Furthermore, our findings suggest that well-preserved similarity structures in the latent space indicate more informative and useful representations. This provides insights to design more effective neural networks for data-driven transportation research. Our code is made openly accessible with all resulting data at https://github.com/yiru-jiao/spclt
中文: 本研究针对空间时间序列提出了两个结构保持正则化器和一个动态权重机制,通过对比学习有效保持时空相似性结构,在多项任务中提升了模型性能,尤其适用于交通领域的预测应用。
English: This study introduces two structure-preserving regularizers and a dynamic weighting mechanism for contrastive learning of spatial time series, which effectively maintains spatio-temporal similarities and enhances model performance across various tasks, particularly in transportation applications.

Authors:Filip Ekström Kelvinius, Zheng Zhao, Fredrik Lindsten
Title: Solving Linear-Gaussian Bayesian Inverse Problems with Decoupled Diffusion Sequential Monte Carlo
Abstract:
A recent line of research has exploited pre-trained generative diffusion models as priors for solving Bayesian inverse problems. We contribute to this research direction by designing a sequential Monte Carlo method for linear-Gaussian inverse problems which builds on "decoupled diffusion", where the generative process is designed such that larger updates to the sample are possible. The method is asymptotically exact and we demonstrate the effectiveness of our Decoupled Diffusion Sequential Monte Carlo (DDSMC) algorithm on both synthetic as well as protein and image data. Further, we demonstrate how the approach can be extended to discrete data.
中文摘要:本文针对线性高斯逆问题提出了一种基于解耦扩散的序列蒙特卡洛方法,该方法允许更大的样本更新且具有渐近精确性,并在合成数据、蛋白质和图像数据上验证了有效性,同时展示了向离散数据的扩展能力。
English Summary: This paper introduces a sequential Monte Carlo method for linear-Gaussian inverse problems using decoupled diffusion, which enables larger sample updates and is proven asymptotically exact, with validation on synthetic, protein, and image data, plus an extension to discrete data.

Authors:Oliver Boyne, Roberto Cipolla
Title: FOCUS -- Multi-View Foot Reconstruction From Synthetically Trained Dense Correspondences
Abstract:
Surface reconstruction from multiple, calibrated images is a challenging task - often requiring a large number of collected images with significant overlap. We look at the specific case of human foot reconstruction. As with previous successful foot reconstruction work, we seek to extract rich per-pixel geometry cues from multi-view RGB images, and fuse these into a final 3D object. Our method, FOCUS, tackles this problem with 3 main contributions: (i) SynFoot2, an extension of an existing synthetic foot dataset to include a new data type: dense correspondence with the parameterized foot model FIND; (ii) an uncertainty-aware dense correspondence predictor trained on our synthetic dataset; (iii) two methods for reconstructing a 3D surface from dense correspondence predictions: one inspired by Structure-from-Motion, and one optimization-based using the FIND model. We show that our reconstruction achieves state-of-the-art reconstruction quality in a few-view setting, performing comparably to state-of-the-art when many views are available, and runs substantially faster. We release our synthetic dataset to the research community. Code is available at: https://github.com/OllieBoyne/FOCUS
Chinese: FOCUS方法通过扩展合成数据集、引入不确定性感知对应点预测器及两种表面重建技术,在少量视角下实现了顶尖的三维足部重建质量,且运行速度显著提升。
English: The FOCUS method introduces a synthetic dataset extension, an uncertainty-aware correspondence predictor, and two surface reconstruction techniques to achieve state-of-the-art 3D foot reconstruction quality with fewer views and faster processing.

Authors:Sihwan Park, Doohyuk Jang, Sungyub Kim, Souvik Kundu, Eunho Yang
Title: LANTERN++: Enhancing Relaxed Speculative Decoding with Static Tree Drafting for Visual Auto-regressive Models
Abstract:
Speculative decoding has been widely used to accelerate auto-regressive (AR) text generation. However, its effectiveness for visual AR models remains limited due to token selection ambiguity, where multiple tokens share similarly low probabilities and thus reduce acceptance rates. Recently, relaxed speculative decoding with dynamic tree drafting was proposed to mitigate this ambiguity, demonstrating promising results in accelerating visual AR models. However, we observe that token selection ambiguity still negatively affects dynamic tree drafting, resulting in shallow draft trees and limited acceleration. To overcome this issue, we introduce LANTERN++, a refined framework that integrates static tree drafting with a tailored relaxed acceptance condition, allowing drafts to be selected independently of low-confidence predictions. This enables the acceptance of deeper sequences, improving decoding efficiency while preserving image quality. Extensive experiments on state-of-the-art visual AR models demonstrate that LANTERN++ significantly accelerates inference, achieving up to $\mathbf{\times 2.56}$ speedup over standard AR decoding while maintaining high image quality. The code is publicly available at https://github.com/jadohu/LANTERN.
中文总结:LANTERN++ 通过结合静态树草拟和定制化的宽松接受条件,改进了视觉自回归模型的推测解码,在保持图像质量的同时实现了高达 2.56 倍的加速效果。
English Summary: LANTERN++ enhances speculative decoding for visual auto-regressive models by integrating static tree drafting with a custom relaxed acceptance condition, achieving up to 2.56× speedup while preserving image quality.

Authors:Yawei Li, David Rügamer, Bernd Bischl, Mina Rezaei
Title: Calibrating LLMs with Information-Theoretic Evidential Deep Learning
Abstract:
Fine-tuned large language models (LLMs) often exhibit overconfidence, particularly when trained on small datasets, resulting in poor calibration and inaccurate uncertainty estimates. Evidential Deep Learning (EDL), an uncertainty-aware approach, enables uncertainty estimation in a single forward pass, making it a promising method for calibrating fine-tuned LLMs. However, despite its computational efficiency, EDL is prone to overfitting, as its training objective can result in overly concentrated probability distributions. To mitigate this, we propose regularizing EDL by incorporating an information bottleneck (IB). Our approach IB-EDL suppresses spurious information in the evidence generated by the model and encourages truly predictive information to influence both the predictions and uncertainty estimates. Extensive experiments across various fine-tuned LLMs and tasks demonstrate that IB-EDL outperforms both existing EDL and non-EDL approaches. By improving the trustworthiness of LLMs, IB-EDL facilitates their broader adoption in domains requiring high levels of confidence calibration. Code is available at https://github.com/sandylaker/ib-edl.
中文: 针对微调后大语言模型常出现的过度自信和校准不佳问题,本文提出IB-EDL方法,通过信息瓶颈正则化证据深度学习,有效提升不确定性估计能力并增强模型可信度。
English: Fine-tuned large language models often suffer from overconfidence and poor calibration, which is addressed by the proposed IB-EDL method that regularizes Evidential Deep Learning with an information bottleneck to enhance uncertainty estimation and model trustworthiness.

Authors:Qi Wang, Tianfei Zhou, Ye Yuan, Rui Mao
Title: Prompt-Driven Continual Graph Learning
Abstract:
Continual Graph Learning (CGL), which aims to accommodate new tasks over evolving graph data without forgetting prior knowledge, is garnering significant research interest. Mainstream solutions adopt the memory replay-based idea, ie, caching representative data from earlier tasks for retraining the graph model. However, this strategy struggles with scalability issues for constantly evolving graphs and raises concerns regarding data privacy. Inspired by recent advancements in the prompt-based learning paradigm, this paper introduces a novel prompt-driven continual graph learning (PROMPTCGL) framework, which learns a separate prompt for each incoming task and maintains the underlying graph neural network model fixed. In this way, PROMPTCGL naturally avoids catastrophic forgetting of knowledge from previous tasks. More specifically, we propose hierarchical prompting to instruct the model from both feature- and topology-level to fully address the variability of task graphs in dynamic continual learning. Additionally, we develop a personalized prompt generator to generate tailored prompts for each graph node while minimizing the number of prompts needed, leading to constant memory consumption regardless of the graph scale. Extensive experiments on four benchmarks show that PROMPTCGL achieves superior performance against existing CGL approaches while significantly reducing memory consumption. Our code is available at https://github.com/QiWang98/PromptCGL.
中文:本文提出了PROMPTCGL框架,通过分层提示和个性化提示生成器实现持续图学习,既能防止灾难性遗忘又保持恒定内存消耗,在性能与效率上均优于现有方法。
English: This paper introduces PROMPTCGL, a prompt-driven framework for continual graph learning that uses hierarchical prompting and a personalized prompt generator to prevent catastrophic forgetting while maintaining constant memory consumption, demonstrating superior performance and efficiency over existing methods.

Authors:Jian Sun, Wei Sun, Genwei Zhang, Kailun Yang, Song Li, Xiangqi Meng, Na Deng, Chongbin Tan
Title: CT-UIO: Continuous-Time UWB-Inertial-Odometer Localization Using Non-Uniform B-spline with Fewer Anchors
Abstract:
Ultra-wideband (UWB) based positioning with fewer anchors has attracted significant research interest in recent years, especially under energy-constrained conditions. However, most existing methods rely on discrete-time representations and smoothness priors to infer a robot's motion states, which often struggle with ensuring multi-sensor data synchronization. In this paper, we present an efficient UWB-Inertial-odometer localization system, utilizing a non-uniform B-spline framework with fewer anchors. Unlike traditional uniform B-spline-based continuous-time methods, we introduce an adaptive knot-span adjustment strategy for non-uniform continuous-time trajectory representation. This is accomplished by adjusting control points dynamically based on movement speed. To enable efficient fusion of IMU and odometer data, we propose an improved Extended Kalman Filter (EKF) with innovation-based adaptive estimation to provide short-term accurate motion prior. Furthermore, to address the challenge of achieving a fully observable UWB localization system under few-anchor conditions, the Virtual Anchor (VA) generation method based on multiple hypotheses is proposed. At the backend, we propose a CT-UIO factor graph with an adaptive sliding window for global trajectory estimation. Comprehensive experiments conducted on corridor and exhibition hall datasets validate the proposed system's high precision and robust performance. The codebase and datasets of this work will be open-sourced at https://github.com/JasonSun623/CT-UIO.
中文: 本文提出了一种基于非均匀B样条的UWB-惯性-里程计定位系统,采用自适应节点跨度调整策略和改进的扩展卡尔曼滤波,在少锚点条件下实现了高精度的连续时间轨迹估计,走廊和展厅数据集验证了其优越性能。
English: This paper introduces an efficient UWB-Inertial-odometer localization system using a non-uniform B-spline framework with fewer anchors, featuring adaptive knot-span adjustment and improved EKF for robust sensor fusion, validated through experiments in corridor and exhibition hall environments.

Authors:Haiduo Huang, Fuwei Yang, Zhenhua Liu, Yixing Xu, Jinze Li, Yang Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum
Title: Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE
Abstract:
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens, which are then verified in parallel by the larger target model. However, the limited capacity of the draft model often necessitates tree-based sampling to improve prediction accuracy, where multiple candidates are generated at each step. We identify a key limitation in this approach: the candidates at the same step are derived from the same representation, limiting diversity and reducing overall effectiveness. To address this, we propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions, effectively decoupling correlations among candidates. Furthermore, we introduce a hybrid inference strategy, combining autoregressive decoding for initial tokens with parallel decoding for subsequent stages, and enhance the latter with contrastive mechanism in features to improve accuracy. Our method significantly boosts prediction accuracy and achieves higher inference speedups. Extensive experiments across diverse models validate the effectiveness and robustness of our approach, establishing a new SOTA in speculative decoding. Our codes are available at https://github.com/haiduo/Jakiro.
中文: Jakiro通过采用专家混合模型生成多样化的令牌预测并结合混合推理策略,显著提高了推测解码的准确性和推理速度,在不同模型上均验证了其有效性。
English: Jakiro enhances speculative decoding by employing Mixture of Experts to generate diverse token predictions and a hybrid inference strategy, significantly improving both accuracy and inference speed across various models.

Authors:Zhixun Li, Dingshuo Chen, Tong Zhao, Daixin Wang, Hongrui Liu, Zhiqiang Zhang, Jun Zhou, Jeffrey Xu Yu
Title: IceBerg: Debiased Self-Training for Class-Imbalanced Node Classification
Abstract:
Graph Neural Networks (GNNs) have achieved great success in dealing with non-Euclidean graph-structured data and have been widely deployed in many real-world applications. However, their effectiveness is often jeopardized under class-imbalanced training sets. Most existing studies have analyzed class-imbalanced node classification from a supervised learning perspective, but they do not fully utilize the large number of unlabeled nodes in semi-supervised scenarios. We claim that the supervised signal is just the tip of the iceberg and a large number of unlabeled nodes have not yet been effectively utilized. In this work, we propose IceBerg, a debiased self-training framework to address the class-imbalanced and few-shot challenges for GNNs at the same time. Specifically, to figure out the Matthew effect and label distribution shift in self-training, we propose Double Balancing, which can largely improve the performance of existing baselines with just a few lines of code as a simple plug-and-play module. Secondly, to enhance the long-range propagation capability of GNNs, we disentangle the propagation and transformation operations of GNNs. Therefore, the weak supervision signals can propagate more effectively to address the few-shot issue. In summary, we find that leveraging unlabeled nodes can significantly enhance the performance of GNNs in class-imbalanced and few-shot scenarios, and even small, surgical modifications can lead to substantial performance improvements. Systematic experiments on benchmark datasets show that our method can deliver considerable performance gain over existing class-imbalanced node classification baselines. Additionally, due to IceBerg's outstanding ability to leverage unsupervised signals, it also achieves state-of-the-art results in few-shot node classification scenarios. The code of IceBerg is available at: https://github.com/ZhixunLEE/IceBerg.
中文: 本文提出IceBerg框架,通过利用未标记节点和双重平衡技术,有效提升图神经网络在类别不平衡和少样本场景下的性能表现。
English: The paper introduces IceBerg, a debiased self-training framework that enhances Graph Neural Networks' performance in class-imbalanced and few-shot scenarios by effectively leveraging unlabeled nodes and implementing double balancing techniques.

Authors:Zhaoying Wang, Yingdan Shi, Xiang Liu, Can Chen, Jun Wen, Ren Wang
Title: HODDI: A Dataset of High-Order Drug-Drug Interactions for Computational Pharmacovigilance
Abstract:
Drug-side effect research is vital for understanding adverse reactions arising in complex multi-drug therapies. However, the scarcity of higher-order datasets that capture the combinatorial effects of multiple drugs severely limits progress in this field. Existing resources such as TWOSIDES primarily focus on pairwise interactions. To fill this critical gap, we introduce HODDI, the first Higher-Order Drug-Drug Interaction Dataset, constructed from U.S. Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) records spanning the past decade, to advance computational pharmacovigilance. HODDI contains 109,744 records involving 2,506 unique drugs and 4,569 unique side effects, specifically curated to capture multi-drug interactions and their collective impact on adverse effects. Comprehensive statistical analyses demonstrate HODDI's extensive coverage and robust analytical metrics, making it a valuable resource for studying higher-order drug relationships. Evaluating HODDI with multiple models, we found that simple Multi-Layer Perceptron (MLP) can outperform graph models, while hypergraph models demonstrate superior performance in capturing complex multi-drug interactions, further validating HODDI's effectiveness. Our findings highlight the inherent value of higher-order information in drug-side effect prediction and position HODDI as a benchmark dataset for advancing research in pharmacovigilance, drug safety, and personalized medicine. The dataset and codes are available at https://github.com/TIML-Group/HODDI.
中文: HODDI是首个高阶药物相互作用数据集,基于FDA记录构建,填补了多药治疗副作用数据的空白,评估显示超图模型在捕捉复杂相互作用方面表现优异,成为药物警戒研究的基准资源。
English: HODDI is the first higher-order drug-drug interaction dataset, created from FDA records to address the scarcity of data on multi-drug therapy side effects, demonstrating through evaluations that hypergraph models excel in capturing complex interactions and establishing it as a benchmark for pharmacovigilance research.

Authors:Zhichen Dong, Zhanhui Zhou, Zhixuan Liu, Chao Yang, Chaochao Lu
Title: Emergent Response Planning in LLMs
Abstract:
In this work, we argue that large language models (LLMs), though trained to predict only the next token, exhibit emergent planning behaviors: $\textbf{their hidden representations encode future outputs beyond the next token}$. Through simple probing, we demonstrate that LLM prompt representations encode global attributes of their entire responses, including $\textit{structure attributes}$ (e.g., response length, reasoning steps), $\textit{content attributes}$ (e.g., character choices in storywriting, multiple-choice answers at the end of response), and $\textit{behavior attributes}$ (e.g., answer confidence, factual consistency). In addition to identifying response planning, we explore how it scales with model size across tasks and how it evolves during generation. The findings that LLMs plan ahead for the future in their hidden representations suggest potential applications for improving transparency and generation control.
中文: 本研究发现大型语言模型通过隐藏表征编码未来输出的结构、内容和行为等全局属性,展现出涌现的规划能力,这种能力随模型规模扩展并在生成过程中演变,为提升透明度和控制力提供了可能。
English: This study reveals that large language models exhibit emergent planning capabilities by encoding future response attributes—such as structure, content, and behavior—in their hidden representations, which scales with model size and evolves during generation, offering potential for enhanced transparency and control.

Authors:Yueqing Wang, Yikun Mei, Zhen Gao, Ziwei Wan, Boyu Ning, De Mi, Sami Muhaidat
Title: Pre-Equalization Aided Grant-Free Massive Access in Massive MIMO System
Abstract:
The spatial diversity and multiplexing advantages of massive multi-input-multi-output (mMIMO) can significantly improve the capacity of massive non-orthogonal multiple access (NOMA) in machine type communications. However, state-of-the-art grant-free massive NOMA schemes for mMIMO systems require accurate estimation of random access channels to perform activity detection and the following coherent data demodulation, which suffers from excessive pilot overhead and access latency. To address this, we propose a pre-equalization aided grant-free massive access scheme for mMIMO systems, where an iterative detection scheme is conceived. Specifically, the base station (BS) firstly activates one of its antennas (i.e., beacon antenna) to broadcast a beacon signal, which facilitates the user equipment (UEs) to perform downlink channel estimation and pre-equalize the uplink random access signal with respect to the channels associated with the beacon antenna. During the uplink transmission stage, the BS detects UEs' activity and data by using the proposed iterative detection algorithm, which consists of three modules: coarse data detection (DD), data-aided channel estimation (CE), and fine DD. In the proposed algorithm, the joint activity and DD is firstly performed based on the signals received by the beacon antenna. Subsequently, the DD is further refined by iteratively performing data-aided CE module and fine DD module using signals received by all BS antennas. Our simulation results demonstrate that the proposed scheme outperforms state-of-the-art mMIMO-based grant-free massive NOMA schemes with the same access latency. Simulation codes are provided to reproduce the results in this article: https://github.com/owenwang517/tvt-2025.
中文: 本文提出了一种针对大规模多输入多输出系统的预均衡辅助免授权大规模接入方案,通过采用迭代检测算法降低导频开销和接入延迟,在相同延迟条件下性能优于现有方案。
English: This paper proposes a pre-equalization aided grant-free massive access scheme for mMIMO systems, which reduces pilot overhead and access latency by using an iterative detection algorithm and outperforms existing schemes with the same latency.

Authors:Zhe Huang, Tianchen Ji, Heling Zhang, Fatemeh Cheraghi Pouria, Katherine Driggs-Campbell, Roy Dong
Title: Interaction-aware Conformal Prediction for Crowd Navigation
Abstract:
During crowd navigation, robot motion plan needs to consider human motion uncertainty, and the human motion uncertainty is dependent on the robot motion plan. We introduce Interaction-aware Conformal Prediction (ICP) to alternate uncertainty-aware robot motion planning and decision-dependent human motion uncertainty quantification. ICP is composed of a trajectory predictor to predict human trajectories, a model predictive controller to plan robot motion with confidence interval radii added for probabilistic safety, a human simulator to collect human trajectory calibration dataset conditioned on the planned robot motion, and a conformal prediction module to quantify trajectory prediction error on the decision-dependent calibration dataset. Crowd navigation simulation experiments show that ICP strikes a good balance of performance among navigation efficiency, social awareness, and uncertainty quantification compared to previous works. ICP generalizes well to navigation tasks under various crowd densities. The fast runtime and efficient memory usage make ICP practical for real-world applications. Code is available at https://github.com/tedhuang96/icp.
Chinese Summary: 本文提出交互感知的保形预测方法,通过交替进行机器人运动规划与人类不确定性量化,在人群导航中实现了导航效率、社会意识及不确定性处理的良好平衡。
English Summary: The paper introduces Interaction-aware Conformal Prediction (ICP), a method that alternates between robot motion planning and human uncertainty quantification to achieve balanced performance in navigation efficiency, social awareness, and uncertainty handling during crowd navigation.

Authors:Yu Wang, Nan Yang, Liang Wang, Furu Wei, Fuli Feng
Title: Examining False Positives under Inference Scaling for Mathematical Reasoning
Abstract:
Recent advancements in language models have led to significant improvements in mathematical reasoning across various benchmarks. However, most of these benchmarks rely on automatic evaluation methods that only compare final answers using heuristics, without verifying the underlying reasoning steps. This limitation results in false positive solutions, where models may produce correct final answers but with flawed deduction paths. In this paper, we systematically examine the prevalence of false positive solutions in mathematical problem solving for language models. We analyze the characteristics and extent of this issue across different open-source models, datasets of varying difficulty levels, and decoding strategies. Specifically, we explore how false positives influence the inference time scaling behavior of language models. Our experimental results reveal that: (1) false positive solutions persist across different models, datasets, and decoding methods, (2) sampling-based inference time scaling methods do not alleviate the problem, and (3) the pass@N evaluation metric is more susceptible to false positives, suggesting a significantly lower scaling ceiling than what automatic evaluations indicate. Additionally, we analyze specific instances of false positives and discuss potential limitations in self-improvement techniques and synthetic data generation under such conditions. Our data and code are publicly available at https://github.com/Wloner0809/False-Positives-in-Math.
Chinese: 当前语言模型在数学推理中普遍存在虚假正解现象,即答案正确但推理过程存在缺陷,这一问题在不同模型和数据集上持续存在,并削弱了pass@N等自动评估指标的可靠性。
English: Current language models often produce false positive solutions in mathematical reasoning where correct answers mask flawed deduction processes, which persist across various models and datasets and undermine the reliability of automatic evaluation metrics like pass@N.

Authors:Zhi Li, Jiang Wang, Xiaoyang Li, He Kong
Title: Improved Extrinsic Calibration of Acoustic Cameras via Batch Optimization
Abstract:
Acoustic cameras have found many applications in practice. Accurate and reliable extrinsic calibration of the microphone array and visual sensors within acoustic cameras is crucial for fusing visual and auditory measurements. Existing calibration methods either require prior knowledge of the microphone array geometry or rely on grid search which suffers from slow iteration speed or poor convergence. To overcome these limitations, in this paper, we propose an automatic calibration technique using a calibration board with both visual and acoustic markers to identify each microphone position in the camera frame. We formulate the extrinsic calibration problem (between microphones and the visual sensor) as a nonlinear least squares problem and employ a batch optimization strategy to solve the associated problem. Extensive numerical simulations and realworld experiments show that the proposed method improves both the accuracy and robustness of extrinsic parameter calibration for acoustic cameras, in comparison to existing methods. To benefit the community, we open-source all the codes and data at https://github.com/AISLAB-sustech/AcousticCamera.
中文摘要:本文提出一种声学相机自动标定方法,通过结合视觉与声学标记的标定板确定麦克风在相机坐标系中的位置,采用非线性优化求解外参标定问题,实验证明该方法在精度和鲁棒性上优于现有技术。
English Summary: This paper introduces an automatic calibration method for acoustic cameras that uses a dual-marker calibration board to precisely locate microphones in the camera's coordinate system, solving the problem through nonlinear optimization and demonstrating superior accuracy and robustness in experiments.

Authors:Chengjie Zhang, Wenda Pan, Xinyang Han, He Kong
Title: Calibration of Multiple Asynchronous Microphone Arrays using Hybrid TDOA
Abstract:
Accurate calibration of acoustic sensing systems made of multiple asynchronous microphone arrays is essential for satisfactory performance in sound source localization and tracking. State-of-the-art calibration methods for this type of system rely on the time difference of arrival and direction of arrival measurements among the microphone arrays (denoted as TDOA-M and DOA, respectively). In this paper, to enhance calibration accuracy, we propose to incorporate the time difference of arrival measurements between adjacent sound events (TDOAS) with respect to the microphone arrays. More specifically, we propose a two-stage calibration approach, including an initial value estimation (IVE) procedure and the final joint optimization step. The IVE stage first initializes all parameters except for microphone array orientations, using hybrid TDOA (i.e., TDOAM and TDOA-S), odometer data from a moving robot carrying a speaker, and DOA. Subsequently, microphone orientations are estimated through the iterative closest point method. The final joint optimization step estimates multiple microphone array locations, orientations, time offsets, clock drift rates, and sound source locations simultaneously. Both simulation and experiment results show that for scenarios with low or moderate TDOA noise levels, our approach outperforms existing methods in terms of accuracy. All code and data are available at https://github.com/AISLABsustech/Hybrid-TDOA-Multi-Calib.
中文: 本文提出了一种两阶段校准方法,通过结合声音事件间的到达时间差测量与现有技术,提高了多阵列声学传感系统的校准精度,在低至中等噪声水平的仿真和实验中均表现出优于现有方法的性能。
English: This paper introduces a two-stage calibration method that combines time difference of arrival measurements between sound events with existing techniques to improve the accuracy of multi-array acoustic sensing systems, demonstrating superior performance in simulations and experiments under low to moderate noise conditions.

Authors:Guanglong Sun, Hongwei Yan, Liyuan Wang, Qian Li, Bo Lei, Yi Zhong
Title: Right Time to Learn:Promoting Generalization via Bio-inspired Spacing Effect in Knowledge Distillation
Abstract:
Knowledge distillation (KD) is a powerful strategy for training deep neural networks (DNNs). Although it was originally proposed to train a more compact "student" model from a large "teacher" model, many recent efforts have focused on adapting it to promote generalization of the model itself, such as online KD and self KD. Here, we propose an accessible and compatible strategy named Spaced KD to improve the effectiveness of both online KD and self KD, in which the student model distills knowledge from a teacher model trained with a space interval ahead. This strategy is inspired by a prominent theory named spacing effect in biological learning and memory, positing that appropriate intervals between learning trials can significantly enhance learning performance. With both theoretical and empirical analyses, we demonstrate that the benefits of the proposed Spaced KD stem from convergence to a flatter loss landscape during stochastic gradient descent (SGD). We perform extensive experiments to validate the effectiveness of Spaced KD in improving the learning performance of DNNs (e.g., the performance gain is up to 2.31% and 3.34% on Tiny-ImageNet over online KD and self KD, respectively). Our codes have been released on github https://github.com/SunGL001/Spaced-KD.
Chinese: 作者提出Spaced KD策略,受生物学习中的间隔效应启发,通过让学生模型从提前训练的教师模型中提取知识,有效提升了在线和自知识蒸馏的性能,实现了更平坦的损失景观,在Tiny-ImageNet上分别获得最高2.31%和3.34%的性能提升。
English: The authors propose Spaced KD, a strategy inspired by the spacing effect in biological learning that enhances online and self-knowledge distillation by having the student learn from a teacher trained with a time interval, leading to improved generalization through a flatter loss landscape and performance gains of up to 2.31% and 3.34% on Tiny-ImageNet.

Authors:Naome A. Etori, Maria L. Gini
Title: RideKE: Leveraging Low-Resource, User-Generated Twitter Content for Sentiment and Emotion Detection in Kenyan Code-Switched Dataset
Abstract:
Social media has become a crucial open-access platform for individuals to express opinions and share experiences. However, leveraging low-resource language data from Twitter is challenging due to scarce, poor-quality content and the major variations in language use, such as slang and code-switching. Identifying tweets in these languages can be difficult as Twitter primarily supports high-resource languages. We analyze Kenyan code-switched data and evaluate four state-of-the-art (SOTA) transformer-based pretrained models for sentiment and emotion classification, using supervised and semi-supervised methods. We detail the methodology behind data collection and annotation, and the challenges encountered during the data curation phase. Our results show that XLM-R outperforms other models; for sentiment analysis, XLM-R supervised model achieves the highest accuracy (69.2\%) and F1 score (66.1\%), XLM-R semi-supervised (67.2\% accuracy, 64.1\% F1 score). In emotion analysis, DistilBERT supervised leads in accuracy (59.8\%) and F1 score (31\%), mBERT semi-supervised (accuracy (59\% and F1 score 26.5\%). AfriBERTa models show the lowest accuracy and F1 scores. All models tend to predict neutral sentiment, with Afri-BERT showing the highest bias and unique sensitivity to empathy emotion. https://github.com/NEtori21/Ride_hailing
中文: 本研究评估了四种基于Transformer的模型对肯尼亚语码转换推特数据进行情感与情绪分类的效果,发现XLM-R在情感分析中表现最优而DistilBERT在情绪分析中领先,所有模型均呈现中性预测倾向并具有独特的情感偏差特征。
English: This study evaluates four transformer-based models for sentiment and emotion classification on Kenyan code-switched Twitter data, finding that XLM-R performs best for sentiment analysis while DistilBERT leads in emotion analysis, with all models showing a tendency toward neutral predictions and unique emotional biases.

Authors:Ce Zhang, Zifu Wan, Zhehan Kan, Martin Q. Ma, Simon Stepputtis, Deva Ramanan, Russ Salakhutdinov, Louis-Philippe Morency, Katia Sycara, Yaqi Xie
Title: Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models
Abstract:
While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that the text-to-image generation process is the inverse of image-conditioned response generation in LVLMs, we explore the potential of leveraging text-to-image generative models to assist in mitigating hallucinations in LVLMs. We discover that generative models can offer valuable self-feedback for mitigating hallucinations at both the response and token levels. Building on this insight, we introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process to effectively mitigate hallucinations in LVLMs. Specifically, DeGF generates an image from the initial response produced by LVLMs, which acts as an auxiliary visual reference and provides self-feedback to verify and correct the initial response through complementary or contrastive decoding. Extensive experimental results validate the effectiveness of our approach in mitigating diverse types of hallucinations, consistently surpassing state-of-the-art methods across six benchmarks. Code is available at https://github.com/zhangce01/DeGF.
Chinese: 近期大型视觉语言模型常产生与视觉输入不符的幻觉文本,而新型无训练算法DeGF通过文本到图像生成模型在解码过程中提供自反馈,有效减少幻觉生成,在多个基准测试中超越现有最优方法。
English: Recent Large Vision-Language Models (LVLMs) often produce hallucinatory text responses, but a new training-free algorithm called Decoding with Generative Feedback (DeGF) leverages text-to-image generative models to provide self-feedback during decoding, effectively reducing hallucinations and outperforming state-of-the-art methods across multiple benchmarks.

Authors:Yuhao Cao, Yu Wang, Haoyao Chen
Title: Real-Time LiDAR Point Cloud Compression and Transmission for Resource-constrained Robots
Abstract:
LiDARs are widely used in autonomous robots due to their ability to provide accurate environment structural information. However, the large size of point clouds poses challenges in terms of data storage and transmission. In this paper, we propose a novel point cloud compression and transmission framework for resource-constrained robotic applications, called RCPCC. We iteratively fit the surface of point clouds with a similar range value and eliminate redundancy through their spatial relationships. Then, we use Shape-adaptive DCT (SA-DCT) to transform the unfit points and reduce the data volume by quantizing the transformed coefficients. We design an adaptive bitrate control strategy based on QoE as the optimization goal to control the quality of the transmitted point cloud. Experiments show that our framework achieves compression rates of 40$\times$ to 80$\times$ while maintaining high accuracy for downstream applications. our method significantly outperforms other baselines in terms of accuracy when the compression rate exceeds 70$\times$. Furthermore, in situations of reduced communication bandwidth, our adaptive bitrate control strategy demonstrates significant QoE improvements. The code will be available at https://github.com/HITSZ-NRSL/RCPCC.git.
Chinese: 本文提出了一种名为RCPCC的新型点云压缩框架,通过曲面拟合、SA-DCT变换和自适应码率控制,在资源受限的机器人应用中实现了40-80倍的压缩率,同时保持了下游应用的高精度需求。
English: This paper introduces RCPCC, a novel compression framework for LiDAR point clouds in resource-constrained robots, achieving 40-80x compression while maintaining high accuracy through surface fitting, SA-DCT transformation, and adaptive bitrate control.

Authors:Dongyuan Li, Satoshi Kosugi, Ying Zhang, Manabu Okumura, Feng Xia, Renhe Jiang
Title: Revisiting Dynamic Graph Clustering via Matrix Factorization
Abstract:
Dynamic graph clustering aims to detect and track time-varying clusters in dynamic graphs, revealing the evolutionary mechanisms of complex real-world dynamic systems. Matrix factorization-based methods are promising approaches for this task; however, these methods often struggle with scalability and can be time-consuming when applied to large-scale dynamic graphs. Moreover, they tend to lack robustness and are vulnerable to real-world noisy data. To address these issues, we make three key contributions. First, to improve scalability, we propose temporal separated matrix factorization, where a single matrix is divided into multiple smaller matrices for independent factorization, resulting in faster computation. Second, to improve robustness, we introduce bi-clustering regularization, which jointly optimizes graph embedding and clustering, thereby filtering out noisy features from the graph embeddings. Third, to further enhance effectiveness and efficiency, we propose selective embedding updating, where we update only the embeddings of dynamic nodes while the embeddings of static nodes are fixed among different timestamps. Experimental results on six synthetic and five real-world benchmarks demonstrate the scalability, robustness and effectiveness of our proposed method. Source code is available at https://github.com/Clearloveyuan/DyG-MF.
中文: 本研究提出了一种可扩展且鲁棒的动态图聚类方法,通过时序分离矩阵分解、双聚类正则化和选择性嵌入更新,有效处理大规模含噪声图数据。
English: This study introduces a scalable and robust dynamic graph clustering method using temporal separated matrix factorization, bi-clustering regularization, and selective embedding updating to efficiently handle large-scale noisy graphs.

Authors:Jian Xu, Sichun Luo, Xiangyu Chen, Haoming Huang, Hanxu Hou, Linqi Song
Title: RALLRec: Improving Retrieval Augmented Large Language Model Recommendation with Representation Learning
Abstract:
Large Language Models (LLMs) have been integrated into recommendation systems to enhance user behavior comprehension. The Retrieval Augmented Generation (RAG) technique is further incorporated into these systems to retrieve more relevant items and improve system performance. However, existing RAG methods rely primarily on textual semantics and often fail to incorporate the most relevant items, limiting the effectiveness of the systems. In this paper, we propose Representation learning for retrieval-Augmented Large Language model Recommendation (RALLRec). Specifically, we enhance textual semantics by prompting LLMs to generate more detailed item descriptions, followed by joint representation learning of textual and collaborative semantics, which are extracted by the LLM and recommendation models, respectively. Considering the potential time-varying characteristics of user interest, a simple yet effective reranking method is further introduced to capture the dynamics of user preference. We conducted extensive experiments on three real-world datasets, and the evaluation results validated the effectiveness of our method. Code is made public at https://github.com/JianXu95/RALLRec.
中文摘要:本文提出RALLRec方法,通过结合大语言模型生成的详细项目描述与协同过滤,并引入重排序技术来捕捉用户兴趣的动态变化,实验证明该方法能有效提升推荐系统性能。
English Summary: The paper introduces RALLRec, a method that enhances recommendation systems by combining detailed LLM-generated item descriptions with collaborative filtering and a reranking technique to adapt to dynamic user preferences, showing effectiveness in experiments.

Authors:Saptarshi Ghosh, Tianyu Jiang
Title: ConMeC: A Dataset for Metonymy Resolution with Common Nouns
Abstract:
Metonymy plays an important role in our daily communication. People naturally think about things using their most salient properties or commonly related concepts. For example, by saying "The bus decided to skip our stop today," we actually mean that the bus driver made the decision, not the bus. Prior work on metonymy resolution has mainly focused on named entities. However, metonymy involving common nouns (such as desk, baby, and school) is also a frequent and challenging phenomenon. We argue that NLP systems should be capable of identifying the metonymic use of common nouns in context. We create a new metonymy dataset ConMeC, which consists of 6,000 sentences, where each sentence is paired with a target common noun and annotated by humans to indicate whether that common noun is used metonymically or not in that context. We also introduce a chain-of-thought based prompting method for detecting metonymy using large language models (LLMs). We evaluate our LLM-based pipeline, as well as a supervised BERT model on our dataset and three other metonymy datasets. Our experimental results demonstrate that LLMs could achieve performance comparable to the supervised BERT model on well-defined metonymy categories, while still struggling with instances requiring nuanced semantic understanding. Our dataset is publicly available at: https://github.com/SaptGhosh/ConMeC.
Chinese: 本研究推出了用于检测普通名词转喻的新数据集ConMeC,并证明大型语言模型在识别转喻用法上可与监督式BERT模型相媲美,但在处理语义细微的实例时仍存在困难。
English: This study introduces ConMeC, a new dataset for detecting metonymy in common nouns, and demonstrates that large language models can perform comparably to supervised BERT models in identifying metonymic usage, though challenges remain with nuanced cases.

Authors:Seokwon Song, Taehyun Lee, Jaewoo Ahn, Jae Hyuk Sung, Gunhee Kim
Title: Is a Peeled Apple Still Red? Evaluating LLMs' Ability for Conceptual Combination with Property Type
Abstract:
Conceptual combination is a cognitive process that merges basic concepts, enabling the creation of complex expressions. During this process, the properties of combination (e.g., the whiteness of a peeled apple) can be inherited from basic concepts, newly emerge, or be canceled. However, previous studies have evaluated a limited set of properties and have not examined the generative process. To address this gap, we introduce the Conceptual Combination with Property Type dataset (CCPT), which consists of 12.3K annotated triplets of noun phrases, properties, and property types. Using CCPT, we establish three types of tasks to evaluate LLMs for conceptual combination thoroughly. Our key findings are threefold: (1) Our automatic metric grading property emergence and cancellation closely corresponds with human judgments. (2) LLMs, including OpenAI's o1, struggle to generate noun phrases which possess given emergent properties. (3) Our proposed method, inspired by cognitive psychology model that explains how relationships between concepts are formed, improves performances in all generative tasks. The dataset and experimental code are available at https://github.com/seokwon99/CCPT.git.
中文: 本研究引入CCPT数据集评估大语言模型的概念组合能力,发现模型在生成具有涌现属性的名词短语时存在困难,但通过认知心理学启发的改进方法能有效提升表现。
English: The study introduces the CCPT dataset to evaluate how large language models handle conceptual combination, finding they struggle with emergent properties but can be improved using cognitive psychology-inspired methods.

Authors:Krishna Sri Ipsit Mantri, Carola-Bibiane Schönlieb, Bruno Ribeiro, Chaim Baskin, Moshe Eliasof
Title: DiTASK: Multi-Task Fine-Tuning with Diffeomorphic Transformations
Abstract:
Pre-trained Vision Transformers now serve as powerful tools for computer vision. Yet, efficiently adapting them for multiple tasks remains a challenge that arises from the need to modify the rich hidden representations encoded by the learned weight matrices, without inducing interference between tasks. Current parameter-efficient methods like LoRA, which apply low-rank updates, force tasks to compete within constrained subspaces, ultimately degrading performance. We introduce DiTASK a novel Diffeomorphic Multi-Task Fine-Tuning approach that maintains pre-trained representations by preserving weight matrix singular vectors, while enabling task-specific adaptations through neural diffeomorphic transformations of the singular values. By following this approach, DiTASK enables both shared and task-specific feature modulations with minimal added parameters. Our theoretical analysis shows that DITASK achieves full-rank updates during optimization, preserving the geometric structure of pre-trained features, and establishing a new paradigm for efficient multi-task learning (MTL). Our experiments on PASCAL MTL and NYUD show that DiTASK achieves state-of-the-art performance across four dense prediction tasks, using 75% fewer parameters than existing methods. Our code is available [here](https://github.com/ipsitmantri/DiTASK).
中文: DiTASK提出了一种新颖的多任务微调方法,通过微分同胚变换高效适配预训练视觉变换器,以显著更少的参数实现了最先进的性能。
English: DiTASK introduces a novel multi-task fine-tuning method that uses diffeomorphic transformations to adapt pre-trained vision transformers efficiently, achieving state-of-the-art performance with significantly fewer parameters.

Authors:Xingjian Diao, Chunhui Zhang, Weiyi Wu, Zhongyu Ouyang, Peijun Qing, Ming Cheng, Soroush Vosoughi, Jiang Gui
Title: Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding
Abstract:
Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences, a crucial requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. The TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content, TWM optimizes the use of the model's limited capacity, enhancing its temporal modeling ability. This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across tasks such as video captioning, question answering, and video-text retrieval. By enhancing temporal modeling, TWM extends the capability of MFMs to handle complex, time-sensitive data effectively. Our code is available at https://github.com/xid32/NAACL_2025_TWM.
中文: 提出的时序工作记忆(TWM)模块通过选择性保留跨时间序列的任务相关信息,优化多模态基础模型的容量使用,显著提升了视频与音频处理任务的性能表现。
English: The proposed temporal working memory (TWM) module enhances multimodal foundation models by selectively retaining task-relevant information across temporal sequences, significantly improving performance in video and audio tasks through optimized capacity utilization.

Authors:Raza Imam, Asif Hanif, Jian Zhang, Khaled Waleed Dawoud, Yova Kementchedjhieva, Mohammad Yaqub
Title: Noise is an Efficient Learner for Zero-Shot Vision-Language Models
Abstract:
Recently, test-time adaptation has garnered attention as a method for tuning models without labeled data. The conventional modus operandi for adapting pre-trained vision-language models (VLMs) during test-time primarily focuses on tuning learnable prompts; however, this approach overlooks potential distribution shifts in the visual representations themselves. In this work, we address this limitation by introducing Test-Time Noise Tuning (TNT), a novel method for handling unpredictable shifts in the visual space. TNT leverages, for the first time, a noise adaptation strategy that optimizes learnable noise directly in the visual input space, enabling adaptive feature learning from a single test sample. We further introduce a novel approach for inter-view representation alignment by explicitly enforcing coherence in embedding distances, ensuring consistent feature representations across views. Combined with scaled logits and confident view selection at inference, TNT substantially enhances VLM generalization and calibration, achieving average gains of +7.38% on natural distributions benchmark and +0.80% on cross-dataset evaluations over zero-shot CLIP. These improvements lay a strong foundation for adaptive out-of-distribution handling.
中文: 本文提出测试时噪声调优(TNT)方法,通过在视觉输入空间优化可学习噪声来处理不可预测的分布偏移,相比零样本CLIP显著提升了视觉语言模型的泛化能力和校准性能。
English: This paper introduces Test-Time Noise Tuning (TNT), a novel method that optimizes learnable noise in the visual input space to handle unpredictable distribution shifts, significantly improving vision-language model generalization and calibration over zero-shot CLIP.

Authors:Jusheng Zhang, Yijia Fan, Kaitong Cai, Keze Wang
Title: Kolmogorov-Arnold Fourier Networks
Abstract:
Although Kolmogorov-Arnold based interpretable networks (KAN) have strong theoretical expressiveness, they face significant parameter explosion and high-frequency feature capture challenges in high-dimensional tasks. To address this issue, we propose the Kolmogorov-Arnold-Fourier Network (KAF), which effectively integrates trainable Random Fourier Features (RFF) and a novel hybrid GELU-Fourier activation mechanism to balance parameter efficiency and spectral representation capabilities. Our key technical contributions include: (1) merging KAN's dual-matrix structure through matrix association properties to substantially reduce parameters; (2) introducing learnable RFF initialization strategies to eliminate spectral distortion in high-dimensional approximation tasks; (3) implementing an adaptive hybrid activation function that progressively enhances frequency representation during the training process. Comprehensive experiments demonstrate the superiority of our KAF across various domains including vision, NLP, audio processing, and differential equation-solving tasks, effectively combining theoretical interpretability with practical utility and computational efficiency.
中文摘要:提出的Kolmogorov-Arnold-Fourier网络(KAF)通过融合可训练傅里叶特征和混合激活机制,有效解决了可解释网络在高维任务中的参数爆炸与高频特征捕获难题,在保持理论可解释性的同时实现了跨领域应用的卓越性能。
English Summary: The proposed Kolmogorov-Arnold-Fourier Network (KAF) overcomes parameter explosion and high-frequency limitations in interpretable networks by integrating trainable Fourier features and hybrid activation, achieving superior performance across multiple domains with enhanced efficiency.

Authors:Venktesh V, Vinay Setty
Title: FactIR: A Real-World Zero-shot Open-Domain Retrieval Benchmark for Fact-Checking
Abstract:
The field of automated fact-checking increasingly depends on retrieving web-based evidence to determine the veracity of claims in real-world scenarios. A significant challenge in this process is not only retrieving relevant information, but also identifying evidence that can both support and refute complex claims. Traditional retrieval methods may return documents that directly address claims or lean toward supporting them, but often struggle with more complex claims requiring indirect reasoning. While some existing benchmarks and methods target retrieval for fact-checking, a comprehensive real-world open-domain benchmark has been lacking. In this paper, we present a real-world retrieval benchmark FactIR, derived from Factiverse production logs, enhanced with human annotations. We rigorously evaluate state-of-the-art retrieval models in a zero-shot setup on FactIR and offer insights for developing practical retrieval systems for fact-checking. Code and data are available at https://github.com/factiverse/factIR.
中文: 本文提出了FactIR这一现实世界检索基准,用于自动化事实核查,评估先进模型在识别支持或反驳复杂主张证据方面的能力,弥补了现有方法的不足。
English: This paper introduces FactIR, a real-world retrieval benchmark for automated fact-checking that evaluates state-of-the-art models in identifying evidence to support or refute complex claims, addressing gaps in existing methods.

Authors:Julia Hornauer, Amir El-Ghoussani, Vasileios Belagiannis
Title: Revisiting Gradient-based Uncertainty for Monocular Depth Estimation
Abstract:
Monocular depth estimation, similar to other image-based tasks, is prone to erroneous predictions due to ambiguities in the image, for example, caused by dynamic objects or shadows. For this reason, pixel-wise uncertainty assessment is required for safety-critical applications to highlight the areas where the prediction is unreliable. We address this in a post hoc manner and introduce gradient-based uncertainty estimation for already trained depth estimation models. To extract gradients without depending on the ground truth depth, we introduce an auxiliary loss function based on the consistency of the predicted depth and a reference depth. The reference depth, which acts as pseudo ground truth, is in fact generated using a simple image or feature augmentation, making our approach simple and effective. To obtain the final uncertainty score, the derivatives w.r.t. the feature maps from single or multiple layers are calculated using back-propagation. We demonstrate that our gradient-based approach is effective in determining the uncertainty without re-training using the two standard depth estimation benchmarks KITTI and NYU. In particular, for models trained with monocular sequences and therefore most prone to uncertainty, our method outperforms related approaches. In addition, we publicly provide our code and models: https://github.com/jhornauer/GrUMoDepth
中文: 本文提出了一种基于梯度的单目深度预测不确定性估计方法,通过辅助损失函数和特征增强生成参考深度作为伪真值,无需重新训练模型即可有效识别预测不可靠区域。
English: This paper introduces a gradient-based uncertainty estimation method for monocular depth prediction models, which uses an auxiliary loss function and feature augmentation to generate reference depth as pseudo ground truth, effectively identifying unreliable areas without model retraining.

Authors:Jiabin Tang, Tianyu Fan, Chao Huang
Title: AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents
Abstract:
Large Language Model (LLM) Agents have demonstrated remarkable capabilities in task automation and intelligent decision-making, driving the widespread adoption of agent development frameworks such as LangChain and AutoGen. However, these frameworks predominantly serve developers with extensive technical expertise - a significant limitation considering that only 0.03 % of the global population possesses the necessary programming skills. This stark accessibility gap raises a fundamental question: Can we enable everyone, regardless of technical background, to build their own LLM agents using natural language alone? To address this challenge, we introduce AutoAgent-a Fully-Automated and highly Self-Developing framework that enables users to create and deploy LLM agents through Natural Language Alone. Operating as an autonomous Agent Operating System, AutoAgent comprises four key components: i) Agentic System Utilities, ii) LLM-powered Actionable Engine, iii) Self-Managing File System, and iv) Self-Play Agent Customization module. This lightweight yet powerful system enables efficient and dynamic creation and modification of tools, agents, and workflows without coding requirements or manual intervention. Beyond its code-free agent development capabilities, AutoAgent also serves as a versatile multi-agent system for General AI Assistants. Comprehensive evaluations on the GAIA benchmark demonstrate AutoAgent's effectiveness in generalist multi-agent tasks, surpassing existing state-of-the-art methods. Furthermore, AutoAgent's Retrieval-Augmented Generation (RAG)-related capabilities have shown consistently superior performance compared to many alternative LLM-based solutions.
中文:AutoAgent是一个全自动框架,允许用户仅通过自然语言创建和部署LLM智能体,突破了现有框架的技术壁垒,并在多智能体任务和检索增强生成能力上展现出卓越性能。
English: AutoAgent is a fully automated framework that enables users to create and deploy LLM agents using natural language alone, overcoming the technical barriers of existing frameworks and demonstrating superior performance in multi-agent tasks and RAG capabilities.

Authors:Jiabin Tang, Tianyu Fan, Chao Huang
Title: AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents
Abstract:
Large Language Model (LLM) Agents have demonstrated remarkable capabilities in task automation and intelligent decision-making, driving the widespread adoption of agent development frameworks such as LangChain and AutoGen. However, these frameworks predominantly serve developers with extensive technical expertise - a significant limitation considering that only 0.03 % of the global population possesses the necessary programming skills. This stark accessibility gap raises a fundamental question: Can we enable everyone, regardless of technical background, to build their own LLM agents using natural language alone? To address this challenge, we introduce AutoAgent-a Fully-Automated and highly Self-Developing framework that enables users to create and deploy LLM agents through Natural Language Alone. Operating as an autonomous Agent Operating System, AutoAgent comprises four key components: i) Agentic System Utilities, ii) LLM-powered Actionable Engine, iii) Self-Managing File System, and iv) Self-Play Agent Customization module. This lightweight yet powerful system enables efficient and dynamic creation and modification of tools, agents, and workflows without coding requirements or manual intervention. Beyond its code-free agent development capabilities, AutoAgent also serves as a versatile multi-agent system for General AI Assistants. Comprehensive evaluations on the GAIA benchmark demonstrate AutoAgent's effectiveness in generalist multi-agent tasks, surpassing existing state-of-the-art methods. Furthermore, AutoAgent's Retrieval-Augmented Generation (RAG)-related capabilities have shown consistently superior performance compared to many alternative LLM-based solutions.
中文:AutoAgent是一个全自动框架,允许用户仅通过自然语言创建和部署LLM智能体,突破了现有框架的技术壁垒,并在多智能体任务和检索增强生成能力上展现出卓越性能。
English: AutoAgent is a fully automated framework that enables users to create and deploy LLM agents using natural language alone, overcoming the technical barriers of existing frameworks and demonstrating superior performance in multi-agent tasks and RAG capabilities.

Authors:Paul Darm, Annalisa Riccardi
Title: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models
Abstract:
Robust alignment guardrails for large language models (LLMs) are becoming increasingly important with their widespread application. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignments and effectively steer model generations towards harmful AI coordination. Our method applies fine-grained interventions at specific attention heads, which we identify by probing each head in a simple binary choice task. We then show that interventions on these heads generalise to the open-ended generation setting, effectively circumventing safety guardrails. We demonstrate that intervening on a few attention heads is more effective than intervening on full layers or supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical fine-tuning. We also demonstrate that applying interventions in the negative direction can prevent a common jailbreak attack. Our results suggest that, at the attention head level, activations encode fine-grained linearly separable behaviours. Practically, the approach offers a straightforward methodology to steer large language model behaviour, which could be extended to diverse domains beyond safety, requiring fine-grained control over the model output. The code and datasets for this study can be found on https://github.com/PaulDrm/targeted_intervention.
中文摘要:本研究揭示通过对特定注意力头进行针对性干预,能有效突破大语言模型的安全防护机制并引导其生成有害内容,该方法比传统微调更具精准控制优势。
English Summary: This study reveals that targeted interventions on specific attention heads can effectively bypass LLM safety alignments and steer model behavior toward harmful outputs, offering a fine-grained control method that surpasses traditional fine-tuning approaches.

Authors:Hongye Liu, Ricardo Henao
Title: Learning to Substitute Words with Model-based Score Ranking
Abstract:
Smart word substitution aims to enhance sentence quality by improving word choices; however current benchmarks rely on human-labeled data. Since word choices are inherently subjective, ground-truth word substitutions generated by a small group of annotators are often incomplete and likely not generalizable. To circumvent this issue, we instead employ a model-based score (BARTScore) to quantify sentence quality, thus forgoing the need for human annotations. Specifically, we use this score to define a distribution for each word substitution, allowing one to test whether a substitution is statistically superior relative to others. In addition, we propose a loss function that directly optimizes the alignment between model predictions and sentence scores, while also enhancing the overall quality score of a substitution. Crucially, model learning no longer requires human labels, thus avoiding the cost of annotation while maintaining the quality of the text modified with substitutions. Experimental results show that the proposed approach outperforms both masked language models (BERT, BART) and large language models (GPT-4, LLaMA). The source code is available at https://github.com/Hyfred/Substitute-Words-with-Ranking.
中文摘要:本研究提出了一种基于模型的评分方法BARTScore,无需人工标注即可评估词语替换效果,通过优化预测与句子质量的匹配度,在实验中超越了现有语言模型的性能。
English Summary: This study introduces a model-based scoring method, BARTScore, to evaluate word substitutions without human annotations, optimizing alignment between predictions and sentence quality while outperforming existing language models.

Authors:Hongyu Ge, Longkun Hao, Zihui Xu, Zhenxin Lin, Bin Li, Shoujun Zhou, Hongjin Zhao, Yihang Liu
Title: ClinKD: Cross-Modal Clinical Knowledge Distiller For Multi-Task Medical Images
Abstract:
Medical Visual Question Answering (Med-VQA) represents a critical and challenging subtask within the general VQA domain. Despite significant advancements in general VQA, multimodal large language models (MLLMs) still exhibit substantial limitations when handling multi-task VQA scenarios. These limitations manifest through erroneous spatial localization and misinterpretation of medical images, which primarily arise from two fundamental issues: inadequate image-text alignment and insufficient domain-specified knowledge for medical applications. To address these issues, we introduce the Cross-Modal Clinical Knowledge Distiller (ClinKD), an innovative framework designed to enhance image-text alignment and establish more effective medical knowledge transformation mechanisms, which enables MLLMs to perform better even when lacking prior medical knowledge. Our extensive experimental evaluations demonstrate that the ClinKD achieves state-of-the-art performance on several datasets which are challenging for Med-VQA task. The results indicate that our approach not only significantly improves image-text alignment but also effectively enables MLLMs to adapt to the medical knowledge. The source code for ClinKD is available at: https://github.com/overloadedHenry/ClinKD.
中文:提出的跨模态临床知识蒸馏框架通过加强图像-文本对齐和医学知识迁移,有效解决了医学视觉问答中的关键问题,在多个数据集上实现了最优性能。
English: The proposed Cross-Modal Clinical Knowledge Distiller (ClinKD) framework addresses medical VQA limitations by enhancing image-text alignment and medical knowledge transfer, achieving state-of-the-art performance across multiple datasets.

Authors:Yu Shang, Chen Gao, Nian Li, Yong Li
Title: A Large-scale Dataset with Behavior, Attributes, and Content of Mobile Short-video Platform
Abstract:
Short-video platforms show an increasing impact on people's daily lives nowadays, with billions of active users spending plenty of time each day. The interactions between users and online platforms give rise to many scientific problems across computational social science and artificial intelligence. However, despite the rapid development of short-video platforms, currently there are serious shortcomings in existing relevant datasets on three aspects: inadequate user-video feedback, limited user attributes and lack of video content. To address these problems, we provide a large-scale dataset with rich user behavior, attributes and video content from a real mobile short-video platform. This dataset covers 10,000 voluntary users and 153,561 videos, and we conduct four-fold technical validations of the dataset. First, we verify the richness of the behavior and attribute data. Second, we confirm the representing ability of the content features. Third, we provide benchmarking results on recommendation algorithms with our dataset. Finally, we explore the filter bubble phenomenon on the platform using the dataset. We believe the dataset could support the broad research community, including but not limited to user modeling, social science, human behavior understanding, etc. The dataset and code is available at https://github.com/tsinghua-fib-lab/ShortVideo_dataset.
短视频平台深刻影响日常生活,但现有数据集在用户反馈、属性和内容方面存在不足,为此我们发布了大规模验证数据集以支持多领域研究。
Short-video platforms significantly influence daily life, yet existing datasets lack comprehensive user-video feedback, attributes, and content, prompting the release of a large-scale validated dataset to support diverse research fields.

Authors:Runchuan Zhu, Zinco Jiang, Jiang Wu, Zhipeng Ma, Jiahe Song, Fengshuo Bai, Dahua Lin, Lijun Wu, Conghui He
Title: GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation
Abstract:
Refusal-Aware Instruction Tuning (RAIT) aims to enhance Large Language Models (LLMs) by improving their ability to refuse responses to questions beyond their knowledge, thereby reducing hallucinations and improving reliability. Effective RAIT must address two key challenges: firstly, effectively reject unknown questions to minimize hallucinations; secondly, avoid over-refusal to ensure questions that can be correctly answered are not rejected, thereby maintain the helpfulness of LLM outputs. In this paper, we address the two challenges by deriving insightful observations from the gradient-based perspective, and proposing the Gradient-driven Refusal Aware Instruction Tuning Framework GRAIT: (1) employs gradient-driven sample selection to effectively minimize hallucinations and (2) introduces an adaptive weighting mechanism during fine-tuning to reduce the risk of over-refusal, achieving the balance between accurate refusals and maintaining useful responses. Experimental evaluations on open-ended and multiple-choice question answering tasks demonstrate that GRAIT significantly outperforms existing RAIT methods in the overall performance. The source code and data will be available at https://github.com/opendatalab/GRAIT .
Chinese: GRAIT框架通过梯度驱动的样本选择减少幻觉,并采用自适应权重机制防止过度拒绝,从而在问答任务中优于现有方法,提升了大型语言模型的性能。
English: GRAIT is a framework that enhances large language models by using gradient-driven sample selection to reduce hallucinations and an adaptive weighting mechanism to prevent over-refusal, achieving better performance in question-answering tasks than existing methods.

Authors:Vera Soboleva, Maksim Nakhodnov, Aibek Alanov
Title: Beyond Fine-Tuning: A Systematic Study of Sampling Techniques in Personalized Image Generation
Abstract:
Personalized text-to-image generation aims to create images tailored to user-defined concepts and textual descriptions. Balancing the fidelity of the learned concept with its ability for generation in various contexts presents a significant challenge. Existing methods often address this through diverse fine-tuning parameterizations and improved sampling strategies that integrate superclass trajectories during the diffusion process. While improved sampling offers a cost-effective, training-free solution for enhancing fine-tuned models, systematic analyses of these methods remain limited. Current approaches typically tie sampling strategies with fixed fine-tuning configurations, making it difficult to isolate their impact on generation outcomes. To address this issue, we systematically analyze sampling strategies beyond fine-tuning, exploring the impact of concept and superclass trajectories on the results. Building on this analysis, we propose a decision framework evaluating text alignment, computational constraints, and fidelity objectives to guide strategy selection. It integrates with diverse architectures and training approaches, systematically optimizing concept preservation, prompt adherence, and resource efficiency. The source code can be found at https://github.com/ControlGenAI/PersonGenSampler.
中文: 本研究系统分析了个性化文本到图像生成中的采样策略,提出了一个决策框架,通过评估文本对齐、计算约束和保真度来优化不同架构下的概念保持与资源效率。
English: This study systematically analyzes sampling strategies in personalized text-to-image generation, proposing a decision framework that evaluates text alignment, computational constraints, and fidelity to optimize concept preservation and resource efficiency across various architectures.

Authors:Lu Chen, Yizhou Wang, Shixiang Tang, Qianhong Ma, Tong He, Wanli Ouyang, Xiaowei Zhou, Hujun Bao, Sida Peng
Title: EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds
Abstract:
Learning an agent model that behaves like humans-capable of jointly perceiving the environment, predicting the future, and taking actions from a first-person perspective-is a fundamental challenge in computer vision. Existing methods typically train separate models for these abilities, which fail to capture their intrinsic relationships and prevent them from learning from each other. Inspired by how humans learn through the perception-action loop, we propose EgoAgent, a unified agent model that simultaneously learns to represent, predict, and act within a single transformer. EgoAgent explicitly models the causal and temporal dependencies among these abilities by formulating the task as an interleaved sequence of states and actions. It further introduces a joint embedding-action-prediction architecture with temporally asymmetric predictor and observer branches, enabling synergistic optimization across all three capabilities. Comprehensive evaluations of EgoAgent on representative tasks such as image classification, egocentric future state prediction, and 3D human motion prediction demonstrate the superiority of our method. The code and trained models will be publicly available at https://github.com/zju3dv/EgoAgent.
Chinese: EgoAgent是一种基于Transformer的统一智能体模型,通过单一架构同时学习感知、预测和行动,并通过显式建模它们之间的因果与时间依赖关系,在多项任务中展现出卓越性能。
English: EgoAgent is a unified transformer-based model that jointly learns perception, prediction, and action through a single architecture, demonstrating superior performance across multiple tasks by explicitly modeling their causal and temporal dependencies.

Authors:Yan Li, Zhulin Wang, Jing Liu, Lei Guo, Philippe Fournier-Viger, Youxi Wu, Xindong Wu
Title: NSPG-Miner: Mining Repetitive Negative Sequential Patterns
Abstract:
Sequential pattern mining (SPM) with gap constraints (or repetitive SPM or tandem repeat discovery in bioinformatics) can find frequent repetitive subsequences satisfying gap constraints, which are called positive sequential patterns with gap constraints (PSPGs). However, classical SPM with gap constraints cannot find the frequent missing items in the PSPGs. To tackle this issue, this paper explores negative sequential patterns with gap constraints (NSPGs). We propose an efficient NSPG-Miner algorithm that can mine both frequent PSPGs and NSPGs simultaneously. To effectively reduce candidate patterns, we propose a pattern join strategy with negative patterns which can generate both positive and negative candidate patterns at the same time. To calculate the support (frequency of occurrence) of a pattern in each sequence, we explore a NegPair algorithm that employs a key-value pair array structure to deal with the gap constraints and the negative items simultaneously and can avoid redundant rescanning of the original sequence, thus improving the efficiency of the algorithm. To report the performance of NSPG-Miner, 11 competitive algorithms and 11 datasets are employed. The experimental results not only validate the effectiveness of the strategies adopted by NSPG-Miner, but also verify that NSPG-Miner can discover more valuable information than the state-of-the-art algorithms. Algorithms and datasets can be downloaded from https://github.com/wuc567/Pattern-Mining/tree/master/NSPG-Miner.
Chinese: 本文提出了NSPG-Miner算法,能高效挖掘带间隔约束的正负序列模式,通过模式连接策略和NegPair算法优化候选模式生成与支持度计算,实验证明其性能优于现有方法且能发现更多有价值信息。
English: This paper introduces the NSPG-Miner algorithm, which efficiently mines both positive and negative sequential patterns with gap constraints, outperforming existing methods by discovering more valuable information through optimized strategies like pattern joining and the NegPair algorithm for support calculation.

Authors:Jen-tse Huang, Yuhang Yan, Linqi Liu, Yixin Wan, Wenxuan Wang, Kai-Wei Chang, Michael R. Lyu
Title: Where Fact Ends and Fairness Begins: Redefining AI Bias Evaluation through Cognitive Biases
Abstract:
Recent failures such as Google Gemini generating people of color in Nazi-era uniforms illustrate how AI outputs can be factually plausible yet socially harmful. AI models are increasingly evaluated for "fairness," yet existing benchmarks often conflate two fundamentally different dimensions: factual correctness and normative fairness. A model may generate responses that are factually accurate but socially unfair, or conversely, appear fair while distorting factual reality. We argue that identifying the boundary between fact and fair is essential for meaningful fairness evaluation. We introduce Fact-or-Fair, a benchmark with (i) objective queries aligned with descriptive, fact-based judgments, and (ii) subjective queries aligned with normative, fairness-based judgments. Our queries are constructed from 19 statistics and are grounded in cognitive psychology, drawing on representativeness bias, attribution bias, and ingroup-outgroup bias to explain why models often misalign fact and fairness. Experiments across ten frontier models reveal different levels of fact-fair trade-offs. By reframing fairness evaluation, we provide both a new theoretical lens and a practical benchmark to advance the responsible model assessments. Our test suite is publicly available at https://github.com/uclanlp/Fact-or-Fair.
中文摘要:该摘要强调区分AI评估中事实准确性与规范性公平的重要性,并介绍了Fact-or-Fair基准测试,通过基于心理偏见的客观和主观查询来检验模型在这两个维度的表现。
English Summary: The abstract discusses the need to distinguish between factual accuracy and normative fairness in AI evaluations, introducing the Fact-or-Fair benchmark to address this gap by testing models on both objective and subjective queries based on psychological biases.

Authors:Jen-tse Huang, Yuhang Yan, Linqi Liu, Yixin Wan, Wenxuan Wang, Kai-Wei Chang, Michael R. Lyu
Title: Where Fact Ends and Fairness Begins: Redefining AI Bias Evaluation through Cognitive Biases
Abstract:
Recent failures such as Google Gemini generating people of color in Nazi-era uniforms illustrate how AI outputs can be factually plausible yet socially harmful. AI models are increasingly evaluated for "fairness," yet existing benchmarks often conflate two fundamentally different dimensions: factual correctness and normative fairness. A model may generate responses that are factually accurate but socially unfair, or conversely, appear fair while distorting factual reality. We argue that identifying the boundary between fact and fair is essential for meaningful fairness evaluation. We introduce Fact-or-Fair, a benchmark with (i) objective queries aligned with descriptive, fact-based judgments, and (ii) subjective queries aligned with normative, fairness-based judgments. Our queries are constructed from 19 statistics and are grounded in cognitive psychology, drawing on representativeness bias, attribution bias, and ingroup-outgroup bias to explain why models often misalign fact and fairness. Experiments across ten frontier models reveal different levels of fact-fair trade-offs. By reframing fairness evaluation, we provide both a new theoretical lens and a practical benchmark to advance the responsible model assessments. Our test suite is publicly available at https://github.com/uclanlp/Fact-or-Fair.
中文摘要:该摘要强调区分AI评估中事实准确性与规范性公平的重要性,并介绍了Fact-or-Fair基准测试,通过基于心理偏见的客观和主观查询来检验模型在这两个维度的表现。
English Summary: The abstract discusses the need to distinguish between factual accuracy and normative fairness in AI evaluations, introducing the Fact-or-Fair benchmark to address this gap by testing models on both objective and subjective queries based on psychological biases.

Authors:Yuhui Zeng, Haoxiang Wu, Wenjie Nie, Xiawu Zheng, Guangyao Chen, Yunhang Shen, Jun Peng, Yonghong Tian, Rongrong Ji
Title: From Objects to Events: Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning
Abstract:
Current object detectors excel at entity localization and classification, yet exhibit inherent limitations in event recognition capabilities. This deficiency arises from their architecture's emphasis on discrete object identification rather than modeling the compositional reasoning, inter-object correlations, and contextual semantics essential for comprehensive event understanding. To address this challenge, we present a novel framework that expands the capability of standard object detectors beyond mere object recognition to complex event understanding through LLM-guided symbolic reasoning. Our key innovation lies in bridging the semantic gap between object detection and event understanding without requiring expensive task-specific training. The proposed plug-and-play framework interfaces with any open-vocabulary detector while extending their inherent capabilities across architectures. At its core, our approach combines (i) a symbolic regression mechanism exploring relationship patterns among detected entities and (ii) a LLM-guided strategically guiding the search toward meaningful expressions. These discovered symbolic rules transform low-level visual perception into interpretable event understanding, providing a transparent reasoning path from objects to events with strong transferability across domains.We compared our training-free framework against specialized event recognition systems across diverse application domains. Experiments demonstrate that our framework enhances multiple object detector architectures to recognize complex events such as illegal fishing activities (75% AUROC, +8.36% improvement), construction safety violations (+15.77%), and abnormal crowd behaviors (+23.16%). Code is available at \href{https://github.com/MAC-AutoML/SymbolicDet}{here}.
该框架通过LLM引导的符号推理增强物体检测器,无需额外训练即可实现跨领域的复杂事件理解,在多种应用场景中展现出显著的性能提升。
This framework enhances object detectors with LLM-guided symbolic reasoning to achieve complex event understanding across domains without additional training, demonstrating significant performance improvements in diverse applications.

Authors:Rafał Karczewski, Markus Heinonen, Vikas Garg
Title: Devil is in the Details: Density Guidance for Detail-Aware Generation with Flow Models
Abstract:
Diffusion models have emerged as a powerful class of generative models, capable of producing high-quality images by mapping noise to a data distribution. However, recent findings suggest that image likelihood does not align with perceptual quality: high-likelihood samples tend to be smooth, while lower-likelihood ones are more detailed. Controlling sample density is thus crucial for balancing realism and detail. In this paper, we analyze an existing technique, Prior Guidance, which scales the latent code to influence image detail. We introduce score alignment, a condition that explains why this method works and show that it can be tractably checked for any continuous normalizing flow model. We then propose Density Guidance, a principled modification of the generative ODE that enables exact log-density control during sampling. Finally, we extend Density Guidance to stochastic sampling, ensuring precise log-density control while allowing controlled variation in structure or fine details. Our experiments demonstrate that these techniques provide fine-grained control over image detail without compromising sample quality. Code is available at https://github.com/Aalto-QuML/density-guidance.
中文: 本文提出密度引导方法,通过调整生成过程精确控制样本密度,从而在不牺牲图像质量的前提下实现对图像细节的精细化调控。
English: This paper introduces Density Guidance, a method that enables precise control over image detail by modifying the generative process to regulate sample density, ensuring high-quality outputs without sacrificing perceptual quality.

Authors:Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu
Title: The Curse of Depth in Large Language Models
Abstract:
In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Across a wide range of model sizes (130M to 7B), our experiments show that LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at \href{https://github.com/lmsdss/LayerNorm-Scaling}{LayerNorm-Scaling}.
中文: 本文揭示了大型语言模型中的“深度诅咒”现象,指出其源于预层归一化导致的输出方差爆炸,并提出层归一化缩放方法,有效缓解该问题并提升不同规模模型的训练效果。
English: This paper identifies the Curse of Depth in LLMs, attributing it to Pre-Layer Normalization's output variance explosion, and proposes LayerNorm Scaling to effectively mitigate this issue and enhance training performance across various model sizes.

Authors:Yuwen Liao, Muqing Cao, Xinhang Xu, Lihua Xie
Title: AToM: Adaptive Theory-of-Mind-Based Human Motion Prediction in Long-Term Human-Robot Interactions
Abstract:
Humans learn from observations and experiences to adjust their behaviours towards better performance. Interacting with such dynamic humans is challenging, as the robot needs to predict the humans accurately for safe and efficient operations. Long-term interactions with dynamic humans have not been extensively studied by prior works. We propose an adaptive human prediction model based on the Theory-of-Mind (ToM), a fundamental social-cognitive ability that enables humans to infer others' behaviours and intentions. We formulate the human internal belief about others using a game-theoretic model, which predicts the future motions of all agents in a navigation scenario. To estimate an evolving belief, we use an Unscented Kalman Filter to update the behavioural parameters in the human internal model. Our formulation provides unique interpretability to dynamic human behaviours by inferring how the human predicts the robot. We demonstrate through long-term experiments in both simulations and real-world settings that our prediction effectively promotes safety and efficiency in downstream robot planning. Code will be available at https://github.com/centiLinda/AToM-human-prediction.git.
中文摘要:本研究基于心理理论和博弈论提出了一种自适应人类预测模型,通过动态推断人类信念与行为,有效提升了机器人在长期交互中的安全性和效率。
English Summary: The study introduces an adaptive human prediction model using Theory-of-Mind and game theory to enhance robot safety and efficiency in long-term interactions by dynamically inferring human beliefs and behaviors.

Authors:Yixiong Jing, Wei Lin, Brian Sheil, Sinan Acikgoz
Title: A 3D Multimodal Feature for Infrastructure Anomaly Detection
Abstract:
Ageing structures require periodic inspections to identify structural defects. Previous work has used geometric distortions to locate cracks in synthetic masonry bridge point clouds but has struggled to detect small cracks. To address this limitation, this study proposes a novel 3D multimodal feature, 3DMulti-FPFHI, that combines a customized Fast Point Feature Histogram (FPFH) with an intensity feature. This feature is integrated into the PatchCore anomaly detection algorithm and evaluated through statistical and parametric analyses. The method is further evaluated using point clouds of a real masonry arch bridge and a full-scale experimental model of a concrete tunnel. Results show that the 3D intensity feature enhances inspection quality by improving crack detection; it also enables the identification of water ingress which introduces intensity anomalies. The 3DMulti-FPFHI outperforms FPFH and a state-of-the-art multimodal anomaly detection method. The potential of the method to address diverse infrastructure anomaly detection scenarios is highlighted by the minimal requirements for data compared to learning-based methods. The code and related point cloud dataset are available at https://github.com/Jingyixiong/3D-Multi-FPFHI.
中文摘要:本研究提出了一种名为3DMulti-FPFHI的新型三维多模态特征,通过将定制化快速点特征直方图与强度特征相结合,有效提升了老化结构中微小裂缝的检测能力,并在识别渗水异常方面优于现有方法。
English Summary: This study introduces a novel 3D multimodal feature called 3DMulti-FPFHI, which enhances crack detection in aging structures by combining customized Fast Point Feature Histogram with intensity features, showing superior performance in identifying small cracks and water ingress compared to existing methods.

Authors:Enquan Yang, Peng Xing, Hanyang Sun, Wenbo Guo, Yuanwei Ma, Zechao Li, Dan Zeng
Title: 3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly
Abstract:
Industrial anomaly detection achieves progress thanks to datasets such as MVTec-AD and VisA. However, they suffer from limitations in terms of the number of defect samples, types of defects, and availability of real-world scenes. These constraints inhibit researchers from further exploring the performance of industrial detection with higher accuracy. To this end, we propose a new large-scale anomaly detection dataset called 3CAD, which is derived from real 3C production lines. Specifically, the proposed 3CAD includes eight different types of manufactured parts, totaling 27,039 high-resolution images labeled with pixel-level anomalies. The key features of 3CAD are that it covers anomalous regions of different sizes, multiple anomaly types, and the possibility of multiple anomalous regions and multiple anomaly types per anomaly image. This is the largest and first anomaly detection dataset dedicated to 3C product quality control for community exploration and development. Meanwhile, we introduce a simple yet effective framework for unsupervised anomaly detection: a Coarse-to-Fine detection paradigm with Recovery Guidance (CFRG). To detect small defect anomalies, the proposed CFRG utilizes a coarse-to-fine detection paradigm. Specifically, we utilize a heterogeneous distillation model for coarse localization and then fine localization through a segmentation model. In addition, to better capture normal patterns, we introduce recovery features as guidance. Finally, we report the results of our CFRG framework and popular anomaly detection methods on the 3CAD dataset, demonstrating strong competitiveness and providing a highly challenging benchmark to promote the development of the anomaly detection field. Data and code are available: https://github.com/EnquanYang2022/3CAD.
中文:3CAD数据集通过提供来自3C生产线的27,039张高分辨率图像,包含多样化缺陷类型和像素级标注,解决了工业异常检测的局限性;同时提出的CFRG框架采用由粗到精的检测方法结合恢复指导,有效提升小缺陷识别能力并建立了具有竞争力的性能基准。
English: The 3CAD dataset addresses limitations in industrial anomaly detection by offering a large-scale collection of 27,039 high-resolution images from 3C production lines, featuring diverse defect types and pixel-level annotations, while the proposed CFRG framework introduces a coarse-to-fine detection method with recovery guidance to enhance small defect identification and establish competitive benchmarks.

Authors:Zherui Li, Houcheng Jiang, Hao Chen, Baolong Bi, Zhenhong Zhou, Fei Sun, Junfeng Fang, Xiang Wang
Title: Reinforced Lifelong Editing for Language Models
Abstract:
Large language models (LLMs) acquire information from pre-training corpora, but their stored knowledge can become inaccurate or outdated over time. Model editing addresses this challenge by modifying model parameters without retraining, and prevalent approaches leverage hypernetworks to generate these parameter updates. However, they face significant challenges in lifelong editing due to their incompatibility with LLM parameters that dynamically change during the editing process. To address this, we observed that hypernetwork-based lifelong editing aligns with reinforcement learning modeling and proposed RLEdit, an RL-based editing method. By treating editing losses as rewards and optimizing hypernetwork parameters at the full knowledge sequence level, we enable it to precisely capture LLM changes and generate appropriate parameter updates. Our extensive empirical evaluation across several LLMs demonstrates that RLEdit outperforms existing methods in lifelong editing with superior effectiveness and efficiency, achieving a 59.24% improvement while requiring only 2.11% of the time compared to most approaches. Our code is available at: https://github.com/zhrli324/RLEdit.
中文: 大语言模型的知识会随时间过时,RLEdit通过强化学习方法优化超网络参数,能精准捕捉模型变化,在持续编辑中实现59.24%的效果提升且仅需2.11%的时间。
English: Large language models face challenges with outdated knowledge, and RLEdit, a reinforcement learning-based editing method, significantly improves lifelong editing by optimizing hypernetwork parameters to capture model changes efficiently.

Authors:Yue Pan, Xingguang Zhong, Liren Jin, Louis Wiesmann, Marija Popović, Jens Behley, Cyrill Stachniss
Title: PINGS: Gaussian Splatting Meets Distance Fields within a Point-Based Implicit Neural Map
Abstract:
Robots benefit from high-fidelity reconstructions of their environment, which should be geometrically accurate and photorealistic to support downstream tasks. While this can be achieved by building distance fields from range sensors and radiance fields from cameras, realising scalable incremental mapping of both fields consistently and at the same time with high quality is challenging. In this paper, we propose a novel map representation that unifies a continuous signed distance field and a Gaussian splatting radiance field within an elastic and compact point-based implicit neural map. By enforcing geometric consistency between these fields, we achieve mutual improvements by exploiting both modalities. We present a novel LiDAR-visual SLAM system called PINGS using the proposed map representation and evaluate it on several challenging large-scale datasets. Experimental results demonstrate that PINGS can incrementally build globally consistent distance and radiance fields encoded with a compact set of neural points. Compared to state-of-the-art methods, PINGS achieves superior photometric and geometric rendering at novel views by constraining the radiance field with the distance field. Furthermore, by utilizing dense photometric cues and multi-view consistency from the radiance field, PINGS produces more accurate distance fields, leading to improved odometry estimation and mesh reconstruction. We also provide an open-source implementation of PING at: https://github.com/PRBonn/PINGS.
中文: 本文提出PINGS系统,通过将符号距离场与高斯溅射辐射场统一于基于点的隐式神经地图中,实现了几何一致的高质量环境重建,在新型视角下展现出卓越的光度与几何渲染效果。
English: This paper introduces PINGS, a LiDAR-visual SLAM system that unifies signed distance and Gaussian splatting radiance fields in a point-based neural map to achieve globally consistent, high-quality geometric and photorealistic reconstructions with mutual improvements between modalities.

Authors:Kaizhen Zhu, Mokai Pan, Yuexin Ma, Yanwei Fu, Jingyi Yu, Jingya Wang, Ye Shi
Title: UniDB: A Unified Diffusion Bridge Framework via Stochastic Optimal Control
Abstract:
Recent advances in diffusion bridge models leverage Doob's $h$-transform to establish fixed endpoints between distributions, demonstrating promising results in image translation and restoration tasks. However, these approaches frequently produce blurred or excessively smoothed image details and lack a comprehensive theoretical foundation to explain these shortcomings. To address these limitations, we propose UniDB, a unified framework for diffusion bridges based on Stochastic Optimal Control (SOC). UniDB formulates the problem through an SOC-based optimization and derives a closed-form solution for the optimal controller, thereby unifying and generalizing existing diffusion bridge models. We demonstrate that existing diffusion bridges employing Doob's $h$-transform constitute a special case of our framework, emerging when the terminal penalty coefficient in the SOC cost function tends to infinity. By incorporating a tunable terminal penalty coefficient, UniDB achieves an optimal balance between control costs and terminal penalties, substantially improving detail preservation and output quality. Notably, UniDB seamlessly integrates with existing diffusion bridge models, requiring only minimal code modifications. Extensive experiments across diverse image restoration tasks validate the superiority and adaptability of the proposed framework. Our code is available at https://github.com/UniDB-SOC/UniDB/.
中文摘要:提出的UniDB框架通过随机最优控制统一扩散桥模型,通过优化终端惩罚系数显著提升图像细节保留能力和输出质量。
English Summary: The proposed UniDB framework utilizes Stochastic Optimal Control to unify diffusion bridge models, enhancing image detail preservation and quality by optimizing terminal penalty coefficients.

Authors:Donghui Feng, Zhengxue Cheng, Shen Wang, Ronghua Wu, Hongwei Hu, Guo Lu, Li Song
Title: Linear Attention Modeling for Learned Image Compression
Abstract:
Recent years, learned image compression has made tremendous progress to achieve impressive coding efficiency. Its coding gain mainly comes from non-linear neural network-based transform and learnable entropy modeling. However, most studies focus on a strong backbone, and few studies consider a low complexity design. In this paper, we propose LALIC, a linear attention modeling for learned image compression. Specially, we propose to use Bi-RWKV blocks, by utilizing the Spatial Mix and Channel Mix modules to achieve more compact feature extraction, and apply the Conv based Omni-Shift module to adapt to two-dimensional latent representation. Furthermore, we propose a RWKV-based Spatial-Channel ConTeXt model (RWKV-SCCTX), that leverages the Bi-RWKV to modeling the correlation between neighboring features effectively. To our knowledge, our work is the first work to utilize efficient Bi-RWKV models with linear attention for learned image compression. Experimental results demonstrate that our method achieves competitive RD performances by outperforming VTM-9.1 by -15.26%, -15.41%, -17.63% in BD-rate on Kodak, CLIC and Tecnick datasets. The code is available at https://github.com/sjtu-medialab/RwkvCompress .
中文摘要:本文提出LALIC方法,通过线性注意力建模和双向RWKV模块实现高效特征提取,结合新型上下文模型在图像压缩中取得了优异的率失真性能。
English Summary: This paper introduces LALIC, a learned image compression method using linear attention modeling with Bi-RWKV blocks and a novel RWKV-based context model to achieve efficient feature extraction and competitive rate-distortion performance.

Authors:Sébastien Mestrallet, Christophe Bourcier, Franck Ledoux
Title: Validity-first automatic polycube labeling for CAD models
Abstract:
For many simulation codes, block-structured hex meshes remain preferred while their automatic generation is unsolved. We investigate the usage of a polycube-based approach. More specifically, we focus on the labeling stage, which consists in assigning each boundary facet to one of the 6 signed principal axis. Similar works are confronted with 2 challenges: over-constraining validity criteria, and the conflated processing of validity criteria with quality metrics. We tackle these obstacles with automatic routines based on semi-global labeling operators. Our approach is successfully tested on CAD models, which are of interest for many numerical simulation problems.
中文摘要:本研究针对块结构六面体网格自动生成中的难题,提出一种基于立方体映射的方法,通过半全局标记操作解决约束过严的有效性标准与质量指标混淆问题,并在数值模拟相关的CAD模型上成功验证。
English Summary: This study addresses challenges in automatic block-structured hex mesh generation by introducing a polycube-based method that uses semi-global labeling operators to overcome restrictive validity criteria and conflated quality metrics, successfully tested on CAD models for numerical simulations.

Authors:Seyedamirhossein Talebi, Kaixiong Zhou
Title: Graph Neural Networks for Efficient AC Power Flow Prediction in Power Grids
Abstract:
This paper proposes a novel approach using Graph Neural Networks (GNNs) to solve the AC Power Flow problem in power grids. AC OPF is essential for minimizing generation costs while meeting the operational constraints of the grid. Traditional solvers struggle with scalability, especially in large systems with renewable energy sources. Our approach models the power grid as a graph, where buses are nodes and transmission lines are edges. We explore different GNN architectures, including GCN, GAT, SAGEConv, and GraphConv to predict AC power flow solutions efficiently. Our experiments on IEEE test systems show that GNNs can accurately predict power flow solutions and scale to larger systems, outperforming traditional solvers in terms of computation time. This work highlights the potential of GNNs for real-time power grid management, with future plans to apply the model to even larger grid systems.
中文: 本文提出了一种利用图神经网络解决交流潮流问题的新方法,在IEEE测试系统中展现出优于传统求解器的计算效率和扩展性。
English: This paper introduces a Graph Neural Network (GNN) approach for solving the AC Power Flow problem, demonstrating its superior scalability and computational efficiency over traditional methods across various IEEE test systems.

Authors:Miroslav Štrupl, Oleg Szehr, Francesco Faccio, Dylan R. Ashley, Rupesh Kumar Srivastava, Jürgen Schmidhuber
Title: On the Convergence and Stability of Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning, and Online Decision Transformers
Abstract:
This article provides a rigorous analysis of convergence and stability of Episodic Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning and Online Decision Transformers. These algorithms performed competitively across various benchmarks, from games to robotic tasks, but their theoretical understanding is limited to specific environmental conditions. This work initiates a theoretical foundation for algorithms that build on the broad paradigm of approaching reinforcement learning through supervised learning or sequence modeling. At the core of this investigation lies the analysis of conditions on the underlying environment, under which the algorithms can identify optimal solutions. We also assess whether emerging solutions remain stable in situations where the environment is subject to tiny levels of noise. Specifically, we study the continuity and asymptotic convergence of command-conditioned policies, values and the goal-reaching objective depending on the transition kernel of the underlying Markov Decision Process. We demonstrate that near-optimal behavior is achieved if the transition kernel is located in a sufficiently small neighborhood of a deterministic kernel. The mentioned quantities are continuous (with respect to a specific topology) at deterministic kernels, both asymptotically and after a finite number of learning cycles. The developed methods allow us to present the first explicit estimates on the convergence and stability of policies and values in terms of the underlying transition kernels. On the theoretical side we introduce a number of new concepts to reinforcement learning, like working in segment spaces, studying continuity in quotient topologies and the application of the fixed-point theory of dynamical systems. The theoretical study is accompanied by a detailed investigation of example environments and numerical experiments.
中文: 本文为基于监督学习的强化学习算法建立了理论基础,证明了在接近确定性转移核的环境中,这些算法能够实现近似最优性能并保持稳定性。
English: This article establishes a theoretical foundation for reinforcement learning algorithms based on supervised learning, demonstrating their near-optimal performance and stability in environments with minimal noise near deterministic transition kernels.

Authors:Diego Calanzone, Pierluca D'Oro, Pierre-Luc Bacon
Title: Mol-MoE: Training Preference-Guided Routers for Molecule Generation
Abstract:
Recent advances in language models have enabled framing molecule generation as sequence modeling. However, existing approaches often rely on single-objective reinforcement learning, limiting their applicability to real-world drug design, where multiple competing properties must be optimized. Traditional multi-objective reinforcement learning (MORL) methods require costly retraining for each new objective combination, making rapid exploration of trade-offs impractical. To overcome these limitations, we introduce Mol-MoE, a mixture-of-experts (MoE) architecture that enables efficient test-time steering of molecule generation without retraining. Central to our approach is a preference-based router training objective that incentivizes the router to combine experts in a way that aligns with user-specified trade-offs. This provides improved flexibility in exploring the chemical property space at test time, facilitating rapid trade-off exploration. Benchmarking against state-of-the-art methods, we show that Mol-MoE achieves superior sample quality and steerability.
Chinese: Mol-MoE采用专家混合架构,无需重新训练即可在测试时灵活引导分子生成,通过将专家组合与用户指定的权衡对齐,实现了卓越的样本质量和可操控性。
English: Mol-MoE introduces a mixture-of-experts architecture that enables flexible, test-time steering of molecule generation without retraining, achieving superior sample quality and steerability by aligning expert combinations with user-specified trade-offs.

Authors:Xiao Wang, Qingquan Yang, Fuling Wang, Qiang Chen, Wentao Wu, Yu Jin, Jingtao Jiang, Liye Jin, Bo Jiang, Dengdi Sun, Wanli Lv, Meiwen Chen, Zehua Chen, Guosheng Xu, Jin Tang
Title: XiHeFusion: Harnessing Large Language Models for Science Communication in Nuclear Fusion
Abstract:
Nuclear fusion is one of the most promising ways for humans to obtain infinite energy. Currently, with the rapid development of artificial intelligence, the mission of nuclear fusion has also entered a critical period of its development. How to let more people to understand nuclear fusion and join in its research is one of the effective means to accelerate the implementation of fusion. This paper proposes the first large model in the field of nuclear fusion, XiHeFusion, which is obtained through supervised fine-tuning based on the open-source large model Qwen2.5-14B. We have collected multi-source knowledge about nuclear fusion tasks to support the training of this model, including the common crawl, eBooks, arXiv, dissertation, etc. After the model has mastered the knowledge of the nuclear fusion field, we further used the chain of thought to enhance its logical reasoning ability, making XiHeFusion able to provide more accurate and logical answers. In addition, we propose a test questionnaire containing 180+ questions to assess the conversational ability of this science popularization large model. Extensive experimental results show that our nuclear fusion dialogue model, XiHeFusion, can perform well in answering science popularization knowledge. The pre-trained XiHeFusion model is released on https://github.com/Event-AHU/XiHeFusion.
Chinese: 本文提出了核聚变领域的首个大模型羲和聚变,通过基于Qwen2.5-14B的监督微调和思维链增强,能够为科学普及提供准确且逻辑清晰的回答。
English: This paper introduces XiHeFusion, the first large model in nuclear fusion, developed by fine-tuning Qwen2.5-14B with multi-source data and enhanced reasoning to provide accurate, logical answers for science popularization.

Authors:Jiale Dong, Wenqi Lou, Zhendong Zheng, Yunji Qin, Lei Gong, Chao Wang, Xuehai Zhou
Title: UbiMoE: A Ubiquitous Mixture-of-Experts Vision Transformer Accelerator With Hybrid Computation Pattern on FPGA
Abstract:
Compared to traditional Vision Transformers (ViT), Mixture-of-Experts Vision Transformers (MoE-ViT) are introduced to scale model size without a proportional increase in computational complexity, making them a new research focus. Given the high performance and reconfigurability, FPGA-based accelerators for MoE-ViT emerge, delivering substantial gains over general-purpose processors. However, existing accelerators often fall short of fully exploring the design space, leading to suboptimal trade-offs between resource utilization and performance. To overcome this problem, we introduce UbiMoE, a novel end-to-end FPGA accelerator tailored for MoE-ViT. Leveraging the unique computational and memory access patterns of MoE-ViTs, we develop a latency-optimized streaming attention kernel and a resource-efficient reusable linear kernel, effectively balancing performance and resource consumption. To further enhance design efficiency, we propose a two-stage heuristic search algorithm that optimally tunes hardware parameters for various FPGA resource constraints. Compared to state-of-the-art (SOTA) FPGA designs, UbiMoE achieves 1.34x and 3.35x throughput improvements for MoE-ViT on Xilinx ZCU102 and Alveo U280 platforms, respectively, while enhancing energy efficiency by 1.75x and 1.54x. Our implementation is available at https://github.com/DJ000011/UbiMoE.
Chinese: UbiMoE是一种专为MoE-ViT设计的新型FPGA加速器,通过优化内核和启发式搜索算法,在提升吞吐量和能效方面显著优于现有方案。
English: UbiMoE is an innovative FPGA accelerator for MoE-ViT that introduces optimized kernels and a heuristic search algorithm to significantly boost throughput and energy efficiency compared to existing designs.

Authors:Shiao Wang, Xiao Wang, Chao Wang, Liye Jin, Lin Zhu, Bo Jiang, Yonghong Tian, Jin Tang
Title: Event Stream-based Visual Object Tracking: HDETrack V2 and A High-Definition Benchmark
Abstract:
We then introduce a novel hierarchical knowledge distillation strategy that incorporates the similarity matrix, feature representation, and response map-based distillation to guide the learning of the student Transformer network. We also enhance the model's ability to capture temporal dependencies by applying the temporal Fourier transform to establish temporal relationships between video frames. We adapt the network model to specific target objects during testing via a newly proposed test-time tuning strategy to achieve high performance and flexibility in target tracking. Recognizing the limitations of existing event-based tracking datasets, which are predominantly low-resolution, we propose EventVOT, the first large-scale high-resolution event-based tracking dataset. It comprises 1141 videos spanning diverse categories such as pedestrians, vehicles, UAVs, ping pong, etc. Extensive experiments on both low-resolution (FE240hz, VisEvent, FELT), and our newly proposed high-resolution EventVOT dataset fully validated the effectiveness of our proposed method. Both the benchmark dataset and source code have been released on https://github.com/Event-AHU/EventVOT_Benchmark
中文摘要:本文提出了一种结合相似性矩阵、特征表示和响应图的分层知识蒸馏方法,并使用时域傅里叶变换增强视频跟踪的时序建模能力,通过新发布的高分辨率事件数据集EventVOT验证了其有效性。
English Summary: This paper introduces a hierarchical knowledge distillation method and temporal Fourier transform to enhance a student Transformer network for video tracking, validated by a new high-resolution event-based dataset called EventVOT.

Authors:Marian Lupascu, Ana-Cristina Rogoz, Mihai Sorin Stupariu, Radu Tudor Ionescu
Title: Large Multimodal Models for Low-Resource Languages: A Survey
Abstract:
In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 106 studies across 75 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. We aim to provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: https://github.com/marianlupascu/LMM4LRL-Survey.
中文摘要:本综述系统分析了将大型多模态模型适配低资源语言的技术,发现视觉增强是提升性能的关键桥梁,但幻觉缓解和计算效率仍是主要挑战。
English Summary: This survey systematically examines techniques for adapting large multimodal models to low-resource languages, identifying visual enhancement as a key strategy while highlighting persistent challenges in hallucination mitigation and computational efficiency.

Authors:Xiaoyang Liu, Kangjie Bao, Jiashuo Zhang, Yunqi Liu, Yu Chen, Yuntian Liu, Yang Jiao, Tao Luo
Title: ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data
Abstract:
Autoformalization, the automatic translation of mathematical content from natural language into machine-verifiable formal languages, has seen significant progress driven by advances in large language models (LLMs). Nonetheless, a primary barrier to further improvements is the limited availability of parallel corpora that map informal mathematical text to its formal counterpart. To address this limitation, we propose ATLAS (Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data), a novel data generation framework designed to produce large-scale, high-quality parallel corpora of theorem statements. Distinct from prior approaches, ATLAS begins with a concept repository, accelerates the improvement of the student model through expert iteration combined with knowledge distillation, and introduces two novel augmentation strategies that exploit the structural characteristics of formal languages. Running the proposed ATLAS framework for 10 iterations, we construct an undergraduate-level dataset of 117k theorem statements and develop the ATLAS Translator by fine-tuning Llama3.1-8B-Instruct with LoRA. This model establishes a new state of the art, demonstrating statistically significant improvements over both the Herald Translator and the Kimina-Autoformalizer across all benchmarks (p<0.05, two-sided t-test). Furthermore, we demonstrate that the full-parameter fine-tuning of a stronger base model on the ATLAS dataset leads to superior performance. The datasets, model, and code are available at https://github.com/XiaoyangLiu-sjtu/ATLAS.
自动形式化虽因大语言模型取得进展,但面临数据匮乏的瓶颈,ATLAS框架通过生成高质量并行定理数据集解决了这一问题,实现了最先进的翻译性能。
Autoformalization has advanced with large language models, but faces a data scarcity issue, which the ATLAS framework addresses by generating high-quality parallel theorem datasets, leading to state-of-the-art translation performance.

Authors:Jingang Qu, David Holzmüller, Gaël Varoquaux, Marine Le Morvan
Title: TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
Abstract:
The long-standing dominance of gradient-boosted decision trees on tabular data is currently challenged by tabular foundation models using In-Context Learning (ICL): setting the training data as context for the test data and predicting in a single forward pass without parameter updates. While TabPFNv2 foundation model excels on tables with up to 10K samples, its alternating column- and row-wise attentions make handling large training sets computationally prohibitive. So, can ICL be effectively scaled and deliver a benefit for larger tables? We introduce TabICL, a tabular foundation model for classification, pretrained on synthetic datasets with up to 60K samples and capable of handling 500K samples on affordable resources. This is enabled by a novel two-stage architecture: a column-then-row attention mechanism to build fixed-dimensional embeddings of rows, followed by a transformer for efficient ICL. Across 200 classification datasets from the TALENT benchmark, TabICL is on par with TabPFNv2 while being systematically faster (up to 10 times), and significantly outperforms all other approaches. On 53 datasets with over 10K samples, TabICL surpasses both TabPFNv2 and CatBoost, demonstrating the potential of ICL for large data. Pretraining code, inference code, and pre-trained models are available at https://github.com/soda-inria/tabicl.
中文: TabICL通过创新的两阶段架构(先列后行的注意力机制)实现了表格数据中上下文学习的高效扩展,在保持与TabPFNv2相当性能的同时速度显著提升,并在大规模数据集上超越所有现有方法。
English: TabICL introduces a novel two-stage architecture with column-then-row attention to efficiently scale In-Context Learning for tabular data, matching TabPFNv2's performance while being significantly faster and outperforming other methods on large datasets.

Authors:Qirui Wu, Shizhou Zhang, De Cheng, Yinghui Xing, Di Xu, Peng Wang, Yanning Zhang
Title: Demystifying Catastrophic Forgetting in Two-Stage Incremental Object Detector
Abstract:
Catastrophic forgetting is a critical chanllenge for incremental object detection (IOD). Most existing methods treat the detector monolithically, relying on instance replay or knowledge distillation without analyzing component-specific forgetting. Through dissection of Faster R-CNN, we reveal a key insight: Catastrophic forgetting is predominantly localized to the RoI Head classifier, while regressors retain robustness across incremental stages. This finding challenges conventional assumptions, motivating us to develop a framework termed NSGP-RePRE. Regional Prototype Replay (RePRE) mitigates classifier forgetting via replay of two types of prototypes: coarse prototypes represent class-wise semantic centers of RoI features, while fine-grained prototypes model intra-class variations. Null Space Gradient Projection (NSGP) is further introduced to eliminate prototype-feature misalignment by updating the feature extractor in directions orthogonal to subspace of old inputs via gradient projection, aligning RePRE with incremental learning dynamics. Our simple yet effective design allows NSGP-RePRE to achieve state-of-the-art performance on the Pascal VOC and MS COCO datasets under various settings. Our work not only advances IOD methodology but also provide pivotal insights for catastrophic forgetting mitigation in IOD. Code is available at \href{https://github.com/fanrena/NSGP-RePRE}{https://github.com/fanrena/NSGP-RePRE} .
Chinese: 该研究发现增量目标检测中的灾难性遗忘主要影响分类器组件,因此提出了NSGP-RePRE框架,通过区域原型回放和零空间梯度投影,在多个数据集上实现了最先进的性能。
English: The study identifies that catastrophic forgetting in incremental object detection mainly affects the classifier component, leading to the development of NSGP-RePRE, which uses regional prototype replay and null space gradient projection to achieve state-of-the-art results on benchmark datasets.

Authors:Zinan Lin, Tadas Baltrusaitis, Wenyu Wang, Sergey Yekhanin
Title: Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Model
Abstract:
Differentially private (DP) synthetic data, which closely resembles the original private data while maintaining strong privacy guarantees, has become a key tool for unlocking the value of private data without compromising privacy. Recently, Private Evolution (PE) has emerged as a promising method for generating DP synthetic data. Unlike other training-based approaches, PE only requires access to inference APIs from foundation models, enabling it to harness the power of state-of-the-art (SoTA) models. However, a suitable foundation model for a specific private data domain is not always available. In this paper, we discover that the PE framework is sufficiently general to allow APIs beyond foundation models. In particular, we demonstrate that many SoTA data synthesizers that do not rely on neural networks--such as computer graphics-based image generators, which we refer to as simulators--can be effectively integrated into PE. This insight significantly broadens PE's applicability and unlocks the potential of powerful simulators for DP data synthesis. We explore this approach, named Sim-PE, in the context of image synthesis. Across four diverse simulators, Sim-PE performs well, improving the downstream classification accuracy of PE by up to 3x, reducing FID by up to 80%, and offering much greater efficiency. We also show that simulators and foundation models can be easily leveraged together within PE to achieve further improvements. The code is open-sourced in the Private Evolution Python library: https://github.com/microsoft/DPSDA.
中文: 本研究提出了Sim-PE,作为私有进化框架的扩展,通过整合非神经网络的模拟器来生成差分隐私合成数据,显著提升了图像合成中的性能、效率和适用性。
English: The study introduces Sim-PE, an extension of the Private Evolution framework that integrates non-neural network simulators for differentially private synthetic data generation, significantly enhancing performance, efficiency, and applicability in image synthesis.

Authors:Qianteng Zhu, Gert Aarts, Wei Wang, Kai Zhou, Lingxiao Wang
Title: Physics-Conditioned Diffusion Models for Lattice Gauge Theory
Abstract:
We develop diffusion models for simulating lattice gauge theories, where stochastic quantization is explicitly incorporated as a physical condition for sampling. We demonstrate the applicability of this novel sampler to U(1) gauge theory in two spacetime dimensions and find that a model trained at a small inverse coupling constant can be extrapolated to larger inverse coupling regions without encountering the topological freezing problem. Additionally, the trained model can be employed to sample configurations on different lattice sizes without requiring further training. The exactness of the generated samples is ensured by incorporating Metropolis-adjusted Langevin dynamics into the generation process. Furthermore, we demonstrate that this approach enables more efficient sampling of topological quantities compared to traditional algorithms such as Hybrid Monte Carlo and Langevin simulations.
中文: 本研究开发了用于模拟晶格规范理论的扩散模型,通过随机量化对二维时空中的U(1)规范理论进行采样,无需重新训练即可外推至更大耦合参数和不同晶格尺寸,并利用Metropolis调整的Langevin动力学保证样本精确性,在拓扑量采样效率上优于传统算法。
English: This study introduces diffusion models for simulating lattice gauge theories, incorporating stochastic quantization to sample U(1) gauge theory in 2D spacetime, enabling extrapolation to larger couplings and different lattice sizes without retraining while ensuring sample exactness through Metropolis-adjusted Langevin dynamics and outperforming traditional algorithms in topological quantity sampling.

Authors:Yongfan Chen, Xiuwen Zhu, Tianyu Li
Title: A Physical Coherence Benchmark for Evaluating Video Generation Models via Optical Flow-guided Frame Prediction
Abstract:
Recent advances in video generation models demonstrate their potential as world simulators, but they often struggle with videos deviating from physical laws, a key concern overlooked by most text-to-video benchmarks. We introduce a benchmark designed specifically to assess the Physical Coherence of generated videos, PhyCoBench. Our benchmark includes 120 prompts covering 7 categories of physical principles, capturing key physical laws observable in video content. We evaluated four state-of-the-art (SoTA) T2V models on PhyCoBench and conducted manual assessments. Additionally, we propose an automated evaluation model: PhyCoPredictor, a diffusion model that generates optical flow and video frames in a cascade manner. Through a consistency evaluation comparing automated and manual sorting, the experimental results show that PhyCoPredictor currently aligns most closely with human evaluation. Therefore, it can effectively evaluate the physical coherence of videos, providing insights for future model optimization. Our benchmark, including physical coherence prompts, the automatic evaluation tool PhyCoPredictor, and the generated video dataset, has been released on GitHub at https://github.com/Jeckinchen/PhyCoBench.
中文: 该摘要介绍了PhyCoBench基准测试,专门用于评估生成视频在七类物理原理上的连贯性,并提出了PhyCoPredictor自动评估模型,其评估结果与人工评估高度一致,可为未来模型优化提供指导。
English: This abstract introduces PhyCoBench, a benchmark designed to evaluate the physical coherence of generated videos across seven categories of physical principles, and proposes PhyCoPredictor, an automated evaluation model that aligns closely with human assessments to guide future model improvements.

Authors:Zhiqiang Liu, Chengtao Gan, Junjie Wang, Yichi Zhang, Zhongpu Bo, Mengshu Sun, Huajun Chen, Wen Zhang
Title: OntoTune: Ontology-Driven Self-training for Aligning Large Language Models
Abstract:
Existing domain-specific Large Language Models (LLMs) are typically developed by fine-tuning general-purposed LLMs with large-scale domain-specific corpora. However, training on large-scale corpora often fails to effectively organize domain knowledge of LLMs, leading to fragmented understanding. Inspired by how humans connect concepts and organize knowledge through mind maps, we aim to emulate this approach by using ontology with hierarchical conceptual knowledge to reorganize LLM's domain knowledge. From this perspective, we propose an ontology-driven self-training framework called OntoTune, which aims to align LLMs with ontology through in-context learning, enabling the generation of responses guided by the ontology. We leverage in-context learning to identify whether the LLM has acquired the specific concept's ontology knowledge, and select the entries not yet mastered by LLM as the training set to further align the LLM with ontology. Compared to existing domain LLMs based on newly collected large-scale domain-specific corpora, our OntoTune, which relies on the existing, long-term developed ontology and LLM itself, significantly reduces data maintenance costs and offers improved generalization ability. We conduct our study in the medical domain to evaluate the effectiveness of OntoTune, utilizing a standardized medical ontology, SNOMED CT as our ontology source. Experimental results demonstrate that OntoTune achieves state-of-the-art performance in both in-ontology task hypernym discovery and out-of-ontology task medical domain QA. Moreover, compared to the latest direct ontology injection method TaxoLLaMA, our OntoTune better preserves original knowledge of LLM. The code and data are available at https://github.com/zjukg/OntoTune.
中文摘要:OntoTune框架通过本体驱动的自训练方法,将大语言模型与层次化知识结构对齐,在提升领域任务表现的同时显著降低数据维护成本并保留模型原有知识。
English Summary: The OntoTune framework enhances domain-specific LLMs by using ontology-driven self-training to align models with hierarchical knowledge structures, improving performance while reducing data costs and preserving original capabilities.

Authors:Shengdong Zhang, Fan Jia, Xiang Li, Hao Zhang, Jun Shi, Liyan Ma, Shihui Ying
Title: LMS-Net: A Learned Mumford-Shah Network For Few-Shot Medical Image Segmentation
Abstract:
Few-shot semantic segmentation (FSS) methods have shown great promise in handling data-scarce scenarios, particularly in medical image segmentation tasks. However, most existing FSS architectures lack sufficient interpretability and fail to fully incorporate the underlying physical structures of semantic regions. To address these issues, in this paper, we propose a novel deep unfolding network, called the Learned Mumford-Shah Network (LMS-Net), for the FSS task. Specifically, motivated by the effectiveness of pixel-to-prototype comparison in prototypical FSS methods and the capability of deep priors to model complex spatial structures, we leverage our learned Mumford-Shah model (LMS model) as a mathematical foundation to integrate these insights into a unified framework. By reformulating the LMS model into prototype update and mask update tasks, we propose an alternating optimization algorithm to solve it efficiently. Further, the iterative steps of this algorithm are unfolded into corresponding network modules, resulting in LMS-Net with clear interpretability. Comprehensive experiments on three publicly available medical segmentation datasets verify the effectiveness of our method, demonstrating superior accuracy and robustness in handling complex structures and adapting to challenging segmentation scenarios. These results highlight the potential of LMS-Net to advance FSS in medical imaging applications. Our code will be available at: https://github.com/SDZhang01/LMSNet
中文: 本文提出LMS-Net,通过将原型比较与学习的Mumford-Shah模型相结合,构建可解释的深度展开网络,在少样本医学图像分割中实现了更优的准确性和结构处理能力。
English: This paper introduces LMS-Net, a deep unfolding network that enhances few-shot medical image segmentation by integrating prototype comparison with the learned Mumford-Shah model, achieving superior accuracy and interpretability through alternating optimization.

Authors:Shadab Ahamed, Simon Ghyselincks, Pablo Chang Huang Arias, Julian Kloiber, Yasin Ranjbar, Jingrong Tang, Niloufar Zakariaei, Eldad Haber
Title: Inversion of Magnetic Data using Learned Dictionaries and Scale Space
Abstract:
Magnetic data inversion is an important tool in geophysics, used to infer subsurface magnetic susceptibility distributions from surface magnetic field measurements. This inverse problem is inherently ill-posed, characterized by non-unique solutions, depth ambiguity, and sensitivity to noise. Traditional inversion approaches rely on predefined regularization techniques to stabilize solutions, limiting their adaptability to complex or diverse geological scenarios. In this study, we propose an approach that integrates variable dictionary learning and scale-space methods to address these challenges. Our method employs learned dictionaries, allowing for adaptive representation of complex subsurface features that are difficult to capture with predefined bases. Additionally, we extend classical variational inversion by incorporating multi-scale representations through a scale-space framework, enabling the progressive introduction of structural detail while mitigating overfitting. We implement both fixed and dynamic dictionary learning techniques, with the latter introducing iteration-dependent dictionaries for enhanced flexibility. Using a synthetic dataset to simulate geological scenarios, we demonstrate significant improvements in reconstruction accuracy and robustness compared to conventional variational and dictionary-based methods. Our results highlight the potential of learned dictionaries, especially when coupled with scale-space dynamics, to improve model recovery and noise handling. These findings underscore the promise of our data-driven approach for advance magnetic data inversion and its applications in geophysical exploration, environmental assessment, and mineral prospecting. The code is publicly available at: https://github.com/ahxmeds/magnetic-inversion-dictionary.git.
中文: 本研究提出了一种结合自适应字典学习和多尺度框架的新型磁数据反演方法,通过优化特征表征和噪声抑制能力显著提升了地下结构重建效果,在精度与鲁棒性上均优于传统方法。
English: This study introduces a novel magnetic data inversion method that combines adaptive dictionary learning with a multi-scale framework to enhance subsurface reconstruction by improving feature representation and noise resilience, outperforming traditional techniques in accuracy and robustness.

Authors:Xuanyu Tian, Lixuan Chen, Qing Wu, Chenhe Du, Jingjing Shi, Hongjiang Wei, Yuyao Zhang
Title: Unsupervised Self-Prior Embedding Neural Representation for Iterative Sparse-View CT Reconstruction
Abstract:
Emerging unsupervised implicit neural representation (INR) methods, such as NeRP, NeAT, and SCOPE, have shown great potential to address sparse-view computed tomography (SVCT) inverse problems. Although these INR-based methods perform well in relatively dense SVCT reconstructions, they struggle to achieve comparable performance to supervised methods in sparser SVCT scenarios. They are prone to being affected by noise, limiting their applicability in real clinical settings. Additionally, current methods have not fully explored the use of image domain priors for solving SVCsT inverse problems. In this work, we demonstrate that imperfect reconstruction results can provide effective image domain priors for INRs to enhance performance. To leverage this, we introduce Self-prior embedding neural representation (Spener), a novel unsupervised method for SVCT reconstruction that integrates iterative reconstruction algorithms. During each iteration, Spener extracts local image prior features from the previous iteration and embeds them to constrain the solution space. Experimental results on multiple CT datasets show that our unsupervised Spener method achieves performance comparable to supervised state-of-the-art (SOTA) methods on in-domain data while outperforming them on out-of-domain datasets. Moreover, Spener significantly improves the performance of INR-based methods in handling SVCT with noisy sinograms. Our code is available at https://github.com/MeijiTian/Spener.
中文摘要:无监督Spener方法通过嵌入迭代重建中的图像先验来增强稀疏视图CT重建,在域内数据上达到与监督方法相当的性能,并在域外数据集上展现出更优的泛化能力。
English Summary: The unsupervised Spener method enhances sparse-view CT reconstruction by embedding image priors from iterative reconstructions, achieving performance comparable to supervised methods on in-domain data and superior generalization on out-of-domain datasets.

Authors:Vanshali Sharma, Debesh Jha, M. K. Bhuyan, Pradip K. Das, Ulas Bagci
Title: Diverse Image Generation with Diffusion Models and Cross Class Label Learning for Polyp Classification
Abstract:
Pathologic diagnosis is a critical phase in deciding the optimal treatment procedure for dealing with colorectal cancer (CRC). Colonic polyps, precursors to CRC, can pathologically be classified into two major types: adenomatous and hyperplastic. For precise classification and early diagnosis of such polyps, the medical procedure of colonoscopy has been widely adopted paired with various imaging techniques, including narrow band imaging and white light imaging. However, the existing classification techniques mainly rely on a single imaging modality and show limited performance due to data scarcity. Recently, generative artificial intelligence has been gaining prominence in overcoming such issues. Additionally, various generation-controlling mechanisms using text prompts and images have been introduced to obtain visually appealing and desired outcomes. However, such mechanisms require class labels to make the model respond efficiently to the provided control input. In the colonoscopy domain, such controlling mechanisms are rarely explored; specifically, the text prompt is a completely uninvestigated area. Moreover, the unavailability of expensive class-wise labels for diverse sets of images limits such explorations. Therefore, we develop a novel model, PathoPolyp-Diff, that generates text-controlled synthetic images with diverse characteristics in terms of pathology, imaging modalities, and quality. We introduce cross-class label learning to make the model learn features from other classes, reducing the burdensome task of data annotation. The experimental results report an improvement of up to 7.91% in balanced accuracy using a publicly available dataset. Moreover, cross-class label learning achieves a statistically significant improvement of up to 18.33% in balanced accuracy during video-level analysis. The code is available at https://github.com/Vanshali/PathoPolyp-Diff.
Chinese: 该研究提出了PathoPolyp-Diff模型,这是一种创新的生成式人工智能,通过文本提示生成多样化的合成结肠镜图像,并采用跨类别标签学习来提高结直肠癌诊断的分类准确性,有效解决了数据稀缺问题。
English: The study introduces PathoPolyp-Diff, a novel generative AI model that uses text prompts to create diverse synthetic colonoscopy images, enhancing classification accuracy through cross-class label learning and addressing data scarcity in colorectal cancer diagnosis.

Authors:Dylan Waldner, Risto Miikkulainen
Title: The Odyssey of the Fittest: Can Agents Survive and Still Be Good?
Abstract:
As AI models grow in power and generality, understanding how agents learn and make decisions in complex environments is critical to promoting ethical behavior. This study introduces the Odyssey, a lightweight, adaptive text based adventure game, providing a scalable framework for exploring AI ethics and safety. The Odyssey examines the ethical implications of implementing biological drives, specifically, self preservation, into three different agents. A Bayesian agent optimized with NEAT, a Bayesian agent optimized with stochastic variational inference, and a GPT 4o agent. The agents select actions at each scenario to survive, adapting to increasingly challenging scenarios. Post simulation analysis evaluates the ethical scores of the agent decisions, uncovering the tradeoffs it navigates to survive. Specifically, analysis finds that when danger increases, agents ethical behavior becomes unpredictable. Surprisingly, the GPT 4o agent outperformed the Bayesian models in both survival and ethical consistency, challenging assumptions about traditional probabilistic methods and raising a new challenge to understand the mechanisms of LLMs' probabilistic reasoning.
中文摘要:本研究通过引入轻量级文本冒险游戏《奥德赛》,探索了三种具有生物驱动力的智能体的伦理行为,发现在危险增加时伦理行为变得不可预测,且GPT-4o智能体在生存能力和伦理一致性上均意外优于贝叶斯模型。
English Summary: This study introduces the Odyssey, a text-based game framework, to explore AI ethics by testing three agents with biological drives, finding that ethical behavior becomes unpredictable under danger and that the GPT-4o agent surprisingly outperformed Bayesian models in both survival and ethical consistency.

Authors:Shuheng Zhang, Yuqi Liu, Hongbo Zhou, Jun Peng, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji
Title: AdaFlow: Efficient Long Video Editing via Adaptive Attention Slimming And Keyframe Selection
Abstract:
Despite great progress, text-driven long video editing is still notoriously challenging mainly due to excessive memory overhead. Although recent efforts have simplified this task into a two-step process of keyframe translation and interpolation generation, the token-wise keyframe translation still plagues the upper limit of video length. In this paper, we propose a novel and training-free approach towards efficient and effective long video editing, termed AdaFlow. We first reveal that not all tokens of video frames hold equal importance for keyframe translation, based on which we propose an Adaptive Attention Slimming scheme for AdaFlow to squeeze the $KV$ sequence, thus increasing the number of keyframes for translations by an order of magnitude. In addition, an Adaptive Keyframe Selection scheme is also equipped to select the representative frames for joint editing, further improving generation quality. With these innovative designs, AdaFlow achieves high-quality long video editing of minutes in one inference, i.e., more than 1$k$ frames on one A800 GPU, which is about ten times longer than the compared methods, e.g., TokenFlow. To validate AdaFlow, we also build a new benchmark for long video editing with high-quality annotations, termed LongV-EVAL. Our code is released at: https://github.com/jidantang55/AdaFlow.
中文: AdaFlow提出了一种无需训练的长视频编辑方法,通过自适应注意力精简和关键帧选择机制,实现了对超过1000帧视频的高效处理,性能远超TokenFlow等方法。
English: AdaFlow introduces a training-free method for long video editing by adaptively slimming attention mechanisms and selecting keyframes, enabling efficient processing of over 1,000 frames and significantly outperforming existing approaches like TokenFlow.

Authors:Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, Zhijie Deng
Title: UniCMs: A Unified Consistency Model For Efficient Multimodal Generation and Understanding
Abstract:
Consistency models (CMs) have shown promise in the efficient generation of both image and text. This raises the natural question of whether we can learn a unified CM for efficient multimodal generation (e.g., text-to-image) and understanding (e.g., image-to-text). Intuitively, such a model could be acquired by applying the consistency distillation (CD) to existing unified multimodal models. However, the key challenge is establishing a unified denoising perspective for both image and text generation, which is essential for establishing the consistency mapping. To tackle this, at the representation level, we advocate for discrete tokens for both modalities to best preserve language modeling capabilities. Critically, instead of defining the text denoising trajectory via recent discrete diffusion language modeling principles, we specify it using the parallel decoding trace of an autoregressive language model, benefiting from the latter's superior performance in general text generation tasks. The denoising trajectory of image tokens adheres to standard discrete diffusion. We train our unified consistency models (UniCMs) on these combined multimodal trajectories simultaneously with a unified objective. We introduce a trajectory segmentation strategy to further improve the training convergence. Empirically, in text-to-image generation, UniCMs outperform SD3 on GenEval, Image Reward, and CLIP Score metrics, while requiring only approximately ${1}/{8}$ of the sampling time. Meanwhile, in image-to-text generation, UniCMs surpass Show-o on the MMMU benchmark while being $1.5 \times$ faster at long-sequence generating speed. The code is available at https://github.com/zhijie-group/UniCMs.
Chinese: 统一一致性模型(UniCMs)通过结合图像和文本的离散令牌去噪轨迹,实现了多模态内容的高效生成与理解,在文生图和图生文任务中均超越现有模型性能,同时大幅提升了生成速度。
English: Unified consistency models (UniCMs) efficiently generate and understand multimodal content by combining discrete token denoising trajectories for images and text, achieving superior performance in both text-to-image and image-to-text tasks with significantly faster speeds than existing models.

Authors:William Huey, Huaxiaoyue Wang, Anne Wu, Yoav Artzi, Sanjiban Choudhury
Title: Imitation Learning from a Single Temporally Misaligned Video
Abstract:
We examine the problem of learning sequential tasks from a single visual demonstration. A key challenge arises when demonstrations are temporally misaligned due to variations in timing, differences in embodiment, or inconsistencies in execution. Existing approaches treat imitation as a distribution-matching problem, aligning individual frames between the agent and the demonstration. However, we show that such frame-level matching fails to enforce temporal ordering or ensure consistent progress. Our key insight is that matching should instead be defined at the level of sequences. We propose that perfect matching occurs when one sequence successfully covers all the subgoals in the same order as the other sequence. We present ORCA (ORdered Coverage Alignment), a dense per-timestep reward function that measures the probability of the agent covering demonstration frames in the correct order. On temporally misaligned demonstrations, we show that agents trained with the ORCA reward achieve $4.5$x improvement ($0.11 \rightarrow 0.50$ average normalized returns) for Meta-world tasks and $6.6$x improvement ($6.55 \rightarrow 43.3$ average returns) for Humanoid-v4 tasks compared to the best frame-level matching algorithms. We also provide empirical analysis showing that ORCA is robust to varying levels of temporal misalignment. Our code is available at https://github.com/portal-cornell/orca/
中文: 本文提出ORCA序列对齐方法,通过确保智能体按顺序覆盖演示子目标来解决视觉模仿学习中的时序错位问题,相比帧级匹配方法实现了显著性能提升。
English: This paper introduces ORCA, a sequence-level alignment method that addresses temporal misalignment in visual imitation learning by ensuring ordered coverage of demonstration subgoals, achieving significant performance improvements over frame-level matching approaches.

Authors:Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, Sijia Liu
Title: Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond
Abstract:
The LLM unlearning technique has recently been introduced to comply with data regulations and address the safety and ethical concerns of LLMs by removing the undesired data-model influence. However, state-of-the-art unlearning methods face a critical vulnerability: they are susceptible to ``relearning'' the removed information from a small number of forget data points, known as relearning attacks. In this paper, we systematically investigate how to make unlearned models robust against such attacks. For the first time, we establish a connection between robust unlearning and sharpness-aware minimization (SAM) through a unified robust optimization framework, in an analogy to adversarial training designed to defend against adversarial attacks. Our analysis for SAM reveals that smoothness optimization plays a pivotal role in mitigating relearning attacks. Thus, we further explore diverse smoothing strategies to enhance unlearning robustness. Extensive experiments on benchmark datasets, including WMDP and MUSE, demonstrate that SAM and other smoothness optimization approaches consistently improve the resistance of LLM unlearning to relearning attacks. Notably, smoothness-enhanced unlearning also helps defend against (input-level) jailbreaking attacks, broadening our proposal's impact in robustifying LLM unlearning. Codes are available at https://github.com/OPTML-Group/Unlearn-Smooth.
中文: 本研究提出了一种鲁棒遗忘框架,通过锐度感知最小化和平滑性优化来防御大语言模型中的再学习和越狱攻击,并在基准数据集上进行了实验验证。
English: The study introduces a robust unlearning framework for LLMs that leverages sharpness-aware minimization and smoothness optimization to defend against relearning and jailbreaking attacks, with experimental validation on benchmark datasets.

Authors:Yitian Long, Zhongze Wu, Xiu Su, Lining Yu, Ruining Deng, Haichun Yang, Yuankai Huo
Title: Towards Fine-grained Renal Vasculature Segmentation: Full-Scale Hierarchical Learning with FH-Seg
Abstract:
Accurate fine-grained segmentation of the renal vasculature is critical for nephrological analysis, yet it faces challenges due to diverse and insufficiently annotated images. Existing methods struggle to accurately segment intricate regions of the renal vasculature, such as the inner and outer walls, arteries and lesions. In this paper, we introduce FH-Seg, a Full-scale Hierarchical Learning Framework designed for comprehensive segmentation of the renal vasculature. Specifically, FH-Seg employs full-scale skip connections that merge detailed anatomical information with contextual semantics across scales, effectively bridging the gap between structural and pathological contexts. Additionally, we implement a learnable hierarchical soft attention gates to adaptively reduce interference from non-core information, enhancing the focus on critical vascular features. To advance research on renal pathology segmentation, we also developed a Large Renal Vasculature (LRV) dataset, which contains 16,212 fine-grained annotated images of 5,600 renal arteries. Extensive experiments on the LRV dataset demonstrate FH-Seg's superior accuracies (71.23% Dice, 73.06% F1), outperforming Omni-Seg by 2.67 and 2.13 percentage points respectively. Code is available at: https://github.com/hrlblab/FH-seg.
中文: FH-Seg通过全尺度分层学习框架,结合多尺度解剖信息并采用可学习的注意力门来聚焦关键血管特征,在新型LRV数据集上实现了卓越的肾脏血管分割精度。
English: FH-Seg, a full-scale hierarchical learning framework, enhances renal vasculature segmentation by integrating multi-scale anatomical details and using attention gates to focus on critical features, achieving superior accuracy on the new LRV dataset.

Authors:Mukesh Ghimire, Zhe Xu, Yi Ren
Title: Two-Player Zero-Sum Differential Games with One-Sided Information
Abstract:
Unlike Poker where the action space $\mathcal{A}$ is discrete, differential games in the physical world often have continuous action spaces not amenable to discrete abstraction, rendering no-regret algorithms with $\mathcal{O}(|\mathcal{A}|)$ complexity not scalable. To address this challenge within the scope of two-player zero-sum (2p0s) games with one-sided information, we show that (1) a computational complexity independent of $|\mathcal{A}|$ can be achieved by exploiting the convexification property of incomplete-information games and the Isaacs' condition that commonly holds for dynamical systems, and that (2) the computation of the two equilibrium strategies can be decoupled under one-sidedness of information. Leveraging these insights, we develop an algorithm that successfully approximates the optimal strategy in a homing game. Code available in https://github.com/ghimiremukesh/cams/tree/workshop
中文: 该研究利用博弈凸化和Isaacs条件,解决了单边信息双人零和博弈中连续动作空间的可扩展性问题,开发出解耦策略计算方法,并在归航游戏中验证了其有效性。
English: The study overcomes the scalability limitations of no-regret algorithms in continuous action spaces for two-player zero-sum games with one-sided information by leveraging game convexification and Isaacs' condition, developing a decoupled strategy computation method validated in a homing game.

Authors:Yuting He, Boyu Wang, Rongjun Ge, Yang Chen, Guanyu Yang, Shuo Li
Title: Homeomorphism Prior for False Positive and Negative Problem in Medical Image Dense Contrastive Representation Learning
Abstract:
Dense contrastive representation learning (DCRL) has greatly improved the learning efficiency for image-dense prediction tasks, showing its great potential to reduce the large costs of medical image collection and dense annotation. However, the properties of medical images make unreliable correspondence discovery, bringing an open problem of large-scale false positive and negative (FP&N) pairs in DCRL. In this paper, we propose GEoMetric vIsual deNse sImilarity (GEMINI) learning which embeds the homeomorphism prior to DCRL and enables a reliable correspondence discovery for effective dense contrast. We propose a deformable homeomorphism learning (DHL) which models the homeomorphism of medical images and learns to estimate a deformable mapping to predict the pixels' correspondence under topological preservation. It effectively reduces the searching space of pairing and drives an implicit and soft learning of negative pairs via a gradient. We also propose a geometric semantic similarity (GSS) which extracts semantic information in features to measure the alignment degree for the correspondence learning. It will promote the learning efficiency and performance of deformation, constructing positive pairs reliably. We implement two practical variants on two typical representation learning tasks in our experiments. Our promising results on seven datasets which outperform the existing methods show our great superiority. We will release our code on a companion link: https://github.com/YutingHe-list/GEMINI.
中文:GEMINI方法通过引入可变形同胚映射和几何语义相似性,将同胚先验融入密集对比表示学习中,有效减少了医学图像中的误匹配对,并在多个数据集上显著提升了性能表现。
English: The GEMINI method enhances dense contrastive representation learning for medical images by incorporating a homeomorphism prior through deformable mapping and geometric semantic similarity, effectively reducing false positives and negatives while improving performance across various datasets.

Authors:Weihua Du, Yiming Yang, Sean Welleck
Title: Optimizing Temperature for Language Models with Multi-Sample Inference
Abstract:
Multi-sample aggregation strategies, such as majority voting and best-of-N sampling, are widely used in contemporary large language models (LLMs) to enhance predictive accuracy across various tasks. A key challenge in this process is temperature selection, which significantly impacts model performance. Existing approaches either rely on a fixed default temperature or require labeled validation data for tuning, which are often scarce and difficult to obtain. This paper addresses the challenge of automatically identifying the (near)-optimal temperature for different LLMs using multi-sample aggregation strategies, without relying on task-specific validation data. We provide a comprehensive analysis of temperature's role in performance optimization, considering variations in model architectures, datasets, task types, model sizes, and predictive accuracy. Furthermore, we propose a novel entropy-based metric for automated temperature optimization, which consistently outperforms fixed-temperature baselines. Additionally, we incorporate a stochastic process model to enhance interpretability, offering deeper insights into the relationship between temperature and model performance.
中文摘要:本文提出一种无需标注验证数据的自动化方法,通过新颖的基于熵的指标和随机过程模型优化大语言模型中的温度选择,在多样本聚合策略下显著提升模型性能与可解释性。
English Summary: This paper introduces an automated method for optimizing temperature selection in large language models using multi-sample aggregation, eliminating the need for labeled validation data through a novel entropy-based metric and stochastic process model for improved performance and interpretability.

Authors:Gonzalo Gonzalez-Pumariega, Leong Su Yean, Neha Sunkara, Sanjiban Choudhury
Title: Robotouille: An Asynchronous Planning Benchmark for LLM Agents
Abstract:
Effective asynchronous planning, or the ability to efficiently reason and plan over states and actions that must happen in parallel or sequentially, is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents. While large language model (LLM) agents show promise in high-level task planning, current benchmarks focus primarily on short-horizon tasks and do not evaluate such asynchronous planning capabilities. We introduce Robotouille, a challenging benchmark environment designed to test LLM agents' ability to handle long-horizon asynchronous scenarios. Our synchronous and asynchronous datasets capture increasingly complex planning challenges that go beyond existing benchmarks, requiring agents to manage overlapping tasks and interruptions. Our results show that ReAct (gpt4-o) achieves 47% on synchronous tasks but only 11% on asynchronous tasks, highlighting significant room for improvement. We further analyze failure modes, demonstrating the need for LLM agents to better incorporate long-horizon feedback and self-audit their reasoning during task execution. Code is available at https://github.com/portal-cornell/robotouille.
中文: 摘要介绍了Robotouille基准测试,旨在评估大语言模型代理处理复杂长程任务的异步规划能力,揭示了其性能差距及改进推理与反馈机制的必要性。
English: The abstract introduces Robotouille, a benchmark designed to evaluate LLM agents' asynchronous planning capabilities for complex, long-horizon tasks, revealing significant performance gaps and the need for improved reasoning and feedback integration.

Authors:Jun Pyo Seo
Title: Blackout DIFUSCO
Abstract:
This study explores the integration of Blackout Diffusion into the DIFUSCO framework for combinatorial optimization, specifically targeting the Traveling Salesman Problem (TSP). Inspired by the success of discrete-time diffusion models (D3PM) in maintaining structural integrity, we extend the paradigm to a continuous-time framework, leveraging the unique properties of Blackout Diffusion. Continuous-time modeling introduces smoother transitions and refined control, hypothesizing enhanced solution quality over traditional discrete methods. We propose three key improvements to enhance the diffusion process. First, we transition from a discrete-time-based model to a continuous-time framework, providing a more refined and flexible formulation. Second, we refine the observation time scheduling to ensure a smooth and linear transformation throughout the diffusion process, allowing for a more natural progression of states. Finally, building upon the second improvement, we further enhance the reverse process by introducing finer time slices in regions that are particularly challenging for the model, thereby improving accuracy and stability in the reconstruction phase. Although the experimental results did not exceed the baseline performance, they demonstrate the effectiveness of these methods in balancing simplicity and complexity, offering new insights into diffusion-based combinatorial optimization. This work represents the first application of Blackout Diffusion to combinatorial optimization, providing a foundation for further advancements in this domain. * The code is available for review at https://github.com/Giventicket/BlackoutDIFUSCO.
本研究将Blackout Diffusion整合到DIFUSCO框架中,通过连续时间建模、优化时间调度和反向过程来提升旅行商问题的求解质量,尽管未超越基线,但为组合优化提供了新思路。
This research introduces Blackout Diffusion into the DIFUSCO framework for solving the Traveling Salesman Problem, employing a continuous-time model with refined scheduling and reverse process enhancements to improve solution quality, though it did not surpass baseline performance.

Authors:Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Yutao Wu, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang
Title: Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety
Abstract:
The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (AI). These models are now foundational to a wide range of applications, including conversational AI, recommendation systems, autonomous driving, content generation, medical diagnostics, and scientific discovery. However, their widespread deployment also exposes them to significant safety risks, raising concerns about robustness, reliability, and ethical implications. This survey provides a systematic review of current safety research on large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-powered Agents. Our contributions are summarized as follows: (1) We present a comprehensive taxonomy of safety threats to these models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. (2) We review defense strategies proposed for each type of attacks if available and summarize the commonly used datasets and benchmarks for safety research. (3) Building on this, we identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices. More importantly, we highlight the necessity of collective efforts from the research community and international collaboration. Our work can serve as a useful reference for researchers and practitioners, fostering the ongoing development of comprehensive defense systems and platforms to safeguard AI models.
Chinese: 大型模型通过大规模预训练重塑了人工智能格局,但其广泛应用也带来了严重的安全隐患,本综述系统梳理了相关威胁与防御策略,为构建安全AI体系提供重要参考。
English: Large models are revolutionizing AI across various fields but face significant safety risks, prompting a systematic review of threats and defenses to guide future secure development.

Authors:Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo
Title: FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
Abstract:
DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high resolution outputs, further amplifying computational demands especially for single stage DiT models. To address these challenges, we propose a novel two stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low resolution generation process utilizing large parameters and sufficient NFEs to enhance computational efficiency. The second stage establishes flow matching between low and high resolutions, effectively generating fine details with minimal NFEs. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output and accordingly adjust the prompt before committing to full-resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability.
Chinese Summary: FlashVideo提出了一种两阶段框架,通过首先生成低分辨率视频确保提示准确性,再通过流匹配增强细节,在高效计算的同时实现了最先进的高分辨率视频生成。
English Summary: FlashVideo introduces a two-stage framework that efficiently balances computational cost and quality by first generating low-resolution videos with high prompt fidelity and then enhancing details through flow matching, achieving state-of-the-art high-resolution video generation.

Authors:Yunhang Shen, Chaoyou Fu, Shaoqi Dong, Xiong Wang, Yi-Fan Zhang, Peixian Chen, Mengdan Zhang, Haoyu Cao, Ke Li, Xiawu Zheng, Yan Zhang, Yiyi Zhou, Ran He, Caifeng Shan, Rongrong Ji, Xing Sun
Title: Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
Abstract:
We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing modalities of image, video, and text over 4K frames or 1M tokens while delivering advanced performances on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of 17M samples from public datasets only and demonstrates the state-of-the-art performance on various multi-modal benchmarks, compared against recent cutting-edge models with internal data. Long-VITA is fully reproducible and supports both NPU and GPU platforms for training and testing. By leveraging our inference designs, Long-VITA models achieve a remarkable 2x prefill speedup and 4x context length extension in single node with 8 GPUs. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.
Chinese: Long-VITA 是一种高效的大型多模态模型,擅长处理长上下文视觉语言任务,在基准测试中表现卓越,并通过创新的训练和推理技术实现了显著的加速和扩展能力。
English: Long-VITA is a highly efficient large multi-modal model that excels at processing long-context visual-language tasks, achieving state-of-the-art performance on benchmarks and offering significant speed and scalability improvements through innovative training and inference techniques.

Authors:Kaijie Zhu, Xianjun Yang, Jindong Wang, Wenbo Guo, William Yang Wang
Title: MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents
Abstract:
Recent research has explored that LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions. Existing defenses against IPI have significant limitations: either require essential model training resources, lack effectiveness against sophisticated attacks, or harm the normal utilities. We present MELON (Masked re-Execution and TooL comparisON), a novel IPI defense. Our approach builds on the observation that under a successful attack, the agent's next action becomes less dependent on user tasks and more on malicious tasks. Following this, we design MELON to detect attacks by re-executing the agent's trajectory with a masked user prompt modified through a masking function. We identify an attack if the actions generated in the original and masked executions are similar. We also include three key designs to reduce the potential false positives and false negatives. Extensive evaluation on the IPI benchmark AgentDojo demonstrates that MELON outperforms SOTA defenses in both attack prevention and utility preservation. Moreover, we show that combining MELON with a SOTA prompt augmentation defense (denoted as MELON-Aug) further improves its performance. We also conduct a detailed ablation study to validate our key designs. Code is available at https://github.com/kaijiezhu11/MELON.
中文: MELON是一种新型间接提示注入防御方法,通过对比原始执行与掩码提示下的智能体行为来检测恶意攻击,在安全防护与功能保持方面均优于现有防御方案。
English: MELON is a novel defense against indirect prompt injection attacks that detects malicious redirections by comparing agent actions during original and masked prompt executions, outperforming existing methods in both security and utility preservation.

Authors:Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin
Title: VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Abstract:
While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce \textbf{VideoRoPE}, with a \textit{3D structure} designed to preserve spatio-temporal relationships. VideoRoPE features \textit{low-frequency temporal allocation} to mitigate periodic oscillations, a \textit{diagonal layout} to maintain spatial symmetry, and \textit{adjustable temporal spacing} to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants, across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code will be available at \href{https://github.com/Wiselnn570/VideoRoPE}{https://github.com/Wiselnn570/VideoRoPE}.
Chinese: 本文提出VideoRoPE,一种专为视频数据设计的旋转位置嵌入三维扩展方法,通过保持时空关系克服了先前变体的局限性,并在多种视频任务中表现优异。
English: This paper introduces VideoRoPE, a 3D extension of Rotary Position Embedding designed specifically for video data, which overcomes limitations of previous variants by preserving spatio-temporal relationships and outperforms them across various video tasks.

Authors:Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein
Title: Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Abstract:
We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.
中文摘要:本研究提出了一种新颖的语言模型架构,通过隐式潜在空间推理扩展测试时计算,无需专门训练数据,在增加计算负载的情况下显著提升了推理基准测试性能。
English Summary: This study introduces a novel language model architecture that scales test-time computation through latent reasoning, requiring no specialized training data and outperforming traditional reasoning models on benchmarks with increased computational load.

Authors:Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze
Title: NoLiMa: Long-Context Evaluation Beyond Literal Matching
Abstract:
Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 13 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 11 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information. Even models enhanced with reasoning capabilities or CoT prompting struggle to maintain performance in long contexts. We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa.
中文: NoLiMa基准测试通过减少问题与关键信息间的词汇重叠,解决了现有长文本评估的局限性,结果显示13个主流大语言模型在上下文长度增加时性能显著下降,尽管它们声称支持超过12.8万词元的处理能力。
English: The NoLiMa benchmark addresses limitations in existing long-context evaluation by minimizing lexical overlap between questions and relevant information, revealing significant performance degradation in 13 major LLMs as context length increases despite their claimed 128K+ token capacity.

Authors:Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, Bo Li
Title: DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
Abstract:
The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplored due to the scarcity of open-source safety data in other languages. To address this gap, we propose a novel two-player Reinforcement Learning (RL) framework, where a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. We theoretically formalize this interaction as a two-player game, proving convergence to a Nash equilibrium. Empirical evaluations show that our model \ours outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks while being 4.5x faster at inference with a significantly smaller model (0.5B). We achieve substantial advancements in multilingual safety tasks, particularly in addressing the imbalance for lower-resource languages in a collected real dataset. Ablation studies emphasize the critical role of synthetic data generation in bridging the imbalance in open-source data between English and other languages. These findings establish a scalable and efficient approach to synthetic data generation, paving the way for improved multilingual guardrail models to enhance LLM safety. Code, model, and data will be open-sourced at https://github.com/yihedeng9/DuoGuard.
中文: 本研究提出了一种新颖的双玩家强化学习框架,通过生成高质量合成数据来增强多语言护栏模型,在性能和效率上均优于现有方法。
English: This study introduces a novel two-player reinforcement learning framework that generates high-quality synthetic data to enhance multilingual guardrail models, achieving superior performance and efficiency over existing methods.

Authors:Shiqin Tang, Shujian Yu, Yining Dong, S. Joe Qin
Title: Deep Dynamic Probabilistic Canonical Correlation Analysis
Abstract:
This paper presents Deep Dynamic Probabilistic Canonical Correlation Analysis (D2PCCA), a model that integrates deep learning with probabilistic modeling to analyze nonlinear dynamical systems. Building on the probabilistic extensions of Canonical Correlation Analysis (CCA), D2PCCA captures nonlinear latent dynamics and supports enhancements such as KL annealing for improved convergence and normalizing flows for a more flexible posterior approximation. D2PCCA naturally extends to multiple observed variables, making it a versatile tool for encoding prior knowledge about sequential datasets and providing a probabilistic understanding of the system's dynamics. Experimental validation on real financial datasets demonstrates the effectiveness of D2PCCA and its extensions in capturing latent dynamics.
中文: 本文提出D2PCCA模型,将深度学习与概率建模相结合,用于分析非线性动态系统,捕捉潜在动态并支持KL退火和标准化流等增强功能以提升性能。
English: This paper introduces D2PCCA, a model that combines deep learning with probabilistic modeling to analyze nonlinear dynamical systems, capturing latent dynamics and supporting enhancements like KL annealing and normalizing flows for better performance.

Authors:Zefan Yang, Xuanang Xu, Jiajin Zhang, Ge Wang, Mannudeep K. Kalra, Pingkun Yan
Title: Chest X-ray Foundation Model with Global and Local Representations Integration
Abstract:
Chest X-ray (CXR) is the most frequently ordered imaging test, supporting diverse clinical tasks from thoracic disease detection to postoperative monitoring. However, task-specific classification models are limited in scope, require costly labeled data, and lack generalizability to out-of-distribution datasets. To address these challenges, we introduce CheXFound, a self-supervised vision foundation model that learns robust CXR representations and generalizes effectively across a wide range of downstream tasks. We pretrain CheXFound on a curated CXR-1M dataset, comprising over one million unique CXRs from publicly available sources. We propose a Global and Local Representations Integration (GLoRI) module for downstream adaptations, by incorporating disease-specific local features with global image features for enhanced performance in multilabel classification. Our experimental results show that CheXFound outperforms state-of-the-art models in classifying 40 disease findings across different prevalence levels on the CXR-LT 24 dataset and exhibits superior label efficiency on downstream tasks with limited training data. Additionally, CheXFound achieved significant improvements on new tasks with out-of-distribution datasets, including opportunistic cardiovascular disease risk estimation and mortality prediction. These results highlight CheXFound's strong generalization capabilities, enabling diverse adaptations with improved label efficiency. The project source code is publicly available at https://github.com/RPIDIAL/CheXFound.
中文:CheXFound是一种自监督视觉基础模型,能够学习稳健的胸部X光表征,在多种疾病分类任务中超越现有最优模型,并在不同任务中展现出卓越的泛化能力和标签效率。
English: CheXFound is a self-supervised vision foundation model that learns robust chest X-ray representations, outperforming state-of-the-art models in classifying diseases and demonstrating strong generalization with improved label efficiency across diverse tasks.

Authors:Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, Carleigh Wood, Ann Lee, Wei-Ning Hsu
Title: Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound
Abstract:
The quantification of audio aesthetics remains a complex challenge in audio processing, primarily due to its subjective nature, which is influenced by human perception and cultural context. Traditional methods often depend on human listeners for evaluation, leading to inconsistencies and high resource demands. This paper addresses the growing need for automated systems capable of predicting audio aesthetics without human intervention. Such systems are crucial for applications like data filtering, pseudo-labeling large datasets, and evaluating generative audio models, especially as these models become more sophisticated. In this work, we introduce a novel approach to audio aesthetic evaluation by proposing new annotation guidelines that decompose human listening perspectives into four distinct axes. We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality. Our models are evaluated against human mean opinion scores (MOS) and existing methods, demonstrating comparable or superior performance. This research not only advances the field of audio aesthetics but also provides open-source models and datasets to facilitate future work and benchmarking. We release our code and pre-trained model at: https://github.com/facebookresearch/audiobox-aesthetics
中文: 本文提出了一种新颖的音频美学自动评估方法,通过将人类听觉视角分解为四个维度并训练无参考预测模型,在性能上达到或超越了人工评分和现有方法,同时提供了开源资源以供后续研究。
English: This paper introduces a novel automated approach for evaluating audio aesthetics by decomposing human listening perspectives into four axes and training no-reference prediction models, which demonstrate performance comparable or superior to human ratings and existing methods while providing open-source resources for future research.

Authors:Xiuyuan Hu, Guoqing Liu, Can Chen, Yang Zhao, Hao Zhang, Xue Liu
Title: 3DMolFormer: A Dual-channel Framework for Structure-based Drug Discovery
Abstract:
Structure-based drug discovery, encompassing the tasks of protein-ligand docking and pocket-aware 3D drug design, represents a core challenge in drug discovery. However, no existing work can deal with both tasks to effectively leverage the duality between them, and current methods for each task are hindered by challenges in modeling 3D information and the limitations of available data. To address these issues, we propose 3DMolFormer, a unified dual-channel transformer-based framework applicable to both docking and 3D drug design tasks, which exploits their duality by utilizing docking functionalities within the drug design process. Specifically, we represent 3D pocket-ligand complexes using parallel sequences of discrete tokens and continuous numbers, and we design a corresponding dual-channel transformer model to handle this format, thereby overcoming the challenges of 3D information modeling. Additionally, we alleviate data limitations through large-scale pre-training on a mixed dataset, followed by supervised and reinforcement learning fine-tuning techniques respectively tailored for the two tasks. Experimental results demonstrate that 3DMolFormer outperforms previous approaches in both protein-ligand docking and pocket-aware 3D drug design, highlighting its promising application in structure-based drug discovery. The code is available at: https://github.com/HXYfighter/3DMolFormer .
中文: 提出的3DMolFormer是一个统一的双通道Transformer框架,通过创新的令牌表示和大规模预训练同时解决蛋白质-配体对接和口袋感知3D药物设计任务,在两项任务中均优于现有方法。
English: The proposed 3DMolFormer is a unified dual-channel transformer framework that addresses both protein-ligand docking and pocket-aware 3D drug design by leveraging their duality through innovative token representation and large-scale pre-training, outperforming existing methods in both tasks.

Authors:Gorkem Can Ates, Yu Xin, Kuang Gong, Wei Shao
Title: DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions
Abstract:
Vision-language models (VLMs) have been widely applied to 2D medical image analysis due to their ability to align visual and textual representations. However, extending VLMs to 3D imaging remains computationally challenging. Existing 3D VLMs often rely on Vision Transformers (ViTs), which are computationally expensive due to the quadratic complexity of self-attention, or on 3D convolutions, which require large numbers of parameters and FLOPs as kernel size increases. We introduce DCFormer, an efficient 3D image encoder that factorizes 3D convolutions into three parallel 1D convolutions along the depth, height, and width dimensions. This design preserves spatial information while significantly reducing computational cost. Integrated into a CLIP-based vision-language framework, DCFormer is trained and evaluated on CT-RATE, a dataset of 50,188 paired 3D chest CT volumes and radiology reports. In zero-shot and fine-tuned detection of 18 pathologies, as well as in image-text retrieval tasks, DCFormer consistently outperforms state-of-the-art 3D vision encoders, including CT-ViT, ViT, ConvNeXt, PoolFormer, and TransUNet. These results highlight DCFormer's potential for scalable, clinically deployable 3D medical VLMs. Our code is available at: https://github.com/mirthAI/DCFormer.
Chinese: DCFormer提出了一种高效的3D图像编码器,通过分解的1D卷积在降低计算成本的同时保持视觉语言任务性能,在医学影像基准测试中超越了现有模型。
English: DCFormer introduces an efficient 3D image encoder using factorized 1D convolutions to reduce computational costs while maintaining performance in vision-language tasks, outperforming existing models on medical imaging benchmarks.

Authors:Loïck Chambon, Eloi Zablocki, Alexandre Boulch, Mickaël Chen, Matthieu Cord
Title: GaussRender: Learning 3D Occupancy with Gaussian Rendering
Abstract:
Understanding the 3D geometry and semantics of driving scenes is critical for safe autonomous driving. Recent advances in 3D occupancy prediction have improved scene representation but often suffer from visual inconsistencies, leading to floating artifacts and poor surface localization. Existing voxel-wise losses (e.g., cross-entropy) fail to enforce visible geometric coherence. In this paper, we propose GaussRender, a module that improves 3D occupancy learning by enforcing projective consistency. Our key idea is to project both predicted and ground-truth 3D occupancy into 2D camera views, where we apply supervision. Our method penalizes 3D configurations that produce inconsistent 2D projections, thereby enforcing a more coherent 3D structure. To achieve this efficiently, we leverage differentiable rendering with Gaussian splatting. GaussRender seamlessly integrates with existing architectures while maintaining efficiency and requiring no inference-time modifications. Extensive evaluations on multiple benchmarks (SurroundOcc-nuScenes, Occ3D-nuScenes, SSCBench-KITTI360) demonstrate that GaussRender significantly improves geometric fidelity across various 3D occupancy models (TPVFormer, SurroundOcc, Symphonies), achieving state-of-the-art results, particularly on surface-sensitive metrics such as RayIoU. The code is open-sourced at https://github.com/valeoai/GaussRender.
中文: 本文提出的GaussRender模块通过可微分渲染强制投影一致性来改进3D占据预测,在多个基准测试中显著提升了几何保真度,且不影响推理效率。
English: This paper introduces GaussRender, a module that enhances 3D occupancy prediction by enforcing projective consistency through differentiable rendering, significantly improving geometric fidelity across multiple benchmarks without altering inference efficiency.

Authors:Andrei Panferov, Jiale Chen, Soroush Tabesh, Roberto L. Castro, Mahdi Nikdan, Dan Alistarh
Title: QuEST: Stable Training of LLMs with 1-Bit Weights and Activations
Abstract:
One approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bits weights and activations. We advance this state-of-the-art via a new method called QuEST, for which we demonstrate optimality at 4-bits and stable convergence as low as 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.
中文摘要:QuEST方法通过改进量化技术和采用新型梯度估计器,将量化感知训练的最优位宽推进至4比特,并能在低至1比特的权重和激活下实现稳定收敛,同时保持模型精度。
English Summary: The QuEST method advances quantization-aware training by enabling stable convergence at extremely low bit-widths down to 1-bit, while maintaining accuracy through improved quantization techniques and a novel gradient estimator.

Authors:Daniel Marczak, Simone Magistri, Sebastian Cygert, Bartłomiej Twardowski, Andrew D. Bagdanov, Joost van de Weijer
Title: No Task Left Behind: Isotropic Model Merging with Common and Task-Specific Subspaces
Abstract:
Model merging integrates the weights of multiple task-specific models into a single multi-task model. Despite recent interest in the problem, a significant performance gap between the combined and single-task models remains. In this paper, we investigate the key characteristics of task matrices -- weight update matrices applied to a pre-trained model -- that enable effective merging. We show that alignment between singular components of task-specific and merged matrices strongly correlates with performance improvement over the pre-trained model. Based on this, we propose an isotropic merging framework that flattens the singular value spectrum of task matrices, enhances alignment, and reduces the performance gap. Additionally, we incorporate both common and task-specific subspaces to further improve alignment and performance. Our proposed approach achieves state-of-the-art performance on vision and language tasks across various sets of tasks and model scales. This work advances the understanding of model merging dynamics, offering an effective methodology to merge models without requiring additional training. Code is available at https://github.com/danielm1405/iso-merging .
Chinese: 本文提出了一种各向同性的模型融合框架,通过平坦化奇异值谱并结合公共与任务特定子空间,无需额外训练即可实现最先进的性能。
English: This paper introduces an isotropic merging framework that enhances model merging by flattening singular value spectra and incorporating common and task-specific subspaces, achieving state-of-the-art performance without additional training.

Authors:Jiayang Yu, Yihang Zhang, Bin Wang, Peiqin Lin, Yongkang Liu, Shi Feng
Title: SSMLoRA: Enhancing Low-Rank Adaptation with State Space Model
Abstract:
Fine-tuning is a key approach for adapting language models to specific downstream tasks, but updating all model parameters becomes impractical as model sizes increase. Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), address this challenge by introducing additional adaptation parameters into pre-trained weight matrices. However, LoRA's performance varies across different insertion points within the model, highlighting potential parameter inefficiency due to unnecessary insertions. To this end, we propose SSMLoRA (State Space Model Low-Rank Adaptation), an extension of LoRA that incorporates a State Space Model (SSM) to interconnect low-rank matrices. SSMLoRA ensures that performance is maintained even with sparser insertions. SSMLoRA allows the model to not only map inputs to a low-rank space for better feature extraction but also leverage the computations from the previous low-rank space. Our method achieves comparable performance to LoRA on the General Language Understanding Evaluation (GLUE) benchmark while using only half the parameters. Additionally, due to its structure, SSMLoRA shows promise in handling tasks with longer input sequences. .You can find our code here:https://github.com/yuhkalhic/SSMLoRA.
Chinese: SSMLoRA通过引入状态空间模型连接低秩矩阵,在GLUE基准测试中以仅一半参数实现与LoRA相当的性能,并展现出处理长输入序列的潜力。
English: SSMLoRA enhances LoRA by integrating a State Space Model to connect low-rank matrices, achieving comparable performance on the GLUE benchmark with only half the parameters and showing potential for longer input sequences.

Authors:Craig Myles, In Hwa Um, Craig Marshall, David Harris-Birtill, David J. Harrison
Title: SurGen: 1020 H&E-stained Whole Slide Images With Survival and Genetic Markers
Abstract:
$\textbf{Background}$: Cancer remains one of the leading causes of morbidity and mortality worldwide. Comprehensive datasets that combine histopathological images with genetic and survival data across various tumour sites are essential for advancing computational pathology and personalised medicine. $\textbf{Results}$: We present SurGen, a dataset comprising 1,020 H&E-stained whole slide images (WSIs) from 843 colorectal cancer cases. The dataset includes detailed annotations for key genetic mutations (KRAS, NRAS, BRAF) and mismatch repair status, as well as survival data for 426 cases. To demonstrate SurGen's practical utility, we conducted a proof-of-concept machine learning experiment predicting mismatch repair status from the WSIs, achieving a test AUROC of 0.8316. These preliminary results underscore the dataset's potential to facilitate research in biomarker discovery, prognostic modelling, and advanced machine learning applications in colorectal cancer. $\textbf{Conclusions}$: SurGen offers a valuable resource for the scientific community, enabling studies that require high-quality WSIs linked with comprehensive clinical and genetic information on colorectal cancer. Our initial findings affirm the dataset's capacity to advance diagnostic precision and foster the development of personalised treatment strategies in colorectal oncology. Data available online at https://doi.org/10.6019/S-BIAD1285.
中文:SurGen数据集整合了1020张结直肠癌全切片图像及其基因突变注释与生存数据,通过机器学习模型成功预测错配修复状态,证实了其在推进精准医疗研究中的实用价值。
English: The SurGen dataset provides 1,020 whole slide images from colorectal cancer cases with genetic mutation annotations and survival data, demonstrating its utility through a machine learning model that achieved high accuracy in predicting mismatch repair status.

Authors:Juan Miguel Lopez Alcaraz, Ebenezer Oloyede, David Taylor, Wilhelm Haverkamp, Nils Strodthoff
Title: Explainable and externally validated machine learning for neuropsychiatric diagnosis via electrocardiograms
Abstract:
Electrocardiogram (ECG) analysis has emerged as a promising tool for identifying physiological changes associated with neuropsychiatric conditions. The relationship between cardiovascular health and neuropsychiatric disorders suggests that ECG abnormalities could serve as valuable biomarkers for more efficient detection, therapy monitoring, and risk stratification. However, the potential of the ECG to accurately distinguish neuropsychiatric conditions, particularly among diverse patient populations, remains underexplored. This study utilized ECG markers and basic demographic data to predict neuropsychiatric conditions using machine learning models, with targets defined through ICD-10 codes. Both internal and external validation were performed using the MIMIC-IV and ECG-View datasets respectively. Performance was assessed using AUROC scores. To enhance model interpretability, Shapley values were applied to provide insights into the contributions of individual ECG features to the predictions. Significant predictive performance was observed for conditions within the neurological and psychiatric groups. For the neurological group, Alzheimer's disease (G30) achieved an internal AUROC of 0.813 (0.812-0.814) and an external AUROC of 0.868 (0.867-0.868). In the psychiatric group, unspecified dementia (F03) showed an internal AUROC of 0.849 (0.848-0.849) and an external AUROC of 0.862 (0.861-0.863). Discriminative features align with known ECG markers but also provide hints on potentially new markers. ECG offers significant promise for diagnosing and monitoring neuropsychiatric conditions, with robust predictive performance across internal and external cohorts. Future work should focus on addressing potential confounders, such as therapy-related cardiotoxicity, and expanding the scope of ECG applications, including personalized care and early intervention strategies.
中文: 本研究证明通过机器学习分析心电图标记可有效预测神经精神疾病,在内部和外部验证中均表现稳健,同时揭示了潜在的新型诊断生物标志物。
English: This study demonstrates that electrocardiogram (ECG) markers analyzed through machine learning can effectively predict neuropsychiatric conditions, achieving robust performance in both internal and external validations while revealing potential new diagnostic biomarkers.

Authors:Etienne Gauthier, Francis Bach, Michael I. Jordan
Title: Statistical Collusion by Collectives on Learning Platforms
Abstract:
As platforms increasingly rely on learning algorithms, collectives may form and seek ways to influence these platforms to align with their own interests. This can be achieved by coordinated submission of altered data. To evaluate the potential impact of such behavior, it is essential to understand the computations that collectives must perform to impact platforms in this way. In particular, collectives need to make a priori assessments of the effect of the collective before taking action, as they may face potential risks when modifying their data. Moreover they need to develop implementable coordination algorithms based on quantities that can be inferred from observed data. We develop a framework that provides a theoretical and algorithmic treatment of these issues and present experimental results in a product evaluation domain.
中文: 群体可通过协调提交篡改数据来影响平台,这需要事先评估影响并制定可实施的算法,我们的框架在理论及产品评估实验中对此进行了研究。
English: Collectives can influence platforms by coordinating altered data submissions, requiring a priori impact assessments and implementable algorithms, which our framework theoretically and experimentally addresses in product evaluation.

Authors:Alexandre Cionca, Chun Hei Michael Chan, Dimitri Van De Ville
Title: Community detection for directed networks revisited using bimodularity
Abstract:
Community structure is a key feature omnipresent in real-world network data. Plethora of methods have been proposed to reveal subsets of densely interconnected nodes using criteria such as the modularity index. These approaches have been successful for undirected graphs, but directed edge information has not yet been dealt with in a satisfactory way. Here, we revisit the concept of directed communities as a mapping between sending and receiving communities. This translates into a new definition that we term bimodularity. Using convex relaxation, bimodularity can be optimized with the singular value decomposition of the directed modularity matrix. Subsequently, we propose an edge-based clustering approach to reveal the directed communities including their mappings. The feasibility of the new framework is illustrated on a synthetic model and further applied to the neuronal wiring diagram of the \textit{C. elegans}, for which it yields meaningful feedforward loops of the head and body motion systems. This framework sets the ground for the understanding and detection of community structures in directed networks.
中文摘要:本文提出了一种名为双模块化的新框架,通过发送与接收节点的映射关系,利用奇异值分解检测有向网络中的社区结构,并在合成模型和秀丽隐杆线虫神经元图谱中得到验证。
English Summary: This paper introduces a novel framework called bimodularity for detecting directed communities in networks by mapping sending and receiving nodes through singular value decomposition, validated on synthetic models and the C. elegans neuronal diagram.

Authors:Yijun Wang, Yong Wang, Chendong xu, Shuai Yao, Qisong Wu
Title: SelaFD:Seamless Adaptation of Vision Transformer Fine-tuning for Radar-based Human Activity Recognition
Abstract:
Human Activity Recognition (HAR) such as fall detection has become increasingly critical due to the aging population, necessitating effective monitoring systems to prevent serious injuries and fatalities associated with falls. This study focuses on fine-tuning the Vision Transformer (ViT) model specifically for HAR using radar-based Time-Doppler signatures. Unlike traditional image datasets, these signals present unique challenges due to their non-visual nature and the high degree of similarity among various activities. Directly fine-tuning the ViT with all parameters proves suboptimal for this application. To address this challenge, we propose a novel approach that employs Low-Rank Adaptation (LoRA) fine-tuning in the weight space to facilitate knowledge transfer from pre-trained ViT models. Additionally, to extract fine-grained features, we enhance feature representation through the integration of a serial-parallel adapter in the feature space. Our innovative joint fine-tuning method, tailored for radar-based Time-Doppler signatures, significantly improves HAR accuracy, surpassing existing state-of-the-art methodologies in this domain. Our code is released at https://github.com/wangyijunlyy/SelaFD.
中文: 本研究提出了一种新颖的联合微调方法,通过低秩适应和串并联适配器优化视觉变换器在雷达时频信号上的人类活动识别性能,实现了超越现有方法的更高精度。
English: This study introduces a novel joint fine-tuning approach using Low-Rank Adaptation and serial-parallel adapters to optimize Vision Transformers for human activity recognition with radar-based Time-Doppler signatures, achieving superior accuracy over existing methods.

Authors:Yuwei Yin, Giuseppe Carenini
Title: ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities on complex evaluation benchmarks, many of which are formulated as question-answering (QA) tasks. Enhancing the performance of LLMs in QA contexts is becoming increasingly vital for advancing their development and applicability. This paper introduces ARR, an intuitive, effective, and general QA solving method that explicitly incorporates three key steps: analyzing the intent of the question, retrieving relevant information, and reasoning step by step. Notably, this paper is the first to introduce intent analysis in QA, which plays a vital role in ARR. Comprehensive evaluations across 10 diverse QA tasks demonstrate that ARR consistently outperforms the baseline methods. Ablation and case studies further validate the positive contributions of each ARR component. Furthermore, experiments involving variations in prompt design indicate that ARR maintains its effectiveness regardless of the specific prompt formulation. Additionally, extensive evaluations across various model sizes, LLM series, and generation settings solidify the effectiveness, robustness, and generalizability of ARR.
中文: 本文提出的ARR方法通过意图分析、信息检索和逐步推理三大步骤,显著提升了大语言模型在问答任务中的表现,并在多种测试中展现出优于基线方法的稳定性和泛化能力。
English: This paper introduces ARR, a novel question-answering method that enhances LLM performance through intent analysis, information retrieval, and step-by-step reasoning, demonstrating consistent superiority across diverse tasks and robust generalizability.

Authors:Soichiro Murakami, Peinan Zhang, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura
Title: AdParaphrase: Paraphrase Dataset for Analyzing Linguistic Features toward Generating Attractive Ad Texts
Abstract:
Effective linguistic choices that attract potential customers play crucial roles in advertising success. This study aims to explore the linguistic features of ad texts that influence human preferences. Although the creation of attractive ad texts is an active area of research, progress in understanding the specific linguistic features that affect attractiveness is hindered by several obstacles. First, human preferences are complex and influenced by multiple factors, including their content, such as brand names, and their linguistic styles, making analysis challenging. Second, publicly available ad text datasets that include human preferences are lacking, such as ad performance metrics and human feedback, which reflect people's interests. To address these problems, we present AdParaphrase, a paraphrase dataset that contains human preferences for pairs of ad texts that are semantically equivalent but differ in terms of wording and style. This dataset allows for preference analysis that focuses on the differences in linguistic features. Our analysis revealed that ad texts preferred by human judges have higher fluency, longer length, more nouns, and use of bracket symbols. Furthermore, we demonstrate that an ad text-generation model that considers these findings significantly improves the attractiveness of a given text. The dataset is publicly available at: https://github.com/CyberAgentAILab/AdParaphrase.
中文摘要:本研究推出AdParaphrase数据集,通过分析语言特征(如流畅度与措辞)对广告文本偏好的影响,证明整合这些发现能显著提升广告吸引力。
English Summary: This study introduces AdParaphrase, a dataset enabling analysis of how linguistic features like fluency and word choice influence ad text preferences, and demonstrates that incorporating these insights significantly enhances ad attractiveness.

Authors:Amitayush Thakur, George Tsoukalas, Greg Durrett, Swarat Chaudhuri
Title: ProofWala: Multilingual Proof Data Synthesis and Theorem-Proving
Abstract:
Neural networks have shown substantial promise at automatic theorem-proving in interactive proof assistants (ITPs) like Lean and Coq. However, most neural theorem-proving models are restricted to specific ITPs, leaving out opportunities for cross-lingual $\textit{transfer}$ between ITPs. We address this weakness with a multilingual proof framework, ${\rm P{\small ROOF}W{\small ALA}}$, that allows a standardized form of interaction between neural theorem-provers and two established ITPs (Coq and Lean). It enables the collection of multilingual proof step data -- data recording the result of proof actions on ITP states -- for training neural provers. ${\rm P{\small ROOF}W{\small ALA}}$ allows the systematic evaluation of a model's performance across different ITPs and problem domains via efficient parallel proof search algorithms. We show that multilingual training enabled by ${\rm P{\small ROOF}W{\small ALA}}$ can lead to successful transfer across ITPs. Specifically, a model trained on a mix of ${\rm P{\small ROOF}W{\small ALA}}$-generated Coq and Lean data outperforms Lean-only and Coq-only models on the standard prove-at-$k$ metric. We open source all code including code for the ${\rm P{\small ROOF}W{\small ALA}}$ Framework (https://github.com/trishullab/proof-wala), and the Multilingual ITP interaction framework (https://github.com/trishullab/itp-interface).
中文:多语言证明框架ProofWala实现了Coq和Lean等交互式定理证明器间的跨语言迁移,其多语言训练模型在标准评估指标上优于单一语言模型。
English: The multilingual proof framework ProofWala enables cross-lingual transfer between interactive theorem provers like Coq and Lean, with multilingual training demonstrating superior performance over single-language models.

Authors:Zhiqiang Yang, Qiu Guan, Zhongwen Yu, Xinli Xu, Haixia Long, Sheng Lian, Haigen Hu, Ying Tang
Title: MHAF-YOLO: Multi-Branch Heterogeneous Auxiliary Fusion YOLO for accurate object detection
Abstract:
Due to the effective multi-scale feature fusion capabilities of the Path Aggregation FPN (PAFPN), it has become a widely adopted component in YOLO-based detectors. However, PAFPN struggles to integrate high-level semantic cues with low-level spatial details, limiting its performance in real-world applications, especially with significant scale variations. In this paper, we propose MHAF-YOLO, a novel detection framework featuring a versatile neck design called the Multi-Branch Auxiliary FPN (MAFPN), which consists of two key modules: the Superficial Assisted Fusion (SAF) and Advanced Assisted Fusion (AAF). The SAF bridges the backbone and the neck by fusing shallow features, effectively transferring crucial low-level spatial information with high fidelity. Meanwhile, the AAF integrates multi-scale feature information at deeper neck layers, delivering richer gradient information to the output layer and further enhancing the model learning capacity. To complement MAFPN, we introduce the Global Heterogeneous Flexible Kernel Selection (GHFKS) mechanism and the Reparameterized Heterogeneous Multi-Scale (RepHMS) module to enhance feature fusion. RepHMS is globally integrated into the network, utilizing GHFKS to select larger convolutional kernels for various feature layers, expanding the vertical receptive field and capturing contextual information across spatial hierarchies. Locally, it optimizes convolution by processing both large and small kernels within the same layer, broadening the lateral receptive field and preserving crucial details for detecting smaller targets. The source code of this work is available at: https://github.com/yang-0201/MHAF-YOLO.
中文: MHAF-YOLO框架采用多分支辅助特征金字塔网络,通过浅层辅助融合和高级辅助融合模块优化多尺度特征整合,并结合全局异构卷积核选择与重参数化多尺度模块,有效扩大感受野并保留细节,显著提升了多尺度目标检测性能。
English: The MHAF-YOLO framework introduces a Multi-Branch Auxiliary FPN with SAF and AAF modules to better fuse multi-scale features, complemented by GHFKS and RepHMS mechanisms that enhance receptive fields and preserve details for improved object detection across varying scales.

Authors:Lin Tian, Emily Booth, Francesco Bailo, Julian Droogan, Marian-Andrei Rizoiu
Title: Before It's Too Late: A State Space Model for the Early Prediction of Misinformation and Disinformation Engagement
Abstract:
In today's digital age, conspiracies and information campaigns can emerge rapidly and erode social and democratic cohesion. While recent deep learning approaches have made progress in modeling engagement through language and propagation models, they struggle with irregularly sampled data and early trajectory assessment. We present IC-Mamba, a novel state space model that forecasts social media engagement by modeling interval-censored data with integrated temporal embeddings. Our model excels at predicting engagement patterns within the crucial first 15-30 minutes of posting (RMSE 0.118-0.143), enabling rapid assessment of content reach. By incorporating interval-censored modeling into the state space framework, IC-Mamba captures fine-grained temporal dynamics of engagement growth, achieving a 4.72% improvement over state-of-the-art across multiple engagement metrics (likes, shares, comments, and emojis). Our experiments demonstrate IC-Mamba's effectiveness in forecasting both post-level dynamics and broader narrative patterns (F1 0.508-0.751 for narrative-level predictions). The model maintains strong predictive performance across extended time horizons, successfully forecasting opinion-level engagement up to 28 days ahead using observation windows of 3-10 days. These capabilities enable earlier identification of potentially problematic content, providing crucial lead time for designing and implementing countermeasures. Code is available at: https://github.com/ltian678/ic-mamba. An interactive dashboard demonstrating our results is available at: https://ic-mamba.behavioral-ds.science.
中文: IC-Mamba是一种新颖的状态空间模型,通过结合时间嵌入处理区间删失数据,能有效预测社交媒体参与度,在早期轨迹评估和长期预测方面均表现优异。
English: IC-Mamba is a novel state space model that effectively forecasts social media engagement by modeling interval-censored data with temporal embeddings, achieving superior performance in early trajectory assessment and extended time horizon predictions.

Authors:Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, Yueming Jin
Title: Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools
Abstract:
We introduce Agentic Reasoning, a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents. Agentic Reasoning dynamically leverages web search, code execution, and structured memory to address complex problems requiring deep research. A key innovation in our framework is the Mind-Map agent, which constructs a structured knowledge graph to store reasoning context and track logical relationships, ensuring coherence in long reasoning chains with extensive tool usage. Additionally, we conduct a comprehensive exploration of the Web-Search agent, leading to a highly effective search mechanism that surpasses all prior approaches. When deployed on DeepSeek-R1, our method achieves a new state-of-the-art (SOTA) among public models and delivers performance comparable to OpenAI Deep Research, the leading proprietary model in this domain. Extensive ablation studies validate the optimal selection of agentic tools and confirm the effectiveness of our Mind-Map and Web-Search agents in enhancing LLM reasoning. The code is at: https://github.com/theworldofagents/Agentic-Reasoning
Chinese: Agentic Reasoning框架通过整合网络搜索、代码执行和思维导图代理等动态工具,增强了大型语言模型的推理能力,其性能达到了与领先专有模型相媲美的顶尖水平。
English: The Agentic Reasoning framework enhances large language model reasoning by integrating dynamic tools like web search, code execution, and a Mind-Map agent for structured knowledge tracking, achieving state-of-the-art performance comparable to leading proprietary models.

Authors:Brian Formento, Chuan Sheng Foo, See-Kiong Ng
Title: Confidence Elicitation: A New Attack Vector for Large Language Models
Abstract:
A fundamental issue in deep learning has been adversarial robustness. As these systems have scaled, such issues have persisted. Currently, large language models (LLMs) with billions of parameters suffer from adversarial attacks just like their earlier, smaller counterparts. However, the threat models have changed. Previously, having gray-box access, where input embeddings or output logits/probabilities were visible to the user, might have been reasonable. However, with the introduction of closed-source models, no information about the model is available apart from the generated output. This means that current black-box attacks can only utilize the final prediction to detect if an attack is successful. In this work, we investigate and demonstrate the potential of attack guidance, akin to using output probabilities, while having only black-box access in a classification setting. This is achieved through the ability to elicit confidence from the model. We empirically show that the elicited confidence is calibrated and not hallucinated for current LLMs. By minimizing the elicited confidence, we can therefore increase the likelihood of misclassification. Our new proposed paradigm demonstrates promising state-of-the-art results on three datasets across two models (LLaMA-3-8B-Instruct and Mistral-7B-Instruct-V0.3) when comparing our technique to existing hard-label black-box attack methods that introduce word-level substitutions.
中文: 本研究提出了一种新型黑盒攻击方法,通过从大语言模型中获取校准后的置信度分数来指导对抗性攻击,在无需模型内部信息的情况下仅通过词级替换就实现了最先进的误分类效果。
English: This study introduces a novel black-box attack method that guides adversarial attacks by eliciting calibrated confidence scores from large language models, achieving state-of-the-art misclassification rates through word-level substitutions without accessing internal model information.

Authors:Yong Li, Yingjing Huang, Gengchen Mai, Fan Zhang
Title: Learning Street View Representations with Spatiotemporal Contrast
Abstract:
Street view imagery is extensively utilized in representation learning for urban visual environments, supporting various sustainable development tasks such as environmental perception and socio-economic assessment. However, it is challenging for existing image representations to specifically encode the dynamic urban environment (such as pedestrians, vehicles, and vegetation), the built environment (including buildings, roads, and urban infrastructure), and the environmental ambiance (such as the cultural and socioeconomic atmosphere) depicted in street view imagery to address downstream tasks related to the city. In this work, we propose an innovative self-supervised learning framework that leverages temporal and spatial attributes of street view imagery to learn image representations of the dynamic urban environment for diverse downstream tasks. By employing street view images captured at the same location over time and spatially nearby views at the same time, we construct contrastive learning tasks designed to learn the temporal-invariant characteristics of the built environment and the spatial-invariant neighborhood ambiance. Our approach significantly outperforms traditional supervised and unsupervised methods in tasks such as visual place recognition, socioeconomic estimation, and human-environment perception. Moreover, we demonstrate the varying behaviors of image representations learned through different contrastive learning objectives across various downstream tasks. This study systematically discusses representation learning strategies for urban studies based on street view images, providing a benchmark that enhances the applicability of visual data in urban science. The code is available at https://github.com/yonglleee/UrbanSTCL.
中文摘要:本研究提出一种创新自监督学习框架,利用街景图像的时空特性学习城市动态环境表征,在视觉场所识别、社会经济评估等任务中显著优于传统方法,为城市研究提供了有效的视觉表征基准。
English Summary: This paper introduces a self-supervised learning framework that uses temporal and spatial attributes of street view imagery to effectively represent dynamic urban environments, built environments, and neighborhood ambiance, outperforming existing methods in various urban tasks.

Authors:Amy Smith, Barrett R. Anderson, Jasmine Tan Otto, Isaac Karth, Yuqian Sun, John Joon Young Chung, Melissa Roemmele, Max Kreminski
Title: Fuzzy Linkography: Automatic Graphical Summarization of Creative Activity Traces
Abstract:
Linkography -- the analysis of links between the design moves that make up an episode of creative ideation or design -- can be used for both visual and quantitative assessment of creative activity traces. Traditional linkography, however, is time-consuming, requiring a human coder to manually annotate both the design moves within an episode and the connections between them. As a result, linkography has not yet been much applied at scale. To address this limitation, we introduce fuzzy linkography: a means of automatically constructing a linkograph from a sequence of recorded design moves via a "fuzzy" computational model of semantic similarity, enabling wider deployment and new applications of linkographic techniques. We apply fuzzy linkography to three markedly different kinds of creative activity traces (text-to-image prompting journeys, LLM-supported ideation sessions, and researcher publication histories) and discuss our findings, as well as strengths, limitations, and potential future applications of our approach.
中文: 模糊链接图通过语义相似性自动分析创意设计步骤,克服了传统方法耗时且依赖人工的局限,实现了更广泛的应用。
English: Fuzzy linkography automates the analysis of creative design moves using semantic similarity, enabling scalable and diverse applications beyond traditional manual methods.

Authors:Kunxiao Liu, Guowu Yuan, Hongyu Liu, Hao Wu
Title: Multiscale style transfer based on a Laplacian pyramid for traditional Chinese painting
Abstract:
Style transfer is adopted to synthesize appealing stylized images that preserve the structure of a content image but carry the pattern of a style image. Many recently proposed style transfer methods use only western oil paintings as style images to achieve image stylization. As a result, unnatural messy artistic effects are produced in stylized images when using these methods to directly transfer the patterns of traditional Chinese paintings, which are composed of plain colors and abstract objects. Moreover, most of them work only at the original image scale and thus ignore multiscale image information during training. In this paper, we present a novel effective multiscale style transfer method based on Laplacian pyramid decomposition and reconstruction, which can transfer unique patterns of Chinese paintings by learning different image features at different scales. In the first stage, the holistic patterns are transferred at low resolution by adopting a Style Transfer Base Network. Then, the details of the content and style are gradually enhanced at higher resolutions by a Detail Enhancement Network with an edge information selection (EIS) module in the second stage. The effectiveness of our method is demonstrated through the generation of appealing high-quality stylization results and a comparison with some state-of-the-art style transfer methods. Datasets and codes are available at https://github.com/toby-katakuri/LP_StyleTransferNet.
Chinese: 本文提出了一种基于拉普拉斯金字塔分解的多尺度风格迁移方法,通过在不同尺度学习图像特征,有效迁移中国传统绘画的独特图案,解决了现有方法在处理此类风格时产生不自然效果且忽略多尺度信息的问题。
English: This paper introduces a multiscale style transfer method using Laplacian pyramid decomposition to effectively transfer patterns from traditional Chinese paintings by learning features at different scales, overcoming limitations of existing methods that produce unnatural results with such styles and ignore multiscale information.

Authors:Sandra C. Sandoval, Christabel Acquaye, Kwesi Cobbina, Mohammad Nayeem Teli, Hal Daumé
Title: My LLM might Mimic AAE -- But When Should it?
Abstract:
We examine the representation of African American English (AAE) in large language models (LLMs), exploring (a) the perceptions Black Americans have of how effective these technologies are at producing authentic AAE, and (b) in what contexts Black Americans find this desirable. Through both a survey of Black Americans ($n=$ 104) and annotation of LLM-produced AAE by Black Americans ($n=$ 228), we find that Black Americans favor choice and autonomy in determining when AAE is appropriate in LLM output. They tend to prefer that LLMs default to communicating in Mainstream U.S. English in formal settings, with greater interest in AAE production in less formal settings. When LLMs were appropriately prompted and provided in context examples, our participants found their outputs to have a level of AAE authenticity on par with transcripts of Black American speech. Select code and data for our project can be found here: https://github.com/smelliecat/AAEMime.git
中文摘要:本研究探讨了美国黑人对大型语言模型中非裔美国人英语(AAE)真实性与适用性的看法,发现他们更倾向于自主决定AAE的使用场景——在正式场合偏好主流英语,非正式场合则接受AAE输出,且经恰当提示的模型生成的AAE真实性可与真人语音媲美。
English Summary: The study investigates Black Americans' views on the authenticity and desirability of African American English (AAE) in large language models, finding they prefer autonomy in choosing AAE usage—favoring it in informal contexts while preferring mainstream English in formal settings, with appropriately prompted models achieving speech authenticity comparable to human transcripts.

Authors:Congjie He, Yeqi Huang, Pei Mu, Ziming Miao, Jilong Xue, Lingxiao Ma, Fan Yang, Luo Mai
Title: WaferLLM: Large Language Model Inference at Wafer Scale
Abstract:
Emerging AI accelerators increasingly adopt wafer-scale manufacturing technologies, integrating hundreds of thousands of AI cores in a mesh architecture with large distributed on-chip memory (tens of GB in total) and ultra-high on-chip memory bandwidth (tens of PB/s). However, current LLM inference systems, optimized for shared memory architectures like GPUs, fail to exploit these accelerators fully. We introduce WaferLLM, the first wafer-scale LLM inference system. WaferLLM is guided by a novel PLMR model (pronounced as "Plummer") that captures the unique hardware characteristics of wafer-scale architectures. Leveraging this model, WaferLLM pioneers wafer-scale LLM parallelism, optimizing the utilization of hundreds of thousands of on-chip cores. It also introduces MeshGEMM and MeshGEMV, the first GEMM and GEMV implementations designed to scale effectively on wafer-scale accelerators. Evaluations show that WaferLLM achieves up to 200$\times$ higher accelerator utilization than state-of-the-art methods. Leveraging a wafer-scale accelerator (Cerebras WSE2), WaferLLM delivers GEMV operations 606$\times$ faster and 16$\times$ more energy-efficient than on an NVIDIA A100 GPU. For full LLM inference, WaferLLM achieves 10-20$\times$ speedups over A100 GPU clusters running SGLang and vLLM. These advantages are expected to grow as wafer-scale AI models, software, and hardware continue to mature. WaferLLM is open-sourced at https://github.com/MeshInfra/WaferLLM.
中文: WaferLLM是首个晶圆级大语言模型推理系统,通过创新的PLMR模型和晶圆级并行技术优化AI加速器性能,相比现有方法实现了高达200倍的利用率提升及显著的运行速度和能效优势。
English: WaferLLM is the first wafer-scale LLM inference system that introduces a novel PLMR model and wafer-scale parallelism to optimize performance on AI accelerators, achieving up to 200x higher utilization and significant speed and energy efficiency improvements over existing methods.

Authors:Keshav Bhandari, Sungkyun Chang, Tongyu Lu, Fareza R. Enus, Louis B. Bradshaw, Dorien Herremans, Simon Colton
Title: ImprovNet -- Generating Controllable Musical Improvisations with Iterative Corruption Refinement
Abstract:
Despite deep learning's remarkable advances in style transfer across various domains, generating controllable performance-level musical style transfer for complete symbolically represented musical works remains a challenging area of research. Much of this is owed to limited datasets, especially for genres such as jazz, and the lack of unified models that can handle multiple music generation tasks. This paper presents ImprovNet, a transformer-based architecture that generates expressive and controllable musical improvisations through a self-supervised corruption-refinement training strategy. The improvisational style transfer is aimed at making meaningful modifications to one or more musical elements - melody, harmony or rhythm of the original composition with respect to the target genre. ImprovNet unifies multiple capabilities within a single model: it can perform cross-genre and intra-genre improvisations, harmonize melodies with genre-specific styles, and execute short prompt continuation and infilling tasks. The model's iterative generation framework allows users to control the degree of style transfer and structural similarity to the original composition. Objective and subjective evaluations demonstrate ImprovNet's effectiveness in generating musically coherent improvisations while maintaining structural relationships with the original pieces. The model outperforms Anticipatory Music Transformer in short continuation and infilling tasks and successfully achieves recognizable genre conversion, with 79\% of participants correctly identifying jazz-style improvisations of classical pieces. Our code and demo page can be found at https://github.com/keshavbhandari/improvnet.
中文: ImprovNet是一种基于Transformer的模型,能够通过迭代生成实现对完整符号音乐作品的可控风格转换,统一了跨流派即兴创作、旋律和声化及结构任务,在保持音乐连贯性的同时,其评估表现优于现有方法。
English: ImprovNet is a transformer-based model that enables controllable musical style transfer for complete symbolic compositions, unifying cross-genre improvisation, melody harmonization, and structural tasks through iterative generation while maintaining musical coherence and outperforming existing methods in evaluations.

Authors:Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, Hao Zhang
Title: Fast Video Generation with Sliding Tile Attention
Abstract:
Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Precisely, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench. We make our codebase public at https://github.com/hao-ai-lab/FastVideo.
中文: 本文提出滑动分块注意力(STA),通过聚焦局部三维窗口的硬件高效方法,将视频生成中的注意力计算时间最多减少17倍且不损失质量。
English: This paper introduces sliding tile attention (STA), a hardware-efficient method that accelerates video generation by focusing on localized 3D windows, reducing attention computation time by up to 17 times without quality loss.

Authors:Shurui Gui, Xiner Li, Shuiwang Ji
Title: Discovering Physics Laws of Dynamical Systems via Invariant Function Learning
Abstract:
We consider learning underlying laws of dynamical systems governed by ordinary differential equations (ODE). A key challenge is how to discover intrinsic dynamics across multiple environments while circumventing environment-specific mechanisms. Unlike prior work, we tackle more complex environments where changes extend beyond function coefficients to entirely different function forms. For example, we demonstrate the discovery of ideal pendulum's natural motion $α^2 \sin{θ_t}$ by observing pendulum dynamics in different environments, such as the damped environment $α^2 \sin(θ_t) - ρω_t$ and powered environment $α^2 \sin(θ_t) + ρ\frac{ω_t}{\left|ω_t\right|}$. Here, we formulate this problem as an \emph{invariant function learning} task and propose a new method, known as \textbf{D}isentanglement of \textbf{I}nvariant \textbf{F}unctions (DIF), that is grounded in causal analysis. We propose a causal graph and design an encoder-decoder hypernetwork that explicitly disentangles invariant functions from environment-specific dynamics. The discovery of invariant functions is guaranteed by our information-based principle that enforces the independence between extracted invariant functions and environments. Quantitative comparisons with meta-learning and invariant learning baselines on three ODE systems demonstrate the effectiveness and efficiency of our method. Furthermore, symbolic regression explanation results highlight the ability of our framework to uncover intrinsic laws. Our code has been released as part of the AIRS library (\href{https://github.com/divelab/AIRS/tree/main/OpenODE/DIF}{https://github.com/divelab/AIRS/}).
Chinese: 本研究提出了一种名为解耦不变函数(DIF)的新方法,通过因果分析和编码器-解码器超网络,在多环境中学习常微分方程系统的内在动力学,基于信息论原则保证不变函数的发现,并在揭示基本规律方面显示出高效性。
English: This study introduces the Disentanglement of Invariant Functions (DIF) method, which uses causal analysis and an encoder-decoder hypernetwork to learn the intrinsic dynamics of ODE systems across varied environments, ensuring discovery through an information-based principle and demonstrating effectiveness in uncovering fundamental laws.

Authors:Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Zachary Yahn, Ling Liu
Title: Multi-Agent Reinforcement Learning with Focal Diversity Optimization
Abstract:
The advancement of Large Language Models (LLMs) and their finetuning strategies has triggered the renewed interests in multi-agent reinforcement learning. In this paper, we introduce a focal diversity-optimized multi-agent reinforcement learning approach, coined as MARL-Focal, with three unique characteristics. First, we develop an agent-fusion framework for encouraging multiple LLM based agents to collaborate in producing the final inference output for each LLM query. Second, we develop a focal-diversity optimized agent selection algorithm that can choose a small subset of the available agents based on how well they can complement one another to generate the query output. Finally, we design a conflict-resolution method to detect output inconsistency among multiple agents and produce our MARL-Focal output through reward-aware and policy-adaptive inference fusion. Extensive evaluations on five benchmarks show that MARL-Focal is cost-efficient and adversarial-robust. Our multi-agent fusion model achieves performance improvement of 5.51\% compared to the best individual LLM-agent and offers stronger robustness over the TruthfulQA benchmark. Code is available at https://github.com/sftekin/rl-focal
中文摘要:本文提出的MARL-Focal方法通过多智能体协作框架、聚焦多样性优化选择机制和冲突解决方案,显著提升大语言模型的性能表现与抗干扰能力,在多个基准测试中展现出优越效果。
English Summary: This paper introduces MARL-Focal, a diversity-optimized multi-agent reinforcement learning approach that enhances LLM performance through agent collaboration, intelligent selection, and conflict resolution, achieving significant efficiency and robustness gains.

Authors:Soham Deshmukh, Shuo Han, Rita Singh, Bhiksha Raj
Title: ADIFF: Explaining audio difference using natural language
Abstract:
Understanding and explaining differences between audio recordings is crucial for fields like audio forensics, quality assessment, and audio generation. This involves identifying and describing audio events, acoustic scenes, signal characteristics, and their emotional impact on listeners. This paper stands out as the first work to comprehensively study the task of explaining audio differences and then propose benchmark, baselines for the task. First, we present two new datasets for audio difference explanation derived from the AudioCaps and Clotho audio captioning datasets. Using Large Language Models (LLMs), we generate three levels of difference explanations: (1) concise descriptions of audio events and objects, (2) brief sentences about audio events, acoustic scenes, and signal properties, and (3) comprehensive explanations that include semantics and listener emotions. For the baseline, we use prefix tuning where audio embeddings from two audio files are used to prompt a frozen language model. Our empirical analysis and ablation studies reveal that the naive baseline struggles to distinguish perceptually similar sounds and generate detailed tier 3 explanations. To address these limitations, we propose ADIFF, which introduces a cross-projection module, position captioning, and a three-step training process to enhance the model's ability to produce detailed explanations. We evaluate our model using objective metrics and human evaluation and show our model enhancements lead to significant improvements in performance over naive baseline and SoTA Audio-Language Model (ALM) Qwen Audio. Lastly, we conduct multiple ablation studies to study the effects of cross-projection, language model parameters, position captioning, third stage fine-tuning, and present our findings. Our benchmarks, findings, and strong baseline pave the way for nuanced and human-like explanations of audio differences.
本文首次系统研究音频差异解释任务,提出新数据集和ADIFF模型,通过生成音频事件、场景及情感影响的详细描述,显著优于现有基准模型和最先进音频语言模型。
This paper introduces the first comprehensive study on explaining audio differences, proposing new datasets and a model called ADIFF that significantly outperforms naive baselines and state-of-the-art models by generating detailed descriptions of audio events, scenes, and emotional impacts.

Authors:Imad Eddine Marouf, Enzo Tartaglione, Stephane Lathuiliere, Joost van de Weijer
Title: Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering
Abstract:
Continual Learning in Visual Question Answering (VQACL) requires models to acquire new visual-linguistic skills (plasticity) while preserving previously learned knowledge (stability). The inherent multimodality of VQACL exacerbates this challenge, as models must balance stability across visual and textual domains while adapting to novel objects and reasoning tasks. Existing methods, primarily designed for unimodal settings, often fall short in addressing this dual requirement. In this work, we present QUestion-only replay with Attention Distillation (QUAD), a novel approach for VQACL that leverages only past task questions for regularization. By eliminating the need to store visual data, QUAD not only reduces memory overhead, but also alleviates privacy concerns. Our method introduces a Question-only Replay mechanism that selectively reuses prior task questions to counteract overfitting to the answer space of the current task, addressing the problem out of answer set. Complementing this, we propose Attention Consistency Distillation to enforce both intra-modal and inter-modal attention consistency across tasks, preserving essential visual-linguistic associations. Extensive experiments on VQAv2 and NExT-QA demonstrate that QUAD significantly outperforms state-of-the-art methods, achieving robust performance in continual VQA. Code is available at: https://github.com/IemProg/QUAD.
中文:QUAD是一种新颖的视觉问答持续学习方法,仅通过问题回放和注意力蒸馏机制,在不存储视觉数据的情况下保持跨任务稳定性,在基准数据集上实现了卓越性能。
English: QUAD is a novel continual learning method for Visual Question Answering that uses question-only replay and attention distillation to maintain stability across tasks without storing visual data, achieving superior performance on benchmark datasets.

Authors:Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan
Title: KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
Abstract:
KV cache quantization can improve Large Language Models (LLMs) inference throughput and latency in long contexts and large batch-size scenarios while preserving LLMs effectiveness. However, current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. Therefore, we theoretically analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why key cache is generally more important than value cache for quantization error reduction. We further propose a simple yet effective framework KVTuner to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline searched configurations during online inference. To reduce the computational cost of offline calibration, we utilize the intra-layer KV precision pair pruning and inter-layer clustering to reduce the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 4.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 21.25\% compared with KIVI-KV8 quantization over various context lengths. Our code and searched configurations are available at https://github.com/cmd2001/KVTuner.
中文摘要:KV缓存量化可提升大语言模型在长上下文和大批量场景下的推理效率,而KVTuner框架通过自适应优化分层量化精度,实现了近乎无损的性能和显著的吞吐量提升。
English Summary: KV cache quantization enhances LLM inference efficiency in long contexts and large batches, and the proposed KVTuner framework adaptively optimizes layer-wise quantization precision to achieve nearly lossless performance with significant throughput improvements.

Authors:Edgar Ramirez-Sanchez, Catherine Tang, Yaosheng Xu, Nrithya Renganathan, Vindula Jayawardana, Zhengbing He, Cathy Wu
Title: NeuralMOVES: A lightweight and microscopic vehicle emission estimation model based on reverse engineering and surrogate learning
Abstract:
The transportation sector significantly contributes to greenhouse gas emissions, necessitating accurate emission models to guide mitigation strategies. Despite its field validation and certification, the industry-standard Motor Vehicle Emission Simulator (MOVES) faces challenges related to complexity in usage, high computational demands, and its unsuitability for microscopic real-time applications. To address these limitations, we present NeuralMOVES, a comprehensive suite of high-performance, lightweight surrogate models for vehicle CO2 emissions. Developed based on reverse engineering and Neural Networks, NeuralMOVES achieves a remarkable 6.013% Mean Average Percentage Error relative to MOVES across extensive tests spanning over two million scenarios with diverse trajectories and the factors regarding environments and vehicles. NeuralMOVES is only 2.4 MB, largely condensing the original MOVES and the reverse engineered MOVES into a compact representation, while maintaining high accuracy. Therefore, NeuralMOVES significantly enhances accessibility while maintaining the accuracy of MOVES, simplifying CO2 evaluation for transportation analyses and enabling real-time, microscopic applications across diverse scenarios without reliance on complex software or extensive computational resources. Moreover, this paper provides, for the first time, a framework for reverse engineering industrial-grade software tailored specifically to transportation scenarios, going beyond MOVES. The surrogate models are available at https://github.com/edgar-rs/neuralMOVES.
中文: NeuralMOVES 是一种轻量级高精度车辆二氧化碳排放替代模型,解决了行业标准MOVES模拟器的复杂性和高计算需求问题,仅需2.4 MB存储空间且误差率为6.013%,可实现实时微观应用。
English: NeuralMOVES is a lightweight, high-accuracy surrogate model for vehicle CO₂ emissions that overcomes the complexity and computational demands of the industry-standard MOVES simulator, enabling real-time microscopic applications with only 2.4 MB size and 6.013% error rate.

Authors:Zehua Pei, Lancheng Zou, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
Title: CMoE: Converting Mixture-of-Experts from Dense to Accelerate LLM Inference
Abstract:
Scaling large language models (LLMs) improves performance but dramatically increases inference costs. The feed-forward network (FFN), consuming approximately 70\% of inference compute, represents a critical bottleneck, particularly in large batch size scenarios. While mixture-of-experts (MoE) architectures leverage activation sparsity for efficiency, converting existing dense models to MoEs traditionally requires resource-intensive continual pre-training. We present CMoE, a framework that rapidly transforms dense LLMs into MoEs without training. The key innovation lies in analyzing FFN neuron activations to partition them into shared (always active) and routed experts. Routed neurons are clustered using a balanced assignment algorithm, and a differentiable router is constructed analytically from activation statistics, enabling immediate deployment or optional lightweight fine-tuning. Experiments demonstrate that, with activation ratio of 75\%, it achieves remarkable results, delivering lossless precision in terms of perplexity while still maintaining a 5\% acceleration. Further experiments reveal that a CMoE configuration activating just 25\% of parameters reduces end-to-end latency by 1.5x while preserving usable perplexity without additional training. Moreover, a brief LoRA fine-tuning process (requiring only 1 hour and 2,000 samples) successfully recovers over 76\% of the dense model's downstream accuracy. By effectively balancing performance and efficiency, CMoE offers a viable path forward for deploying LLMs in real-world scenarios where computational resources are limited. We make our code publicly available at https://github.com/JarvisPei/CMoE.
中文: CMoE是一种无需训练即可将稠密大语言模型转换为混合专家模型的高效框架,通过分析神经元激活实现无损困惑度并提速5%,在仅激活25%参数时延迟降低1.5倍,且通过轻量微调可恢复76%下游任务准确率。
English: CMoE is a training-free framework that converts dense LLMs into efficient mixture-of-expert models by analyzing neuron activations, achieving lossless perplexity with 5% acceleration and enabling 1.5x latency reduction while preserving performance through optional lightweight fine-tuning.

Authors:Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
Title: MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot
Abstract:
Retrieval-augmented generation (RAG) is a well-suited technique for retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a key module of the healthcare copilot, helping reduce misdiagnosis for healthcare practitioners and patients. However, the diagnostic accuracy and specificity of existing heuristic-based RAG models used in the medical domain are inadequate, particularly for diseases with similar manifestations. This paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited reasoning for the medical domain that retrieves diagnosis and treatment recommendations based on manifestations. MedRAG systematically constructs a comprehensive four-tier hierarchical diagnostic KG encompassing critical diagnostic differences of various diseases. These differences are dynamically integrated with similar EHRs retrieved from an EHR database, and reasoned within a large language model. This process enables more accurate and specific decision support, while also proactively providing follow-up questions to enhance personalized medical decision-making. MedRAG is evaluated on both a public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD) collected from Tan Tock Seng Hospital, and its performance is compared against various existing RAG methods. Experimental results show that, leveraging the information integration and relational abilities of the KG, our MedRAG provides more specific diagnostic insights and outperforms state-of-the-art models in reducing misdiagnosis rates. Our code will be available at https://github.com/SNOWTEAM2023/MedRAG
中文:MedRAG通过知识图谱推理增强检索生成技术,结合分层诊断知识与电子健康记录,显著提升医疗诊断的准确性和特异性,在降低误诊率方面优于现有方法。
English: MedRAG enhances retrieval-augmented generation with knowledge graph reasoning to improve diagnostic accuracy and specificity in healthcare by integrating hierarchical diagnostic knowledge with electronic health records, outperforming existing methods in reducing misdiagnosis.

Authors:Long Chen, Xiaotian Song, Andy Song, BaDong Chen, Jiancheng Lv, Yanan Sun
Title: FAS: Fast ANN-SNN Conversion for Spiking Large Language Models
Abstract:
Spiking Large Language Models have been shown as a good alternative to LLMs in various scenarios. Existing methods for creating Spiking LLMs, i.e., direct training and ANN-SNN conversion, often suffer from performance degradation and relatively high computational costs. To address these issues, we propose a novel Fast ANN-SNN conversion strategy (FAS) that transforms LLMs into spiking LLMs in two stages. The first stage employs a full-parameter fine-tuning of pre-trained models, so it does not need any direct training from scratch. The second stage introduces a coarse-to-fine calibration method to reduce conversion errors and improve accuracy. Experiments on both language and vision-language tasks across four different scales of LLMs demonstrate that FAS can achieve state-of-the-art performance yet with significantly reduced inference latency and computational costs. Notably, FAS only takes eight timesteps to achieve an accuracy of 3\% higher than that of the OPT-7B model, while reducing energy consumption by 96.63\%. The source code is available at https://github.com/lc783/FAS
中文: 本文提出了一种新颖的快速人工神经网络-脉冲神经网络转换策略(FAS),通过两阶段微调和校准将大语言模型转换为脉冲神经网络,在显著降低延迟和能耗的同时实现了最先进的性能。
English: This paper introduces a novel Fast ANN-SNN conversion strategy (FAS) that transforms large language models into spiking neural networks through two-stage fine-tuning and calibration, achieving state-of-the-art performance with significantly reduced latency and energy consumption.

Authors:Siru Zhong, Weilin Ruan, Ming Jin, Huan Li, Qingsong Wen, Yuxuan Liang
Title: Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting
Abstract:
Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose \method, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for final prediction. Extensive experiments demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting. Code is available at https://github.com/CityMind-Lab/ICML25-TimeVLM.
Chinese: Time-VLM提出了一种新颖的多模态框架,利用预训练的视觉语言模型整合时序、视觉和文本数据以提升时间序列预测性能,在少样本和零样本场景下表现尤为卓越。
English: Time-VLM introduces a novel multimodal framework that integrates temporal, visual, and textual data using pre-trained vision-language models to enhance time series forecasting, achieving superior performance especially in few-shot and zero-shot settings.

Authors:Shue Shiinoki, Ryo Koshihara, Hayato Motegi, Masumi Morishige
Title: Overcoming Vision Language Model Challenges in Diagram Understanding: A Proof-of-Concept with XML-Driven Large Language Models Solutions
Abstract:
Diagrams play a crucial role in visually conveying complex relationships and processes within business documentation. Despite recent advances in Vision-Language Models (VLMs) for various image understanding tasks, accurately identifying and extracting the structures and relationships depicted in diagrams continues to pose significant challenges. This study addresses these challenges by proposing a text-driven approach that bypasses reliance on VLMs' visual recognition capabilities. Instead, it utilizes the editable source files--such as xlsx, pptx or docx--where diagram elements (e.g., shapes, lines, annotations) are preserved as textual metadata. In our proof-of-concept, we extracted diagram information from xlsx-based system design documents and transformed the extracted shape data into textual input for Large Language Models (LLMs). This approach allowed the LLM to analyze relationships and generate responses to business-oriented questions without the bottleneck of image-based processing. Experimental comparisons with a VLM-based method demonstrated that the proposed text-driven framework yielded more accurate answers for questions requiring detailed comprehension of diagram structures.The results obtained in this study are not limited to the tested .xlsx files but can also be extended to diagrams in other documents with source files, such as Office pptx and docx formats. These findings highlight the feasibility of circumventing VLM constraints through direct textual extraction from original source files. By enabling robust diagram understanding through LLMs, our method offers a promising path toward enhanced workflow efficiency and information analysis in real-world business scenarios.
中文: 本研究提出一种文本驱动方法,直接从xlsx、pptx或docx等可编辑源文件中提取图表信息,使大语言模型能够分析关系并回答业务问题,相比视觉语言模型,该方法通过规避视觉识别限制实现了更精准的解析。
English: This study introduces a text-driven method that extracts diagram information directly from editable source files like xlsx, pptx, or docx, enabling Large Language Models to analyze relationships and answer business questions more accurately than Vision-Language Models by bypassing visual recognition limitations.

Authors:Royson Lee, Minyoung Kim, Fady Rezk, Rui Li, Stylianos I. Venieris, Timothy Hospedales
Title: FedP$^2$EFT: Federated Learning to Personalize PEFT for Multilingual LLMs
Abstract:
Federated learning (FL) has enabled the training of multilingual large language models (LLMs) on diverse and decentralized multilingual data, especially on low-resource languages. To improve client-specific performance, personalization via the use of parameter-efficient fine-tuning (PEFT) modules such as LoRA is common. This involves a personalization strategy (PS), such as the design of the PEFT adapter structures (e.g., in which layers to add LoRAs and what ranks) and choice of hyperparameters (e.g., learning rates) for fine-tuning. Instead of manual PS configuration, we propose FedP$^2$EFT, a federated learning-to-personalize method for multilingual LLMs in cross-device FL settings. Unlike most existing PEFT structure selection methods, which are prone to overfitting low-data regimes, FedP$^2$EFT collaboratively learns the optimal personalized PEFT structure for each client via Bayesian sparse rank selection. Evaluations on both simulated and real-world multilingual FL benchmarks demonstrate that FedP$^2$EFT largely outperforms existing personalized fine-tuning methods, while complementing other existing FL methods. Code is available at https://github.com/SamsungLabs/fedp2eft.
中文:FedP$^2$EFT提出了一种联邦学习方法,通过贝叶斯稀疏秩选择协同优化多语言大模型的个性化高效参数微调结构,在跨设备联邦学习基准测试中显著优于现有方法。
English: FedP$^2$EFT introduces a federated learning method that collaboratively optimizes personalized parameter-efficient fine-tuning structures for multilingual LLMs using Bayesian sparse rank selection, significantly outperforming existing approaches on cross-device FL benchmarks.

Authors:Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, Emma Pierson
Title: Sparse Autoencoders for Hypothesis Generation
Abstract:
We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
HypotheSAEs 是一种计算效率高的方法,通过稀疏自编码器和大型语言模型生成可解释的文本特征假设,用于预测目标变量,在合成和真实数据集上均比基线方法更准确地识别假设并产生更多新发现。
HypotheSAEs is a computationally efficient method that uses sparse autoencoders and LLMs to generate interpretable hypotheses about text features predicting target variables, outperforming baselines in accuracy and discovery on both synthetic and real datasets.

Authors:Bosung Kim, Kyuhwan Lee, Isu Jeong, Jungmin Cheon, Yeojin Lee, Seulki Lee
Title: On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices
Abstract:
We present On-device Sora, the first model training-free solution for diffusion-based on-device text-to-video generation that operates efficiently on smartphone-grade devices. To address the challenges of diffusion-based text-to-video generation on computation- and memory-limited mobile devices, the proposed On-device Sora applies three novel techniques to pre-trained video generative models. First, Linear Proportional Leap (LPL) reduces the excessive denoising steps required in video diffusion through an efficient leap-based approach. Second, Temporal Dimension Token Merging (TDTM) minimizes intensive token-processing computation in attention layers by merging consecutive tokens along the temporal dimension. Third, Concurrent Inference with Dynamic Loading (CI-DL) dynamically partitions large models into smaller blocks and loads them into memory for concurrent model inference, effectively addressing the challenges of limited device memory. We implement On-device Sora on the iPhone 15 Pro, and the experimental evaluations show that it is capable of generating high-quality videos on the device, comparable to those produced by high-end GPUs. These results show that On-device Sora enables efficient and high-quality video generation on resource-constrained mobile devices. We envision the proposed On-device Sora as a significant first step toward democratizing state-of-the-art generative technologies, enabling video generation on commodity mobile and embedded devices without resource-intensive re-training for model optimization (compression). The code implementation is available at a GitHub repository(https://github.com/eai-lab/On-device-Sora).
中文: On-device Sora 是一种无需训练的解决方案,通过线性比例跳跃、时间维度令牌合并和动态加载并发推理三项新技术,在智能手机上实现高效文本到视频生成,其生成质量可与高端GPU相媲美。
English: On-device Sora is a training-free solution that enables efficient text-to-video generation on smartphones through three novel techniques—Linear Proportional Leap, Temporal Dimension Token Merging, and Concurrent Inference with Dynamic Loading—producing high-quality videos comparable to those from high-end GPUs.

Authors:Yongchao Chen, Yilun Hao, Yueying Liu, Yang Zhang, Chuchu Fan
Title: CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance
Abstract:
Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark SymBench comprising 37 symbolic tasks with adjustable complexity and also synthesize datasets of 12k multi-turn guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with a newly designed multi-turn supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the existing best LLM OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8) across all 37 tasks (28 seen, 9 unseen). Trained for GPT-4o, CodeSteer demonstrates superior generalizability, providing an average 41.8 performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks. Models, Datasets, and Codes are available at https://github.com/yongchao98/CodeSteer-v1.0 and https://huggingface.co/yongchao98.
中文: CodeSteer是一种创新方法,能有效引导大型语言模型在文本推理与代码生成间切换,显著提升其在各类任务中的符号计算性能。
English: CodeSteer is an innovative method that enhances LLMs' ability to switch between textual reasoning and code generation, significantly boosting their symbolic computing performance across diverse tasks.

Authors:Juyun Wee, Minjae Park, Jaeho Lee
Title: Prompt-based Depth Pruning of Large Language Models
Abstract:
Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent -- a block that is crucial for a task can be removed without degrading the accuracy on another task. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference language models, and achieves better on-task performance than static depth pruning baselines.
Chinese: 深度剪枝通过移除不太重要的Transformer模块来降低大型语言模型的推理成本,而新提出的PuDDing方法根据输入提示动态调整模块剪枝策略,相比静态方法在任务特定性能和效率上表现更优。
English: Depth pruning reduces large language model inference costs by removing less important transformer blocks, but a new dynamic method called PuDDing adapts block removal based on input prompts for better task-specific performance and efficiency than static approaches.

Authors:Saydul Akbar Murad, Ashim Dahal, Nick Rahimi
Title: Multi-Lingual Cyber Threat Detection in Tweets/X Using ML, DL, and LLM: A Comparative Analysis
Abstract:
Cyber threat detection has become an important area of focus in today's digital age due to the growing spread of fake information and harmful content on social media platforms such as Twitter (now 'X'). These cyber threats, often disguised within tweets, pose significant risks to individuals, communities, and even nations, emphasizing the need for effective detection systems. While previous research has explored tweet-based threats, much of the work is limited to specific languages, domains, or locations, or relies on single-model approaches, reducing their applicability to diverse real-world scenarios. To address these gaps, our study focuses on multi-lingual tweet cyber threat detection using a variety of advanced models. The research was conducted in three stages: (1) We collected and labeled tweet datasets in four languages English, Chinese, Russian, and Arabic employing both manual and polarity-based labeling methods to ensure high-quality annotations. (2) Each dataset was analyzed individually using machine learning (ML) and deep learning (DL) models to assess their performance on distinct languages. (3) Finally, we combined all four datasets into a single multi-lingual dataset and applied DL and large language model (LLM) architectures to evaluate their efficacy in identifying cyber threats across various languages. Our results show that among machine learning models, Random Forest (RF) attained the highest performance; however, the Bi-LSTM architecture consistently surpassed other DL and LLM architectures across all datasets. These findings underline the effectiveness of Bi-LSTM in multilingual cyber threat detection. The code for this paper can be found at this link: https://github.com/Mmurrad/Tweet-Data-Classification.git.
中文: 本研究通过结合多种机器学习与深度学习模型,开发了针对多语言推文的网络威胁检测方法,其中双向长短期记忆网络(Bi-LSTM)在所有语言数据集中均表现出最优性能。
English: This study addresses the limitations of previous cyber threat detection methods on Twitter by developing a multilingual approach using machine learning and deep learning models, with Bi-LSTM emerging as the most effective across diverse languages.

Authors:Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
Title: Ola: Pushing the Frontiers of Omni-Modal Language Model
Abstract:
Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, there is still a notable lag behind specialized single-modality models in performance. In this paper, we present Ola, an Omni-modal Language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts, pushing the frontiers of the omni-modal language model to a large extent. We conduct a comprehensive exploration of architectural design, data curation, and training strategies essential for building a robust omni-modal model. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements over mainstream baselines. Moreover, we rethink inter-modal relationships during omni-modal training, emphasizing cross-modal alignment with video as a central bridge, and propose a progressive training pipeline that begins with the most distinct modalities and gradually moves towards closer modality alignment. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.
中文摘要:本文介绍了Ola全模态语言模型,通过架构改进和渐进式训练方法,在图像、视频和音频理解方面实现了与专业模型相媲美的性能,并完全开源以推动该领域未来发展。
English summary: The paper introduces Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding through architectural improvements and a progressive training approach, while being fully open-sourced to advance future research.

Authors:Yiming Huang, Tolga Birdal
Title: HOG-Diff: Higher-Order Guided Diffusion for Graph Generation
Abstract:
Graph generation is a critical yet challenging task as empirical analyses require a deep understanding of complex, non-Euclidean structures. Although diffusion models have recently made significant achievements in graph generation, these models typically adapt from the frameworks designed for image generation, making them ill-suited for capturing the topological properties of graphs. In this work, we propose a novel Higher-order Guided Diffusion (HOG-Diff) model that follows a coarse-to-fine generation curriculum and is guided by higher-order information, enabling the progressive generation of plausible graphs with inherent topological structures. We further prove that our model exhibits a stronger theoretical guarantee than classical diffusion frameworks. Extensive experiments on both molecular and generic graph generation tasks demonstrate that our method consistently outperforms or remains competitive with state-of-the-art baselines. Our code is available at https://github.com/Yiminghh/HOG-Diff.
中文: 该摘要提出HOG-Diff,一种高阶引导扩散框架,通过扩散桥实现从粗到细的生成过程,在分子和通用图生成任务中展现出优于或媲美现有先进方法的性能。
English: The abstract introduces HOG-Diff, a higher-order guided diffusion framework that generates graphs with inherent topological structures, demonstrating superior theoretical guarantees and competitive performance in molecular and generic graph generation tasks.

Authors:Yiming Huang, Tolga Birdal
Title: HOG-Diff: Higher-Order Guided Diffusion for Graph Generation
Abstract:
Graph generation is a critical yet challenging task as empirical analyses require a deep understanding of complex, non-Euclidean structures. Diffusion models have recently made significant achievements in graph generation, but these models are typically adapted from image generation frameworks and overlook inherent higher-order topology, leaving them ill-suited for capturing the topological properties of graphs. In this work, we propose Higher-order Guided Diffusion (HOG-Diff), a principled framework that progressively generates plausible graphs with inherent topological structures. HOG-Diff follows a coarse-to-fine generation curriculum guided by higher-order topology and implemented via diffusion bridges. We further prove that our model exhibits a stronger theoretical guarantee than classical diffusion frameworks. Extensive experiments on both molecular and generic graph generation tasks demonstrate that our method consistently outperforms or remains competitive with state-of-the-art baselines. Our code is available at https://github.com/Yiminghh/HOG-Diff.
中文: 该摘要提出HOG-Diff,一种高阶引导扩散框架,通过扩散桥实现从粗到细的生成过程,在分子和通用图生成任务中展现出优于或媲美现有先进方法的性能。
English: The abstract introduces HOG-Diff, a higher-order guided diffusion framework that generates graphs with inherent topological structures, demonstrating superior theoretical guarantees and competitive performance in molecular and generic graph generation tasks.

Authors:Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, Bryon Aragam
Title: ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization
Abstract:
Recent research has leveraged large language model multi-agent systems for complex problem-solving while trying to reduce the manual effort required to build them, driving the development of automated agent workflow optimization methods. However, existing methods remain inflexible due to representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. We address these challenges with ScoreFlow, a simple yet high-performance framework that leverages efficient gradient-based optimization in a continuous space. ScoreFlow incorporates Score-DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. Across six benchmarks spanning question answering, coding, and mathematical reasoning, ScoreFlow achieves an 8.2% improvement over existing baselines. Moreover, it empowers smaller models to outperform larger ones with lower inference costs. Project: https://github.com/Gen-Verse/ScoreFlow
中文:ScoreFlow采用基于梯度的优化框架和创新的Score-DPO方法,显著提升了自动化智能体工作流程的效率,在多项基准测试中性能提高8.2%,并让小模型以更低成本超越大模型表现。
English: ScoreFlow introduces a gradient-based optimization framework with a novel Score-DPO method to enhance automated agent workflow efficiency, achieving an 8.2% performance boost across benchmarks and enabling smaller models to outperform larger ones cost-effectively.

Authors:Yuanye Liu, Jiahang Xu, Li Lyna Zhang, Qi Chen, Xuan Feng, Yang Chen, Zhongxin Guo, Yuqing Yang, Peng Cheng
Title: Beyond Prompt Content: Enhancing LLM Performance via Content-Format Integrated Prompt Optimization
Abstract:
Large Language Models (LLMs) have shown significant capability across various tasks, with their real-world effectiveness often driven by prompt design. While recent research has focused on optimizing prompt content, the role of prompt formatting, a critical but often overlooked dimension, has received limited systematic investigation. In this paper, we introduce Content-Format Integrated Prompt Optimization (CFPO), an innovative methodology that jointly optimizes both prompt content and formatting through an iterative refinement process. CFPO leverages natural language mutations to explore content variations and employs a dynamic format exploration strategy that systematically evaluates diverse format options. Our extensive evaluations across multiple tasks and open-source LLMs demonstrate that CFPO demonstrates measurable performance improvements compared to content-only optimization methods. This highlights the importance of integrated content-format optimization and offers a practical, model-agnostic approach to enhancing LLM performance. Code is available at https://github.com/HenryLau7/CFPO.
Chinese: CFPO是一种通过迭代优化同时改进提示内容和格式的新方法,在多项任务和开源大语言模型上相比仅优化内容的方法均展现出显著性能提升。
English: CFPO is a novel method that jointly optimizes both prompt content and formatting through iterative refinement, demonstrating measurable performance improvements over content-only optimization across various tasks and LLMs.

Authors:Yi Yu, Botao Ren, Peiyuan Zhang, Mingxin Liu, Junwei Luo, Shaofeng Zhang, Feipeng Da, Junchi Yan, Xue Yang
Title: Point2RBox-v2: Rethinking Point-supervised Oriented Object Detection with Spatial Layout Among Instances
Abstract:
With the rapidly increasing demand for oriented object detection (OOD), recent research involving weakly-supervised detectors for learning OOD from point annotations has gained great attention. In this paper, we rethink this challenging task setting with the layout among instances and present Point2RBox-v2. At the core are three principles: 1) Gaussian overlap loss. It learns an upper bound for each instance by treating objects as 2D Gaussian distributions and minimizing their overlap. 2) Voronoi watershed loss. It learns a lower bound for each instance through watershed on Voronoi tessellation. 3) Consistency loss. It learns the size/rotation variation between two output sets with respect to an input image and its augmented view. Supplemented by a few devised techniques, e.g. edge loss and copy-paste, the detector is further enhanced. To our best knowledge, Point2RBox-v2 is the first approach to explore the spatial layout among instances for learning point-supervised OOD. Our solution is elegant and lightweight, yet it is expected to give a competitive performance especially in densely packed scenes: 62.61%/86.15%/34.71% on DOTA/HRSC/FAIR1M. Code is available at https://github.com/VisionXLab/point2rbox-v2.
Chinese: Point2RBox-v2提出了一种基于点标注的弱监督旋转目标检测新方法,通过高斯重叠、Voronoi分水岭和一致性损失三大核心原则学习实例间的空间布局,在密集场景中实现了优越性能。
English: Point2RBox-v2 introduces a novel weakly-supervised approach for oriented object detection using point annotations, employing three key principles—Gaussian overlap, Voronoi watershed, and consistency losses—to effectively learn instance layouts and achieve competitive results in densely packed scenes.

Authors:Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Andrew D. Bagdanov
Title: Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
Abstract:
Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multi-modal models is highly suboptimal for intra-modal tasks like image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal contrastive loss that does not enforce any intra-modal constraints, leading to what we call intra-modal misalignment. To demonstrate this, we leverage two optimization-based modality inversion techniques that map representations from their input modality to the complementary one without any need for auxiliary data or additional trained adapters. We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly improves performance with respect to intra-modal baselines on more than fifteen datasets. Additionally, we demonstrate that approaching a native inter-modal task (e.g. zero-shot image classification) intra-modally decreases performance, further validating our findings. Finally, we show that incorporating an intra-modal term in the pre-training objective or narrowing the modality gap between the text and image feature embedding spaces helps reduce the intra-modal misalignment. The code is publicly available at: https://github.com/miccunifi/Cross-the-Gap.
中文摘要:本研究发现,由于模态内不对齐问题,在图像检索等模态内任务中单独使用CLIP的文本或图像编码器效果欠佳,但通过跨模态映射方法能显著提升性能。
English Summary: This study reveals that using CLIP's individual text or image encoders for intra-modal tasks is suboptimal due to intra-modal misalignment, but performance significantly improves by adopting inter-modal approaches through modality inversion techniques.

Authors:Shaopeng Fu, Liang Ding, Jingfeng Zhang, Di Wang
Title: Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence
Abstract:
Jailbreak attacks against large language models (LLMs) aim to induce harmful behaviors in LLMs through carefully crafted adversarial prompts. To mitigate attacks, one way is to perform adversarial training (AT)-based alignment, i.e., training LLMs on some of the most adversarial prompts to help them learn how to behave safely under attacks. During AT, the length of adversarial prompts plays a critical role in the robustness of aligned LLMs. While long-length adversarial prompts during AT might lead to strong LLM robustness, their synthesis however is very resource-consuming, which may limit the application of LLM AT. This paper focuses on adversarial suffix jailbreak attacks and unveils that to defend against a jailbreak attack with an adversarial suffix of length $Θ(M)$, it is enough to align LLMs on prompts with adversarial suffixes of length $Θ(\sqrt{M})$. Theoretically, we analyze the adversarial in-context learning of linear transformers on linear regression tasks and prove a robust generalization bound for trained transformers. The bound depends on the term $Θ(\sqrt{M_{\text{test}}}/M_{\text{train}})$, where $M_{\text{train}}$ and $M_{\text{test}}$ are the numbers of adversarially perturbed in-context samples during training and testing. Empirically, we conduct AT on popular open-source LLMs and evaluate their robustness against jailbreak attacks of different adversarial suffix lengths. Results confirm a positive correlation between the attack success rate and the ratio of the square root of the adversarial suffix length during jailbreaking to the length during AT. Our findings show that it is practical to defend against ``long-length'' jailbreak attacks via efficient ``short-length'' AT. The code is available at https://github.com/fshp971/adv-icl.
中文: 本研究通过理论分析和实证验证表明,针对大型语言模型的长对抗后缀越狱攻击,可以通过使用显著缩短的对抗提示进行高效对抗训练来实现有效防御。
English: This study demonstrates that defending against long adversarial suffix jailbreak attacks in large language models can be effectively achieved through efficient adversarial training using significantly shorter adversarial prompts, as confirmed by both theoretical analysis and empirical results.

Authors:Qinhan Yu, Zhiyou Xiao, Binghui Li, Zhengren Wang, Chong Chen, Wentao Zhang
Title: MRAMG-Bench: A Comprehensive Benchmark for Advancing Multimodal Retrieval-Augmented Multimodal Generation
Abstract:
Recent advances in Retrieval-Augmented Generation (RAG) have significantly improved response accuracy and relevance by incorporating external knowledge into Large Language Models (LLMs). However, existing RAG methods primarily focus on generating text-only answers, even in Multimodal Retrieval-Augmented Generation (MRAG) scenarios, where multimodal elements are retrieved to assist in generating text answers. To address this, we introduce the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task, in which we aim to generate multimodal answers that combine both text and images, fully leveraging the multimodal data within a corpus. Despite growing attention to this challenging task, a notable lack of a comprehensive benchmark persists for effectively evaluating its performance. To bridge this gap, we provide MRAMG-Bench, a meticulously curated, human-annotated benchmark comprising 4,346 documents, 14,190 images, and 4,800 QA pairs, distributed across six distinct datasets and spanning three domains: Web, Academia, and Lifestyle. The datasets incorporate diverse difficulty levels and complex multi-image scenarios, providing a robust foundation for evaluating the MRAMG task. To facilitate rigorous evaluation, MRAMG-Bench incorporates a comprehensive suite of both statistical and LLM-based metrics, enabling a thorough analysis of the performance of generative models in the MRAMG task. Additionally, we propose an efficient and flexible multimodal answer generation framework that can leverage LLMs/MLLMs to generate multimodal responses. Our datasets and complete evaluation results for 11 popular generative models are available at https://github.com/MRAMG-Bench/MRAMG.
中文: 本文提出了多模态检索增强多模态生成(MRAMG)任务,旨在生成图文结合的答案,并通过提供包含丰富数据和评估指标的MRAMG-Bench来弥补该领域基准的缺失。
English: This paper introduces the Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) task for generating combined text and image answers, addressing the lack of benchmarks by providing MRAMG-Bench with extensive data and metrics for evaluation.

Authors:Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, Baobao Chang
Title: UltraIF: Advancing Instruction Following from the Wild
Abstract:
Instruction-following made modern large language models (LLMs) helpful assistants. However, the key to taming LLMs on complex instructions remains mysterious, for that there are huge gaps between models trained by open-source community and those trained by leading companies. To bridge the gap, we propose a simple and scalable approach UltraIF for building LLMs that can follow complex instructions with open-source data. UltraIF first decomposes real-world user prompts into simpler queries, constraints, and corresponding evaluation questions for the constraints. Then, we train an UltraComposer to compose constraint-associated prompts with evaluation questions. This prompt composer allows us to synthesize complicated instructions as well as filter responses with evaluation questions. In our experiment, for the first time, we successfully align LLaMA-3.1-8B-Base to catch up with its instruct version on 5 instruction-following benchmarks without any benchmark information, using only 8B model as response generator and evaluator. The aligned model also achieved competitive scores on other benchmarks. Moreover, we also show that UltraIF could further improve LLaMA-3.1-8B-Instruct through self-alignment, motivating broader use cases for the method. Our code will be available at https://github.com/kkk-an/UltraIF.
中文摘要:UltraIF是一种通过将复杂指令分解为简单组件并训练合成器来生成和评估指令的可扩展方法,有效缩小了开源与领先大语言模型之间的性能差距,成功使基础模型在多个基准测试中达到指导版本的同等水平。
English Summary: UltraIF is a scalable method that bridges the performance gap between open-source and leading LLMs by decomposing complex instructions into simpler components and training a composer to synthesize and evaluate them, successfully aligning base models to match their instruct versions on benchmarks.

Authors:Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, Baobao Chang
Title: UltraIF: Advancing Instruction Following from the Wild
Abstract:
Instruction-following made modern large language models (LLMs) helpful assistants. However, the key to taming LLMs on complex instructions remains mysterious, for that there are huge gaps between models trained by open-source community and those trained by leading companies. To bridge the gap, we propose a simple and scalable approach UltraIF for building LLMs that can follow complex instructions with open-source data. UltraIF first decomposes real-world user prompts into simpler queries, constraints, and corresponding evaluation questions for the constraints. Then, we train an UltraComposer to compose constraint-associated prompts with evaluation questions. This prompt composer allows us to synthesize complicated instructions as well as filter responses with evaluation questions. In our experiment, for the first time, we successfully align LLaMA-3.1-8B-Base to catch up with its instruct version on 5 instruction-following benchmarks without any benchmark information, using only 8B model as response generator and evaluator. The aligned model also achieved competitive scores on other benchmarks. Moreover, we also show that UltraIF could further improve LLaMA-3.1-8B-Instruct through self-alignment, motivating broader use cases for the method. Our code is available at https://github.com/kkk-an/UltraIF.
中文摘要:UltraIF是一种通过将复杂指令分解为简单组件并训练合成器来生成和评估指令的可扩展方法,有效缩小了开源与领先大语言模型之间的性能差距,成功使基础模型在多个基准测试中达到指导版本的同等水平。
English Summary: UltraIF is a scalable method that bridges the performance gap between open-source and leading LLMs by decomposing complex instructions into simpler components and training a composer to synthesize and evaluate them, successfully aligning base models to match their instruct versions on benchmarks.

Authors:Ahmed Adnan, Antu Saha, Oscar Chaparro
Title: SPRINT: An Assistant for Issue Report Management
Abstract:
Managing issue reports is essential for the evolution and maintenance of software systems. However, manual issue management tasks such as triaging, prioritizing, localizing, and resolving issues are highly resource-intensive for projects with large codebases and users. To address this challenge, we present SPRINT, a GitHub application that utilizes state-of-the-art deep learning techniques to streamline issue management tasks. SPRINT assists developers by: (i) identifying existing issues similar to newly reported ones, (ii) predicting issue severity, and (iii) suggesting code files that likely require modification to solve the issues. We evaluated SPRINT using existing datasets and methodologies, measuring its predictive performance, and conducted a user study with five professional developers to assess its usability and usefulness. The results show that SPRINT is accurate, usable, and useful, providing evidence of its effectiveness in assisting developers in managing issue reports. SPRINT is an open-source tool available at https://github.com/sea-lab-wm/sprint_issue_report_assistant_tool.
中文: SPRINT是一款基于深度学习的GitHub应用,能通过识别重复问题、预测严重性和推荐相关代码文件来自动化问题管理,经评估和用户研究证明其具有实用价值。
English: SPRINT is a GitHub application that uses deep learning to automate issue management by identifying duplicate issues, predicting severity, and suggesting relevant code files, proving effective through evaluations and user studies.

Authors:Jost Arndt, Utku Isil, Michael Detzel, Wojciech Samek, Jackie Ma
Title: Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs
Abstract:
Many physical processes can be expressed through partial differential equations (PDEs). Real-world measurements of such processes are often collected at irregularly distributed points in space, which can be effectively represented as graphs; however, there are currently only a few existing datasets. Our work aims to make advancements in the field of PDE-modeling accessible to the temporal graph machine learning community, while addressing the data scarcity problem, by creating and utilizing datasets based on PDEs. In this work, we create and use synthetic datasets based on PDEs to support spatio-temporal graph modeling in machine learning for different applications. More precisely, we showcase three equations to model different types of disasters and hazards in the fields of epidemiology, atmospheric particles, and tsunami waves. Further, we show how such created datasets can be used by benchmarking several machine learning models on the epidemiological dataset. Additionally, we show how pre-training on this dataset can improve model performance on real-world epidemiological data. The presented methods enable others to create datasets and benchmarks customized to individual requirements. The source code for our methodology and the three created datasets can be found on https://github.com/github-usr-ano/Temporal_Graph_Data_PDEs.
中文: 本研究通过创建基于三个灾害建模偏微分方程的合成数据集,解决了时空图机器学习领域的数据稀缺问题,并通过基准测试和预训练应用展示了其实用价值。
English: This work addresses data scarcity in PDE-based temporal graph machine learning by creating synthetic datasets from three disaster-modeling equations and demonstrating their utility through benchmarking and pre-training applications.

Authors:Shangkun Sun, Xiaoyu Liang, Bowen Qu, Wei Gao
Title: Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency
Abstract:
The advent of next-generation video generation models like \textit{Sora} poses challenges for AI-generated content (AIGC) video quality assessment (VQA). These models substantially mitigate flickering artifacts prevalent in prior models, enable longer and complex text prompts and generate longer videos with intricate, diverse motion patterns. Conventional VQA methods designed for simple text and basic motion patterns struggle to evaluate these content-rich videos. To this end, we propose \textbf{CRAVE} (\underline{C}ontent-\underline{R}ich \underline{A}IGC \underline{V}ideo \underline{E}valuator), specifically for the evaluation of Sora-era AIGC videos. CRAVE proposes the multi-granularity text-temporal fusion that aligns long-form complex textual semantics with video dynamics. Additionally, CRAVE leverages the hybrid motion-fidelity modeling to assess temporal artifacts. Furthermore, given the straightforward prompts and content in current AIGC VQA datasets, we introduce \textbf{CRAVE-DB}, a benchmark featuring content-rich videos from next-generation models paired with elaborate prompts. Extensive experiments have shown that the proposed CRAVE achieves excellent results on multiple AIGC VQA benchmarks, demonstrating a high degree of alignment with human perception. All data and code will be publicly available at https://github.com/littlespray/CRAVE.
中文: 提出的CRAVE模型通过多粒度文本-时序融合和混合运动-保真度建模,有效解决了评估内容丰富的AIGC视频的挑战,实现了与人类感知的高度一致。
English: The proposed CRAVE model effectively addresses the challenge of assessing content-rich AIGC videos by aligning complex text semantics with video dynamics and modeling temporal artifacts, achieving high alignment with human perception.

Authors:Nikunj Gupta, James Zachary Hare, Rajgopal Kannan, Viktor Prasanna
Title: Deep Meta Coordination Graphs for Multi-agent Reinforcement Learning
Abstract:
This paper presents deep meta coordination graphs (DMCG) for learning cooperative policies in multi-agent reinforcement learning (MARL). Coordination graph formulations encode local interactions and accordingly factorize the joint value function of all agents to improve efficiency in MARL. However, existing approaches rely solely on pairwise relations between agents, which potentially oversimplifies complex multi-agent interactions. DMCG goes beyond these simple direct interactions by also capturing useful higher-order and indirect relationships among agents. It generates novel graph structures accommodating multiple types of interactions and arbitrary lengths of multi-hop connections in coordination graphs to model such interactions. It then employs a graph convolutional network module to learn powerful representations in an end-to-end manner. We demonstrate its effectiveness in multiple coordination problems in MARL where other state-of-the-art methods can suffer from sample inefficiency or fail entirely. All codes can be found here: https://github.com/Nikunj-Gupta/dmcg-marl.
Chinese: 本文提出深度元协调图(DMCG),通过构建新型图结构和图卷积网络来建模智能体间的高阶间接交互,在多智能体强化学习的复杂协作任务中显著优于现有方法。
English: This paper introduces Deep Meta Coordination Graphs (DMCG), which enhance multi-agent reinforcement learning by modeling higher-order and indirect agent interactions through novel graph structures and graph convolutional networks, outperforming existing methods in complex coordination tasks.

Authors:Keonvin Park, Jisu Kim, Jaemin Seo
Title: PINT: Physics-Informed Neural Time Series Models with Applications to Long-term Inference on WeatherBench 2m-Temperature Data
Abstract:
This paper introduces PINT (Physics-Informed Neural Time Series Models), a framework that integrates physical constraints into neural time series models to improve their ability to capture complex dynamics. We apply PINT to the ERA5 WeatherBench dataset, focusing on long-term forecasting of 2m-temperature data. PINT incorporates the Simple Harmonic Oscillator Equation as a physics-informed prior, embedding its periodic dynamics into RNN, LSTM, and GRU architectures. This equation's analytical solutions (sine and cosine functions) facilitate rigorous evaluation of the benefits of incorporating physics-informed constraints. By benchmarking against a linear regression baseline derived from its exact solutions, we quantify the impact of embedding physical principles in data-driven models. Unlike traditional time series models that rely on future observations, PINT is designed for practical forecasting. Using only the first 90 days of observed data, it iteratively predicts the next two years, addressing challenges posed by limited real-time updates. Experiments on the WeatherBench dataset demonstrate PINT's ability to generalize, capture periodic trends, and align with physical principles. This study highlights the potential of physics-informed neural models in bridging machine learning and interpretable climate applications. Our models and datasets are publicly available on GitHub: https://github.com/KV-Park.
中文: 本文提出PINT物理约束神经网络框架,通过将简谐振荡方程嵌入时序模型,仅用90天观测数据即可实现两年期温度预测,有效提升气象应用的物理一致性与泛化能力。
English: This paper presents PINT, a physics-informed neural framework that embeds harmonic oscillator dynamics into time series models to enhance long-term weather forecasting accuracy and physical consistency using limited initial data.

Authors:Yu Yuan, Shizhao Sun, Qi Liu, Jiang Bian
Title: CAD-Editor: A Locate-then-Infill Framework with Automated Training Data Synthesis for Text-Based CAD Editing
Abstract:
Computer Aided Design (CAD) is indispensable across various industries. \emph{Text-based CAD editing}, which automates the modification of CAD models based on textual instructions, holds great potential but remains underexplored. Existing methods primarily focus on design variation generation or text-based CAD generation, either lacking support for text-based control or neglecting existing CAD models as constraints. We introduce \emph{CAD-Editor}, the first framework for text-based CAD editing. To address the challenge of demanding triplet data with accurate correspondence for training, we propose an automated data synthesis pipeline. This pipeline utilizes design variation models to generate pairs of original and edited CAD models and employs Large Vision-Language Models (LVLMs) to summarize their differences into editing instructions. To tackle the composite nature of text-based CAD editing, we propose a locate-then-infill framework that decomposes the task into two focused sub-tasks: locating regions requiring modification and infilling these regions with appropriate edits. Large Language Models (LLMs) serve as the backbone for both sub-tasks, leveraging their capabilities in natural language understanding and CAD knowledge. Experiments show that CAD-Editor achieves superior performance both quantitatively and qualitatively. The code is available at \url {https://github.com/microsoft/CAD-Editor}.
中文摘要:CAD-Editor是首个基于文本的CAD编辑框架,通过自动化数据合成流程和结合大型语言模型的"定位-填充"方法,实现了通过文本指令修改CAD模型的卓越性能。
English Summary: CAD-Editor is the first framework for text-based CAD editing, using an automated data synthesis pipeline and a locate-then-infill approach with LLMs to achieve superior performance in modifying CAD models through textual instructions.

Authors:Longquan Jiang, Junbo Huang, Cedric Möller, Ricardo Usbeck
Title: Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering
Abstract:
Most existing Knowledge Graph Question Answering (KGQA) approaches are designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the heterogeneity of the underlying graph schema, topology and assertions, most KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without resource-intensive training data. We present OntoSCPrompt, a novel Large Language Model (LLM)-based KGQA approach with a two-stage architecture that separates semantic parsing from KG-dependent interactions. OntoSCPrompt first generates a SPARQL query structure (including SPARQL keywords such as SELECT, ASK, WHERE and placeholders for missing tokens) and then fills them with KG-specific information. To enhance the understanding of the underlying KG, we present an ontology-guided, hybrid prompt learning strategy that integrates KG ontology into the learning process of hybrid prompts (e.g., discrete and continuous vectors). We also present several task-specific decoding strategies to ensure the correctness and executability of generated SPARQL queries in both stages. Experimental results demonstrate that OntoSCPrompt performs as well as SOTA approaches without retraining on a number of KGQA datasets such as CWQ, WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG Code: \href{https://github.com/LongquanJiang/OntoSCPrompt}{https://github.com/LongquanJiang/OntoSCPrompt}
中文: OntoSCPrompt提出了一种新颖的两阶段大语言模型知识图谱问答方法,通过分离语义解析与图谱交互,并采用本体引导的混合提示学习策略,无需重新训练即可高效泛化到未见过的知识图谱。
English: OntoSCPrompt introduces a novel two-stage LLM-based KGQA approach that separates semantic parsing from KG interactions, enabling efficient generalization to unseen knowledge graphs without retraining through ontology-guided prompts and task-specific decoding strategies.

Authors:Priyank Pathak, Shyam Marjit, Shruti Vyas, Yogesh S Rawat
Title: LR0.FM: Low-Res Benchmark and Improving Robustness for Zero-Shot Classification in Foundation Models
Abstract:
Visual-language foundation Models (FMs) exhibit remarkable zero-shot generalization across diverse tasks, largely attributed to extensive pre-training on largescale datasets. However, their robustness on low-resolution/pixelated (LR) images, a common challenge in real-world scenarios, remains underexplored. We introduce LR0.FM, a comprehensive benchmark evaluating the impact of low resolution on the zero-shot classification performance of 10 FM(s) across 66 backbones and 15 datasets. We propose a novel metric, Weighted Aggregated Robustness, to address the limitations of existing metrics and better evaluate model performance across resolutions and datasets. Our key findings show that: (i) model size positively correlates with robustness to resolution degradation, (ii) pre-training dataset quality is more important than its size, and (iii) fine-tuned and higher resolution models are less robust against LR. Our analysis further reveals that the model makes semantically reasonable predictions at LR, and the lack of fine-grained details in input adversely impacts the model's initial layers more than the deeper layers. We use these insights and introduce a simple strategy, LR-TK0, to enhance the robustness of models without compromising their pre-trained weights. We demonstrate the effectiveness of LR-TK0 for robustness against low-resolution across several datasets and its generalization capability across backbones and other approaches. Code is available at https://github.com/shyammarjit/LR0.FM
中文摘要:视觉语言基础模型在零样本任务中表现出色,但在低分辨率图像上的鲁棒性不足;为此提出的LR0.FM基准和LR-TK0增强策略,可在不修改预训练权重的情况下有效提升模型对低分辨率输入的适应能力。
English Summary: Visual-language foundation models show strong zero-shot capabilities but struggle with low-resolution images, leading to the development of the LR0.FM benchmark and a simple enhancement strategy called LR-TK0 that improves robustness without altering pre-trained weights.

Authors:Yousef Koka, David Selby, Gerrit Großmann, Sebastian Vollmer
Title: CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning
Abstract:
Data preprocessing is a critical yet frequently neglected aspect of machine learning, often paid little attention despite its potentially significant impact on model performance. While automated machine learning pipelines are starting to recognize and integrate data preprocessing into their solutions for classification and regression tasks, this integration is lacking for more specialized tasks like survival or time-to-event models. As a result, survival analysis not only faces the general challenges of data preprocessing but also suffers from the lack of tailored, automated solutions in this area. To address this gap, this paper presents 'CleanSurvival', a reinforcement-learning-based solution for optimizing preprocessing pipelines, extended specifically for survival analysis. The framework can handle continuous and categorical variables, using Q-learning to select which combination of data imputation, outlier detection and feature extraction techniques achieves optimal performance for a Cox, random forest, neural network or user-supplied time-to-event model. The package is available on GitHub: https://github.com/datasciapps/CleanSurvival Experimental benchmarks on real-world datasets show that the Q-learning-based data preprocessing results in superior predictive performance to standard approaches, finding such a model up to 10 times faster than undirected random grid search. Furthermore, a simulation study demonstrates the effectiveness in different types and levels of missingness and noise in the data.
中文: 本文提出"CleanSurvival"框架,基于强化学习为生存分析任务自动化数据预处理,相比传统方法能显著提升模型性能并加快优化速度。
English: This paper introduces "CleanSurvival," a reinforcement learning-based framework that automates data preprocessing for survival analysis, enhancing model performance and efficiency compared to standard methods.

Authors:Minsang Kim, Seungjun Baek
Title: Syntriever: How to Train Your Retriever with Synthetic Data from LLMs
Abstract:
LLMs have boosted progress in many AI applications. Recently, there were attempts to distill the vast knowledge of LLMs into information retrieval systems. Those distillation methods mostly use output probabilities of LLMs which are unavailable in the latest black-box LLMs. We propose Syntriever, a training framework for retrievers using synthetic data from black-box LLMs. Syntriever consists of two stages. Firstly in the distillation stage, we synthesize relevant and plausibly irrelevant passages and augmented queries using chain-of-thoughts for the given queries. LLM is asked to self-verify the synthetic data for possible hallucinations, after which retrievers are trained with a loss designed to cluster the embeddings of relevant passages. Secondly in the alignment stage, we align the retriever with the preferences of LLMs. We propose a preference modeling called partial Plackett-Luce ranking to learn LLM preferences with regularization which prevents the model from deviating excessively from that trained in the distillation stage. Experiments show that Syntriever achieves state-of-the-art performances on benchmark datasets from various domains in nDCG@$K$. The code is available at \href{https://github.com/kmswin1/Syntriever}{https://github.com/kmswin1/Syntriever}.
中文: Syntriever是一种创新的训练框架,通过合成数据和包含蒸馏与对齐的两阶段过程,将黑盒大语言模型的知识提炼到检索系统中,并在多个领域的基准数据集上取得了领先性能。
English: Syntriever is a novel training framework that distills knowledge from black-box LLMs into retrieval systems using synthetic data and a two-stage process involving distillation and alignment, achieving state-of-the-art performance across multiple domains.

Authors:Heyi Zhang, Yule Liu, Xinlei He, Jun Wu, Tianshuo Cong, Xinyi Huang
Title: SoK: Benchmarking Poisoning Attacks and Defenses in Federated Learning
Abstract:
Federated learning (FL) enables collaborative model training while preserving data privacy, but its decentralized nature exposes it to client-side data poisoning attacks (DPAs) and model poisoning attacks (MPAs) that degrade global model performance. While numerous proposed defenses claim substantial effectiveness, their evaluation is typically done in isolation with limited attack strategies, raising concerns about their validity. Additionally, existing studies overlook the mutual effectiveness of defenses against both DPAs and MPAs, causing fragmentation in this field. This paper aims to provide a unified benchmark and analysis of defenses against DPAs and MPAs, clarifying the distinction between these two similar but slightly distinct domains. We present a systematic taxonomy of poisoning attacks and defense strategies, outlining their design, strengths, and limitations. Then, a unified comparative evaluation across FL algorithms and data heterogeneity is conducted to validate their individual and mutual effectiveness and derive key insights for design principles and future research. Along with the analysis, we frame our work to a unified benchmark, FLPoison, with high modularity and scalability to evaluate 15 representative poisoning attacks and 17 defense strategies, facilitating future research in this domain. Code is available at https://github.com/vio1etus/FLPoison.
Chinese Summary: 本文提出了FLPoison统一基准,系统评估了联邦学习中的15种投毒攻击和17种防御策略,以解决该领域研究碎片化问题,并验证它们对数据和模型投毒威胁的共同防御效果。
English Summary: This paper introduces FLPoison, a unified benchmark that systematically evaluates 15 poisoning attacks and 17 defense strategies in federated learning to address fragmented research and validate their effectiveness against both data and model poisoning threats.

Authors:Xiangyu Wu, Feng Yu, Qing-Guo Chen, Yang Yang, Jianfeng Lu
Title: Multi-Label Test-Time Adaptation with Bound Entropy Minimization
Abstract:
Mainstream test-time adaptation (TTA) techniques endeavor to mitigate distribution shifts via entropy minimization for multi-class classification, inherently increasing the probability of the most confident class. However, when encountering multi-label instances, the primary challenge stems from the varying number of labels per image, and prioritizing only the highest probability class inevitably undermines the adaptation of other positive labels. To address this issue, we investigate TTA within multi-label scenario (ML--TTA), developing Bound Entropy Minimization (BEM) objective to simultaneously increase the confidence of multiple top predicted labels. Specifically, to determine the number of labels for each augmented view, we retrieve a paired caption with yielded textual labels for that view. These labels are allocated to both the view and caption, called weak label set and strong label set with the same size k. Following this, the proposed BEM considers the highest top-k predicted labels from view and caption as a single entity, respectively, learning both view and caption prompts concurrently. By binding top-k predicted labels, BEM overcomes the limitation of vanilla entropy minimization, which exclusively optimizes the most confident class. Across the MSCOCO, VOC, and NUSWIDE multi-label datasets, our ML--TTA framework equipped with BEM exhibits superior performance compared to the latest SOTA methods, across various model architectures, prompt initialization, and varying label scenarios. The code is available at https://github.com/Jinx630/ML-TTA.
中文摘要:本研究针对多标签测试时适应场景,提出边界熵最小化方法,通过利用配对标题同时提升多个预测标签的置信度,有效克服了传统熵最小化仅优化最置信类别的局限性。
English Summary: This study introduces a Bound Entropy Minimization (BEM) method for multi-label test-time adaptation, which simultaneously boosts confidence in multiple top predicted labels by leveraging paired captions and overcoming limitations of traditional entropy minimization.

Authors:Chhavi Yadav, Evan Monroe Laufer, Dan Boneh, Kamalika Chaudhuri
Title: ExpProof : Operationalizing Explanations for Confidential Models with ZKPs
Abstract:
In principle, explanations are intended as a way to increase trust in machine learning models and are often obligated by regulations. However, many circumstances where these are demanded are adversarial in nature, meaning the involved parties have misaligned interests and are incentivized to manipulate explanations for their purpose. As a result, explainability methods fail to be operational in such settings despite the demand \cite{bordt2022post}. In this paper, we take a step towards operationalizing explanations in adversarial scenarios with Zero-Knowledge Proofs (ZKPs), a cryptographic primitive. Specifically we explore ZKP-amenable versions of the popular explainability algorithm LIME and evaluate their performance on Neural Networks and Random Forests. Our code is publicly available at https://github.com/emlaufer/ExpProof.
Chinese: 本文针对对抗性场景下可解释性方法的失效问题,提出利用零知识证明(ZKP)实现可解释性的操作化,特别开发了兼容ZKP的LIME算法版本,并在神经网络和随机森林上进行了性能评估。
English: This paper addresses the failure of explainability methods in adversarial settings by proposing Zero-Knowledge Proofs (ZKP) to operationalize explanations, specifically developing ZKP-amenable versions of LIME and evaluating them on Neural Networks and Random Forests.

Authors:Chaoyin She, Ruifang Lu, Danni He, Jiayi Lv, Yadan Lin, Meiqing Cheng, Hui Huang, Fengyu Ye, Lida Chen, Wei Wang, Qinghua Huang
Title: A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma
Abstract:
Hepatocellular carcinoma (HCC), ranking as the third leading cause of cancer-related mortality worldwide, demands urgent improvements in early detection to enhance patient survival. While ultrasound remains the preferred screening modality due to its cost-effectiveness and real-time capabilities, its sensitivity (59%-78%) heavily relies on radiologists' expertise, leading to inconsistent diagnostic outcomes and operational inefficiencies. Recent advancements in AI technology offer promising solutions to bridge this gap. This study introduces the Hierarchical Sparse Query Transformer (HSQformer), a novel hybrid architecture that synergizes CNNs' local feature extraction with Vision Transformers' global contextual awareness through latent space representation and sparse learning. By dynamically activating task-specific experts via a Mixture-of-Experts (MoE) framework, HSQformer achieves hierarchical feature integration without structural redundancy. Evaluated across three clinical scenarios: single-center, multi-center, and high-risk patient cohorts, HSQformer outperforms state-of-the-art models (e.g., 95.38% AUC in multi-center testing) and matches senior radiologists' diagnostic accuracy while significantly surpassing junior counterparts. These results highlight the potential of AI-assisted tools to standardize HCC screening, reduce dependency on human expertise, and improve early diagnosis rates. The full code is available at https://github.com/Asunatan/HSQformer.
中文: 本研究提出的HSQformer模型融合了CNN与视觉Transformer,通过多中心临床验证不仅超越现有检测方法,更达到资深放射科医生诊断水平,为肝细胞癌早期筛查提供了标准化解决方案。
English: The study introduces HSQformer, an AI model combining CNNs and Vision Transformers, which outperforms existing methods in detecting hepatocellular carcinoma across multiple clinical settings and matches senior radiologists' accuracy, offering a standardized screening solution to improve early diagnosis.

Authors:Pouya Samanipour, Hasan Poonawala
Title: Replacing K-infinity Function with Leaky ReLU in Barrier Function Design: A Union of Invariant Sets Approach for ReLU-Based Dynamical Systems
Abstract:
In this paper, a systematic framework is presented for determining piecewise affine PWA barrier functions and their corresponding invariant sets for dynamical systems identified via Rectified Linear Unit (ReLU) neural networks or their equivalent PWA representations. A common approach to determining the invariant set is to use Nagumo's condition, or to utilize the barrier function with a class K-infinity function. It may be challenging to find a suitable class K-infinity function in some cases. We propose leaky ReLU as an efficient substitute for the complex nonlinear K-infinity function in our formulation. Moreover, we propose the Union of Invariant Sets (UIS) method, which combines information from multiple invariant sets in order to compute the largest possible PWA invariant set. The proposed framework is validated through multiple examples, showcasing its potential to enhance the analysis of invariant sets in ReLU-based dynamical systems. Our code is available at: https://github.com/PouyaSamanipour/UIS.git.
中文: 本文提出了一个系统框架,用于计算基于ReLU神经网络的动态系统中的分段仿射屏障函数及其不变集,通过引入leaky ReLU替代复杂非线性函数,并采用不变集并集方法来最大化不变集范围。
English: This paper introduces a systematic framework for computing piecewise affine barrier functions and invariant sets in ReLU neural network-based dynamical systems, proposing leaky ReLU as a substitute for complex nonlinear functions and a Union of Invariant Sets method to maximize invariant set coverage.

Authors:Xiaopeng Li, Shanwen Wang, Shasha Li, Shezheng Song, Bin Ji, Jun Ma, Jie Yu
Title: Rethinking the Residual Distribution of Locate-then-Editing Methods in Model Editing
Abstract:
Model editing is a powerful technique for updating the knowledge of Large Language Models (LLMs). Locate-then-edit methods are a popular class of approaches that first identify the critical layers storing knowledge, then compute the residual of the last critical layer based on the edited knowledge, and finally perform multi-layer updates using a least-squares solution by evenly distributing the residual from the first critical layer to the last. Although these methods achieve promising results, they have been shown to degrade the original knowledge of LLMs. We argue that residual distribution leads to this issue. To explore this, we conduct a comprehensive analysis of residual distribution in locate-then-edit methods from both empirical and theoretical perspectives, revealing that residual distribution introduces editing errors, leading to inaccurate edits. To address this issue, we propose the Boundary Layer UpdatE (BLUE) strategy to enhance locate-then-edit methods. Sequential batch editing experiments on three LLMs and two datasets demonstrate that BLUE not only delivers an average performance improvement of 35.59\%, significantly advancing the state of the art in model editing, but also enhances the preservation of LLMs' general capabilities. Our code is available at https://github.com/xpq-tech/BLUE.
中文: 模型编辑更新大语言模型知识,但现有定位后编辑方法因残差分布导致知识退化,而提出的BLUE策略通过边界层更新解决了该问题,实现了35.59%的性能提升并更好地保持了模型通用能力。
English: Model editing updates Large Language Models' knowledge, but current locate-then-edit methods degrade original knowledge due to residual distribution errors, which the proposed BLUE strategy addresses by improving performance by 35.59% while better preserving general capabilities.

Authors:Feng Wang, Yaodong Yu, Guoyizhe Wei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie
Title: Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
Abstract:
Since the introduction of Vision Transformer (ViT), patchification has long been regarded as a de facto image tokenization approach for plain visual architectures. By compressing the spatial size of images, this approach can effectively shorten the token sequence and reduce the computational cost of ViT-like plain architectures. In this work, we aim to thoroughly examine the information loss caused by this patchification-based compressive encoding paradigm and how it affects visual understanding. We conduct extensive patch size scaling experiments and excitedly observe an intriguing scaling law in patchification: the models can consistently benefit from decreased patch sizes and attain improved predictive performance, until it reaches the minimum patch size of 1x1, i.e., pixel tokenization. This conclusion is broadly applicable across different vision tasks, various input scales, and diverse architectures such as ViT and the recent Mamba models. Moreover, as a by-product, we discover that with smaller patches, task-specific decoder heads become less critical for dense prediction. In the experiments, we successfully scale up the visual sequence to an exceptional length of 50,176 tokens, achieving a competitive test accuracy of 84.6% with a base-sized model on the ImageNet-1k benchmark. We hope this study can provide insights and theoretical foundations for future works of building non-compressive vision models. Code is available at https://github.com/wangf3014/Patch_Scaling.
中文: 研究表明,在视觉Transformer中减小补丁尺寸能持续提升预测性能,最小1x1像素标记化效果最佳,这一发现适用于多种任务和架构。
English: The study reveals that reducing patch sizes in Vision Transformers consistently enhances predictive performance, with the best results achieved at the smallest 1x1 pixel tokenization, applicable across various tasks and architectures.

Authors:Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C. -W. Phan
Title: MD-BERT: Action Recognition in Dark Videos via Dynamic Multi-Stream Fusion and Temporal Modeling
Abstract:
Action recognition in dark, low-light (under-exposed) or noisy videos is a challenging task due to visibility degradation, which can hinder critical spatiotemporal details. This paper proposes MD-BERT, a novel multi-stream approach that integrates complementary pre-processing techniques such as gamma correction and histogram equalization alongside raw dark frames to address these challenges. We introduce the Dynamic Feature Fusion (DFF) module, extending existing attentional fusion methods to a three-stream setting, thereby capturing fine-grained and global contextual information across different brightness and contrast enhancements. The fused spatiotemporal features are then processed by a BERT-based temporal model, which leverages its bidirectional self-attention to effectively capture long-range dependencies and contextual relationships across frames. Extensive experiments on the ARID V1.0 and ARID V1.5 dark video datasets show that MD-BERT outperforms existing methods, establishing a new state-of-the-art performance. Ablation studies further highlight the individual contributions of each input stream and the effectiveness of the proposed DFF and BERT modules. The official website of this work is available at: https://github.com/HrishavBakulBarua/DarkBERT
中文摘要:本文提出MD-BERT多流框架,通过动态特征融合模块整合增强视频输入与BERT时序模型,在暗光视频行为识别任务中实现了最优性能。
English Summary: This paper introduces MD-BERT, a multi-stream framework that combines enhanced video inputs with a BERT-based temporal model to achieve state-of-the-art action recognition in dark videos through dynamic feature fusion.

Authors:Zhouheng Li, Lei Xie, Cheng Hu, Hongye Su
Title: Reduce Lap Time for Autonomous Racing with Curvature-Integrated MPCC Local Trajectory Planning Method
Abstract:
The widespread application of autonomous driving technology has significantly advanced the field of autonomous racing. Model Predictive Contouring Control (MPCC) is a highly effective local trajectory planning method for autonomous racing. However, the traditional MPCC method struggles with racetracks that have significant curvature changes, limiting the performance of the vehicle during autonomous racing. To address this issue, we propose a curvature-integrated MPCC (CiMPCC) local trajectory planning method for autonomous racing. This method optimizes the velocity of the local trajectory based on the curvature of the racetrack centerline. The specific implementation involves mapping the curvature of the racetrack centerline to a reference velocity profile, which is then incorporated into the cost function for optimizing the velocity of the local trajectory. This reference velocity profile is created by normalizing and mapping the curvature of the racetrack centerline, thereby ensuring efficient and performance-oriented local trajectory planning in racetracks with significant curvature. The proposed CiMPCC method has been experimented on a self-built 1:10 scale F1TENTH racing vehicle deployed with ROS platform. The experimental results demonstrate that the proposed method achieves outstanding results on a challenging racetrack with sharp curvature, improving the overall lap time by 11.4%-12.5% compared to other autonomous racing trajectory planning methods. Our code is available at https://github.com/zhouhengli/CiMPCC.
中文: 提出的曲率集成模型预测轮廓控制(CiMPCC)方法通过基于赛道曲率优化轨迹速度,在复杂弯道赛道上相比现有方法实现了11.4%-12.5%的单圈时间提升。
English: The proposed Curvature-integrated Model Predictive Contouring Control (CiMPCC) method enhances autonomous racing performance by optimizing trajectory velocity based on track curvature, achieving 11.4%-12.5% faster lap times on challenging tracks compared to existing methods.

Authors:Kushagra Pandey, Farrin Marouf Sofian, Felix Draxler, Theofanis Karaletsos, Stephan Mandt
Title: Variational Control for Guidance in Diffusion Models
Abstract:
Diffusion models exhibit excellent sample quality, but existing guidance methods often require additional model training or are limited to specific tasks. We revisit guidance in diffusion models from the perspective of variational inference and control, introducing Diffusion Trajectory Matching (DTM) that enables guiding pretrained diffusion trajectories to satisfy a terminal cost. DTM unifies a broad class of guidance methods and enables novel instantiations. We introduce a new method within this framework that achieves state-of-the-art results on several linear, non-linear, and blind inverse problems without requiring additional model training or specificity to pixel or latent space diffusion models. Our code will be available at https://github.com/czi-ai/oc-guidance
中文摘要:扩散轨迹匹配(DTM)通过变分推理统一了预训练扩散模型的多种引导方法,无需额外训练或特定空间适配,即在多种逆问题上实现了最先进的性能。
English Summary: Diffusion Trajectory Matching (DTM) unifies various guidance methods for pretrained diffusion models through variational inference, achieving state-of-the-art performance on multiple inverse problems without requiring additional training or space-specific adaptations.

Authors:Huimin Zeng, Jiacheng Li, Ziqiang Zheng, Zhiwei Xiong
Title: All-in-One Image Compression and Restoration
Abstract:
Visual images corrupted by various types and levels of degradations are commonly encountered in practical image compression. However, most existing image compression methods are tailored for clean images, therefore struggling to achieve satisfying results on these images. Joint compression and restoration methods typically focus on a single type of degradation and fail to address a variety of degradations in practice. To this end, we propose a unified framework for all-in-one image compression and restoration, which incorporates the image restoration capability against various degradations into the process of image compression. The key challenges involve distinguishing authentic image content from degradations, and flexibly eliminating various degradations without prior knowledge. Specifically, the proposed framework approaches these challenges from two perspectives: i.e., content information aggregation, and degradation representation aggregation. Extensive experiments demonstrate the following merits of our model: 1) superior rate-distortion (RD) performance on various degraded inputs while preserving the performance on clean data; 2) strong generalization ability to real-world and unseen scenarios; 3) higher computing efficiency over compared methods. Our code is available at https://github.com/ZeldaM1/All-in-one.
中文摘要:该统一框架将图像修复能力融入压缩过程,无需先验知识即可灵活消除多种退化,在各类场景中实现了优越的性能、泛化能力和计算效率。
English Summary: The proposed unified framework integrates image restoration capabilities into compression to effectively handle various degradations without prior knowledge, achieving superior performance, generalization, and efficiency across diverse scenarios.

Authors:Zhuowei Li, Haizhou Shi, Yunhe Gao, Di Liu, Zhenting Wang, Yuxiao Chen, Ting Liu, Long Zhao, Hao Wang, Dimitris N. Metaxas
Title: The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering
Abstract:
Large Vision-Language Models (LVLMs) can reason effectively over both textual and visual inputs, but they tend to hallucinate syntactically coherent yet visually ungrounded contents. In this paper, we investigate the internal dynamics of hallucination by examining the tokens logits ranking throughout the generation process, revealing three key patterns in how LVLMs process information: (1) gradual visual information loss - visually grounded tokens gradually become less favored throughout generation, and (2) early excitation - semantically meaningful tokens achieve peak activation in the layers earlier than the final layer. (3) hidden genuine information - visually grounded tokens though not being eventually decoded still retain relatively high rankings at inference. Based on these insights, we propose VISTA (Visual Information Steering with Token-logit Augmentation), a training-free inference-time intervention framework that reduces hallucination while promoting genuine information. VISTA works by combining two complementary approaches: reinforcing visual information in activation space and leveraging early layer activations to promote semantically meaningful decoding. Compared to existing methods, VISTA requires no external supervision and is applicable to various decoding strategies. Extensive experiments show that VISTA on average reduces hallucination by about 40% on evaluated open-ended generation task, and it consistently outperforms existing methods on four benchmarks across four architectures under three decoding strategies. Code is available at https://github.com/LzVv123456/VISTA.
中文: 大型视觉语言模型因视觉信息逐渐丢失和早期语义激发而产生幻觉内容,为此提出的VISTA框架通过增强视觉激活与利用早期层输出,无需训练即可有效减少幻觉现象。
English: Large Vision-Language Models often generate visually ungrounded content due to gradual loss of visual information and early token excitation, prompting the development of VISTA, a training-free framework that reduces hallucination by reinforcing visual activations and leveraging early layer outputs.

Authors:Mehrdad Asadi, Komi Sodoké, Ian J. Gerard, Marta Kersten-Oertel
Title: Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function
Abstract:
In this work, we present a novel approach to multi-label chest X-ray (CXR) image classification that enhances clinical interpretability while maintaining a streamlined, single-model, single-run training pipeline. Leveraging the CheXpert dataset and VisualCheXbert-derived labels, we incorporate hierarchical label groupings to capture clinically meaningful relationships between diagnoses. To achieve this, we designed a custom hierarchical binary cross-entropy (HBCE) loss function that enforces label dependencies using either fixed or data-driven penalty types. Our model achieved a mean area under the receiver operating characteristic curve (AUROC) of 0.903 on the test set. Additionally, we provide visual explanations and uncertainty estimations to further enhance model interpretability. All code, model configurations, and experiment details are made available.
中文: 本研究提出了一种新颖的胸部X光多标签分类方法,通过自定义分层损失函数增强临床可解释性,在测试集上取得了0.903的平均AUROC值,同时提供可视化解释和不确定性评估。
English: This study introduces a hierarchical multi-label classification method for chest X-rays that improves clinical interpretability through a custom loss function while maintaining high diagnostic accuracy with a 0.903 AUROC score.

Authors:Liran Nochumsohn, Hedi Zisling, Omri Azencot
Title: A Multi-Task Learning Approach to Linear Multivariate Forecasting
Abstract:
Accurate forecasting of multivariate time series data is important in many engineering and scientific applications. Recent state-of-the-art works ignore the inter-relations between variates, using their model on each variate independently. This raises several research questions related to proper modeling of multivariate data. In this work, we propose to view multivariate forecasting as a multi-task learning problem, facilitating the analysis of forecasting by considering the angle between task gradients and their balance. To do so, we analyze linear models to characterize the behavior of tasks. Our analysis suggests that tasks can be defined by grouping similar variates together, which we achieve via a simple clustering that depends on correlation-based similarities. Moreover, to balance tasks, we scale gradients with respect to their prediction error. Then, each task is solved with a linear model within our MTLinear framework. We evaluate our approach on challenging benchmarks in comparison to strong baselines, and we show it obtains on-par or better results on multivariate forecasting problems. The implementation is available at: https://github.com/azencot-group/MTLinear
中文: 本研究提出MTLinear多任务学习框架,通过将相关变量聚类为任务并基于预测误差平衡梯度,在多变量时间序列预测问题上取得了与现有方法相当或更优的性能。
English: This study introduces MTLinear, a multi-task learning framework for multivariate time series forecasting that groups correlated variates into tasks and balances their gradients based on prediction errors, achieving competitive or superior performance compared to existing methods.

Authors:Darina Koishigarina, Arnas Uselis, Seong Joon Oh
Title: CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
Abstract:
CLIP (Contrastive Language-Image Pretraining) has become a popular choice for various downstream tasks. However, recent studies have questioned its ability to represent compositional concepts effectively. These works suggest that CLIP often acts like a bag-of-words (BoW) model, interpreting images and text as sets of individual concepts without grasping the structural relationships. In particular, CLIP struggles to correctly bind attributes to their corresponding objects when multiple objects are present in an image or text. In this work, we investigate why CLIP exhibits this BoW-like behavior. We find that the correct attribute-object binding information is already present in individual text and image modalities. Instead, the issue lies in the cross-modal alignment, which relies on cosine similarity. To address this, we propose Linear Attribute Binding CLIP or LABCLIP. It applies a linear transformation to text embeddings before computing cosine similarity. This approach significantly improves CLIP's ability to bind attributes to correct objects, thereby enhancing its compositional understanding. The code is available at https://github.com/kdariina/CLIP-not-BoW-unimodally.
中文: CLIP常因跨模态对齐问题无法正确绑定属性与对象,而LABCLIP通过在计算相似度前对文本嵌入进行线性变换,显著提升了这种绑定能力。
English: CLIP often fails to bind attributes to objects due to cross-modal alignment issues, but LABCLIP enhances this by applying a linear transformation to text embeddings before similarity computation.

Authors:Yassine El Kheir, Youness Samih, Suraj Maharjan, Tim Polzehl, Sebastian Möller
Title: Comprehensive Layer-wise Analysis of SSL Models for Audio Deepfake Detection
Abstract:
This paper conducts a comprehensive layer-wise analysis of self-supervised learning (SSL) models for audio deepfake detection across diverse contexts, including multilingual datasets (English, Chinese, Spanish), partial, song, and scene-based deepfake scenarios. By systematically evaluating the contributions of different transformer layers, we uncover critical insights into model behavior and performance. Our findings reveal that lower layers consistently provide the most discriminative features, while higher layers capture less relevant information. Notably, all models achieve competitive equal error rate (EER) scores even when employing a reduced number of layers. This indicates that we can reduce computational costs and increase the inference speed of detecting deepfakes by utilizing only a few lower layers. This work enhances our understanding of SSL models in deepfake detection, offering valuable insights applicable across varied linguistic and contextual settings. Our trained models and code are publicly available: https://github.com/Yaselley/SSL_Layerwise_Deepfake.
本研究发现在用于音频深度伪造检测的自监督学习模型中,下层Transformer层提供最具区分性的特征,仅用少量下层即可在多种场景下保持竞争力,从而降低计算成本并加速推理。
This study reveals that in self-supervised learning models for audio deepfake detection, lower transformer layers provide the most discriminative features, enabling competitive performance with fewer layers to reduce computational costs and speed up inference across diverse scenarios.

Authors:SiYeoul Lee, SeonHo Kim, Minkyung Seo, SeongKyu Park, Salehin Imrus, Kambaluru Ashok, DongEon Lee, Chunsu Park, SeonYeong Lee, Jiye Kim, Jae-Heung Yoo, MinWoo Kim
Title: Enhancing Free-hand 3D Photoacoustic and Ultrasound Reconstruction using Deep Learning
Abstract:
This study introduces a motion-based learning network with a global-local self-attention module (MoGLo-Net) to enhance 3D reconstruction in handheld photoacoustic and ultrasound (PAUS) imaging. Standard PAUS imaging is often limited by a narrow field of view and the inability to effectively visualize complex 3D structures. The 3D freehand technique, which aligns sequential 2D images for 3D reconstruction, faces significant challenges in accurate motion estimation without relying on external positional sensors. MoGLo-Net addresses these limitations through an innovative adaptation of the self-attention mechanism, which effectively exploits the critical regions, such as fully-developed speckle area or high-echogenic tissue area within successive ultrasound images to accurately estimate motion parameters. This facilitates the extraction of intricate features from individual frames. Additionally, we designed a patch-wise correlation operation to generate a correlation volume that is highly correlated with the scanning motion. A custom loss function was also developed to ensure robust learning with minimized bias, leveraging the characteristics of the motion parameters. Experimental evaluations demonstrated that MoGLo-Net surpasses current state-of-the-art methods in both quantitative and qualitative performance metrics. Furthermore, we expanded the application of 3D reconstruction technology beyond simple B-mode ultrasound volumes to incorporate Doppler ultrasound and photoacoustic imaging, enabling 3D visualization of vasculature. The source code for this study is publicly available at: https://github.com/guhong3648/US3D
中文: 本研究提出MoGLo-Net运动学习网络,通过全局-局部自注意力模块改进手持式光声超声成像的3D重建,无需外部传感器即可精确估计运动参数,性能超越现有方法,并可扩展至多普勒和光声血管成像应用。
English: This study presents MoGLo-Net, a motion-based learning network with a global-local self-attention module that improves 3D reconstruction in handheld PAUS imaging by accurately estimating motion parameters without external sensors, outperforming current methods and extending to Doppler and photoacoustic vascular visualization.

Authors:Qiuhong Shen, Xuanyu Yi, Mingbao Lin, Hanwang Zhang, Shuicheng Yan, Xinchao Wang
Title: Seeing World Dynamics in a Nutshell
Abstract:
We consider the problem of efficiently representing casually captured monocular videos in a spatially- and temporally-coherent manner. While existing approaches predominantly rely on 2D/2.5D techniques treating videos as collections of spatiotemporal pixels, they struggle with complex motions, occlusions, and geometric consistency due to absence of temporal coherence and explicit 3D structure. Drawing inspiration from monocular video as a projection of the dynamic 3D world, we explore representing videos in their intrinsic 3D form through continuous flows of Gaussian primitives in space-time. In this paper, we propose NutWorld, a novel framework that efficiently transforms monocular videos into dynamic 3D Gaussian representations in a single forward pass. At its core, NutWorld introduces a structured spatial-temporal aligned Gaussian (STAG) representation, enabling optimization-free scene modeling with effective depth and flow regularization. Through comprehensive experiments, we demonstrate that NutWorld achieves high-fidelity video reconstruction quality while enabling various downstream applications in real-time. Demos and code will be available at https://github.com/Nut-World/NutWorld.
中文摘要:本文提出NutWorld框架,通过单次前向传播将单目视频高效转换为动态3D高斯表示,实现了高保真视频重建并支持实时下游应用。
English Summary: This paper introduces NutWorld, a novel framework that efficiently converts monocular videos into dynamic 3D Gaussian representations in a single forward pass, achieving high-fidelity reconstruction and enabling real-time applications.

Authors:Joshua Vendrow, Edward Vendrow, Sara Beery, Aleksander Madry
Title: Do Large Language Model Benchmarks Test Reliability?
Abstract:
When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs' growing capabilities, however there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior. Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle. We provide code at https://github.com/MadryLab/platinum-benchmarks
中文摘要:现有大语言模型基准常存在标签错误而掩盖可靠性问题,为此提出的铂金基准揭示了先进模型在简单任务上仍存在持续性缺陷。
English Summary: Current benchmarks for large language models often contain label errors that obscure reliability issues, prompting the creation of platinum benchmarks which reveal persistent failures in even advanced models on simple tasks.

Authors:Rui Pan, Boyao Wang, Shizhe Diao, Xingyuan Pan, Jipeng Zhang, Renjie Pi, Tong Zhang
Title: Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training
Abstract:
Small language models (SLMs) have attracted considerable attention from both academia and industry due to their broad range of applications in edge devices. To obtain SLMs with strong performance, conventional approaches either pre-train the models from scratch, which incurs substantial computational costs, or compress/prune existing large language models (LLMs), which results in performance drops and falls short in comparison to pre-training. In this paper, we investigate the family of acceleration methods that involve both structured pruning and model training. We found 1) layer-wise adaptive pruning (Adapt-Pruner) is extremely effective in LLMs and yields significant improvements over existing pruning techniques, 2) adaptive pruning equipped with further training leads to models comparable to those pre-training from scratch, 3) incremental pruning brings non-trivial performance gain by interleaving pruning with training and only removing a small portion of neurons ($\sim$5%) at a time. Experimental results on LLaMA-3.1-8B demonstrate that Adapt-Pruner outperforms conventional pruning methods, such as LLM-Pruner, FLAP, and SliceGPT, by an average of 1%-7% in accuracy on commonsense benchmarks. Additionally, Adapt-Pruner restores the performance of MobileLLM-125M to 600M on the MMLU benchmark with 200$\times$ fewer tokens via pruning from its larger counterparts, and discovers a new 1B model that surpasses LLaMA-3.2-1B in multiple benchmarks. The official code is released at https://github.com/research4pan/AdaptPruner.
中文: 自适应剪枝结合增量训练使小型语言模型在显著降低计算成本的同时,性能可媲美预训练模型,并优于传统剪枝方法。
English: Adaptive pruning combined with incremental training enables small language models to achieve performance comparable to pre-trained models while significantly reducing computational costs and outperforming conventional pruning methods.

Authors:Rudolf Herdt, Daniel Otero Baguer
Title: Concept Based Explanations and Class Contrasting
Abstract:
Explaining deep neural networks is challenging, due to their large size and non-linearity. In this paper, we introduce a concept-based explanation method, in order to explain the prediction for an individual class, as well as contrasting any two classes, i.e. explain why the model predicts one class over the other. We test it on several openly available classification models trained on ImageNet1K. We perform both qualitative and quantitative tests. For example, for a ResNet50 model from pytorch model zoo, we can use the explanation for why the model predicts a class 'A' to automatically select four dataset crops where the model does not predict class 'A'. The model then predicts class 'A' again for the newly combined image in 91.1% of the cases (works for 911 out of the 1000 classes). The code including an .ipynb example is available on github: https://github.com/rherdt185/concept-based-explanations-and-class-contrasting
Chinese: 本文提出了一种基于概念的解释方法,用于阐明深度神经网络对单个类别的预测及类别间对比,在ImageNet模型上通过定性和定量测试验证了其91.1%的有效性。
English: This paper presents a concept-based explanation method for deep neural networks that clarifies individual class predictions and contrasts between classes, validated through qualitative and quantitative tests on ImageNet models with 91.1% effectiveness.

Authors:Xinyu Mao, Teerapong Leelanupab, Harrisen Scells, Guido Zuccon
Title: DenseReviewer: A Screening Prioritisation Tool for Systematic Review based on Dense Retrieval
Abstract:
Screening is a time-consuming and labour-intensive yet required task for medical systematic reviews, as tens of thousands of studies often need to be screened. Prioritising relevant studies to be screened allows downstream systematic review creation tasks to start earlier and save time. In previous work, we developed a dense retrieval method to prioritise relevant studies with reviewer feedback during the title and abstract screening stage. Our method outperforms previous active learning methods in both effectiveness and efficiency. In this demo, we extend this prior work by creating (1) a web-based screening tool that enables end-users to screen studies exploiting state-of-the-art methods and (2) a Python library that integrates models and feedback mechanisms and allows researchers to develop and demonstrate new active learning methods. We describe the tool's design and showcase how it can aid screening. The tool is available at https://densereviewer.ielab.io. The source code is also open sourced at https://github.com/ielab/densereviewer.
中文: 该摘要介绍了一款基于网络的筛选工具和Python库,采用先进的密集检索和主动学习方法,能高效优先处理医学系统综述相关研究,从而节省时间并提升筛选效果。
English: This abstract introduces a web-based screening tool and Python library that use advanced dense retrieval and active learning methods to efficiently prioritize relevant studies for medical systematic reviews, saving time and improving effectiveness.

Authors:Hongli Zhan, Muneeza Azmat, Raya Horesh, Junyi Jessy Li, Mikhail Yurochkin
Title: SPRI: Aligning Large Language Models with Context-Situated Principles
Abstract:
Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous since it is resource-intensive and time-consuming to depend on human expertise for context-specific guidance. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that is designed to automatically generate guiding principles in real-time for each input query and utilize them to align each response. We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task that leads to on-par performance as expert-crafted ones; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness. We release our code and model generations at https://github.com/honglizhan/SPRI-public.
将大型语言模型与人类价值观对齐因依赖人工监督成本高昂而困难重重,但SPRI框架通过为每个查询自动生成实时情境化原则,无需大量人工介入即可提升模型表现与真实性。
Aligning Large Language Models with human values is challenging due to the high cost of human oversight, but the SPRI framework addresses this by automatically generating real-time, context-specific principles for each query, enhancing performance and truthfulness without extensive human input.

Authors:Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue
Title: Demystifying Long Chain-of-Thought Reasoning in LLMs
Abstract:
Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.
中文: 本研究通过监督微调和强化学习的系统实验,揭示了增强大语言模型中长思维链推理能力的关键因素,包括奖励塑造和训练计算扩展,并发现基础模型虽具备核心能力但需精细优化策略。
English: This study systematically investigates how to enhance long chain-of-thought reasoning in large language models through supervised fine-tuning and reinforcement learning, identifying key factors like reward shaping and training compute scaling while revealing that core abilities exist in base models but require careful optimization.

Authors:Yu Wang, Lei Sang, Yi Zhang, Yiwen Zhang
Title: Intent Representation Learning with Large Language Model for Recommendation
Abstract:
Intent-based recommender systems have garnered significant attention for uncovering latent fine-grained preferences. Intents, as underlying factors of interactions, are crucial for improving recommendation interpretability. Most methods define intents as learnable parameters updated alongside interactions. However, existing frameworks often overlook textual information (e.g., user reviews, item descriptions), which is crucial for alleviating the sparsity of interaction intents. Exploring these multimodal intents, especially the inherent differences in representation spaces, poses two key challenges: i) How to align multimodal intents and effectively mitigate noise issues; ii) How to extract and match latent key intents across modalities. To tackle these challenges, we propose a model-agnostic framework, Intent Representation Learning with Large Language Model (IRLLRec), which leverages large language models (LLMs) to construct multimodal intents and enhance recommendations. Specifically, IRLLRec employs a dual-tower architecture to learn multimodal intent representations. Next, we propose pairwise and translation alignment to eliminate inter-modal differences and enhance robustness against noisy input features. Finally, to better match textual and interaction-based intents, we employ momentum distillation to perform teacher-student learning on fused intent representations. Empirical evaluations on three datasets show that our IRLLRec framework outperforms baselines.Code available at https://github.com/wangyu0627/IRLLRec.
中文摘要:IRLLRec框架利用大型语言模型构建多模态意图,通过双塔架构学习表示、对齐消除模态差异,并采用动量蒸馏匹配潜在关键意图,从而提升推荐性能。
English Summary: The IRLLRec framework leverages large language models to align multimodal intents and employs momentum distillation to enhance recommendation accuracy by addressing representation differences and noise across interaction and textual data.

Authors:Ying Zhang, Maoliang Yin, Wenfu Bi, Haibao Yan, Shaohan Bian, Cui-Hua Zhang, Changchun Hua
Title: ZISVFM: Zero-Shot Object Instance Segmentation in Indoor Robotic Environments with Vision Foundation Models
Abstract:
Service robots operating in unstructured environments must effectively recognize and segment unknown objects to enhance their functionality. Traditional supervised learningbased segmentation techniques require extensive annotated datasets, which are impractical for the diversity of objects encountered in real-world scenarios. Unseen Object Instance Segmentation (UOIS) methods aim to address this by training models on synthetic data to generalize to novel objects, but they often suffer from the simulation-to-reality gap. This paper proposes a novel approach (ZISVFM) for solving UOIS by leveraging the powerful zero-shot capability of the segment anything model (SAM) and explicit visual representations from a selfsupervised vision transformer (ViT). The proposed framework operates in three stages: (1) generating object-agnostic mask proposals from colorized depth images using SAM, (2) refining these proposals using attention-based features from the selfsupervised ViT to filter non-object masks, and (3) applying K-Medoids clustering to generate point prompts that guide SAM towards precise object segmentation. Experimental validation on two benchmark datasets and a self-collected dataset demonstrates the superior performance of ZISVFM in complex environments, including hierarchical settings such as cabinets, drawers, and handheld objects. Our source code is available at https://github.com/Yinmlmaoliang/zisvfm.
中文摘要:本文提出ZISVFM框架,通过结合SAM的零样本分割能力和自监督ViT特征,采用掩码生成、精化及聚类提示的三阶段方法,在复杂环境中实现了精确的未知物体实例分割。
English Summary: This paper introduces ZISVFM, a novel framework that combines the zero-shot segmentation capability of SAM with self-supervised ViT features to achieve precise unseen object instance segmentation through a three-stage process of mask proposal, refinement, and clustering-based prompting.

Authors:Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen
Title: Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration
Abstract:
Recently computer-aided diagnosis has demonstrated promising performance, effectively alleviating the workload of clinicians. However, the inherent sample imbalance among different diseases leads algorithms biased to the majority categories, leading to poor performance for rare categories. Existing works formulated this challenge as a long-tailed problem and attempted to tackle it by decoupling the feature representation and classification. Yet, due to the imbalanced distribution and limited samples from tail classes, these works are prone to biased representation learning and insufficient classifier calibration. To tackle these problems, we propose a new Long-tailed Medical Diagnosis (LMD) framework for balanced medical image classification on long-tailed datasets. In the initial stage, we develop a Relation-aware Representation Learning (RRL) scheme to boost the representation ability by encouraging the encoder to capture intrinsic semantic features through different data augmentations. In the subsequent stage, we propose an Iterative Classifier Calibration (ICC) scheme to calibrate the classifier iteratively. This is achieved by generating a large number of balanced virtual features and fine-tuning the encoder using an Expectation-Maximization manner. The proposed ICC compensates for minority categories to facilitate unbiased classifier optimization while maintaining the diagnostic knowledge in majority classes. Comprehensive experiments on three public long-tailed medical datasets demonstrate that our LMD framework significantly surpasses state-of-the-art approaches. The source code can be accessed at https://github.com/peterlipan/LMD.
中文: 提出的长尾医疗诊断框架通过关系感知表征学习和迭代分类器校准解决医学图像中的类别不平衡问题,在三个公共数据集上显著超越了现有最优方法。
English: The proposed Long-tailed Medical Diagnosis (LMD) framework addresses class imbalance in medical image analysis through Relation-aware Representation Learning and Iterative Classifier Calibration, achieving state-of-the-art performance on three public datasets.

Authors:Ruizhe Li, Grazziela Figueredo, Dorothee Auer, Rob Dineen, Paul Morgan, Xin Chen
Title: A Unified Framework for Semi-Supervised Image Segmentation and Registration
Abstract:
Semi-supervised learning, which leverages both annotated and unannotated data, is an efficient approach for medical image segmentation, where obtaining annotations for the whole dataset is time-consuming and costly. Traditional semi-supervised methods primarily focus on extracting features and learning data distributions from unannotated data to enhance model training. In this paper, we introduce a novel approach incorporating an image registration model to generate pseudo-labels for the unannotated data, producing more geometrically correct pseudo-labels to improve the model training. Our method was evaluated on a 2D brain data set, showing excellent performance even using only 1\% of the annotated data. The results show that our approach outperforms conventional semi-supervised segmentation methods (e.g. teacher-student model), particularly in a low percentage of annotation scenario. GitHub: https://github.com/ruizhe-l/UniSegReg.
中文: 本文提出了一种新颖的半监督医学图像分割方法,通过引入图像配准模型生成几何更准确的伪标签,在极少量标注数据下显著优于传统方法。
English: This paper presents a novel semi-supervised medical image segmentation method that integrates an image registration model to generate geometrically accurate pseudo-labels, demonstrating superior performance with minimal annotated data compared to traditional approaches.

Authors:Xiangyu Dong, Xingyi Zhang, Lei Chen, Mingxuan Yuan, Sibo Wang
Title: SpaceGNN: Multi-Space Graph Neural Network for Node Anomaly Detection with Extremely Limited Labels
Abstract:
Node Anomaly Detection (NAD) has gained significant attention in the deep learning community due to its diverse applications in real-world scenarios. Existing NAD methods primarily embed graphs within a single Euclidean space, while overlooking the potential of non-Euclidean spaces. Besides, to address the prevalent issue of limited supervision in real NAD tasks, previous methods tend to leverage synthetic data to collect auxiliary information, which is not an effective solution as shown in our experiments. To overcome these challenges, we introduce a novel SpaceGNN model designed for NAD tasks with extremely limited labels. Specifically, we provide deeper insights into a task-relevant framework by empirically analyzing the benefits of different spaces for node representations, based on which, we design a Learnable Space Projection function that effectively encodes nodes into suitable spaces. Besides, we introduce the concept of weighted homogeneity, which we empirically and theoretically validate as an effective coefficient during information propagation. This concept inspires the design of the Distance Aware Propagation module. Furthermore, we propose the Multiple Space Ensemble module, which extracts comprehensive information for NAD under conditions of extremely limited supervision. Our findings indicate that this module is more beneficial than data augmentation techniques for NAD. Extensive experiments conducted on 9 real datasets confirm the superiority of SpaceGNN, which outperforms the best rival by an average of 8.55% in AUC and 4.31% in F1 scores. Our code is available at https://github.com/xydong127/SpaceGNN.
中文摘要:SpaceGNN模型通过将节点编码到合适的非欧几里得空间并引入加权同质性概念来改进信息传播,有效解决了节点异常检测中监督信息有限的问题,在多个真实数据集上表现出优越性能。
English Summary: The SpaceGNN model addresses limitations in Node Anomaly Detection by encoding nodes into suitable non-Euclidean spaces and introducing weighted homogeneity for improved information propagation, achieving superior performance with limited supervision.

Authors:Yuchao Wu, Xiaofei Yu, Hao Chen, Yang Luo, Yeyu Tong, Yuzhe Ma
Title: PICBench: Benchmarking LLMs for Photonic Integrated Circuits Design
Abstract:
While large language models (LLMs) have shown remarkable potential in automating various tasks in digital chip design, the field of Photonic Integrated Circuits (PICs)-a promising solution to advanced chip designs-remains relatively unexplored in this context. The design of PICs is time-consuming and prone to errors due to the extensive and repetitive nature of code involved in photonic chip design. In this paper, we introduce PICBench, the first benchmarking and evaluation framework specifically designed to automate PIC design generation using LLMs, where the generated output takes the form of a netlist. Our benchmark consists of dozens of meticulously crafted PIC design problems, spanning from fundamental device designs to more complex circuit-level designs. It automatically evaluates both the syntax and functionality of generated PIC designs by comparing simulation outputs with expert-written solutions, leveraging an open-source simulator. We evaluate a range of existing LLMs, while also conducting comparative tests on various prompt engineering techniques to enhance LLM performance in automated PIC design. The results reveal the challenges and potential of LLMs in the PIC design domain, offering insights into the key areas that require further research and development to optimize automation in this field. Our benchmark and evaluation code is available at https://github.com/PICDA/PICBench.
中文摘要:本文提出了PICBench,这是首个利用大语言模型自动生成光子集成电路网表,并通过与专家方案对比来评估设计语法和功能性的基准测试框架。
English Summary: This paper introduces PICBench, the first benchmarking framework that uses large language models to automate photonic integrated circuit design by generating netlists and evaluating their syntax and functionality against expert solutions.

Authors:Yifan Sun, Rui Chen, Kai S. Yun, Yikuan Fang, Sebin Jung, Feihan Li, Bowei Li, Weiye Zhao, Changliu Liu
Title: SPARK: A Modular Benchmark for Humanoid Robot Safety
Abstract:
This paper introduces the Safe Protective and Assistive Robot Kit (SPARK), a comprehensive benchmark designed to ensure safety in humanoid autonomy and teleoperation. Humanoid robots pose significant safety risks due to their physical capabilities of interacting with complex environments. The physical structures of humanoid robots further add complexity to the design of general safety solutions. To facilitate safe deployment of complex robot systems, SPARK can be used as a toolbox that comes with state-of-the-art safe control algorithms in a modular and composable robot control framework. Users can easily configure safety criteria and sensitivity levels to optimize the balance between safety and performance. To accelerate humanoid safety research and development, SPARK provides simulation benchmarks that compare safety approaches in a variety of environments, tasks, and robot models. Furthermore, SPARK allows quick deployment of synthesized safe controllers on real robots. For hardware deployment, SPARK supports Apple Vision Pro (AVP) or a Motion Capture System as external sensors, while offering interfaces for seamless integration with alternative hardware setups at the same time. This paper demonstrates SPARK's capability with both simulation experiments and case studies with a Unitree G1 humanoid robot. Leveraging these advantages of SPARK, users and researchers can significantly improve the safety of their humanoid systems as well as accelerate relevant research. The open source code is available at: https://github.com/intelligent-control-lab/spark.
中文: 本文介绍SPARK,一个模块化基准框架与工具包,通过可配置安全算法和仿真基准提升人形机器人自主与遥操作安全性,并通过实验与案例验证其有效性。
English: This paper presents SPARK, a modular benchmark and toolbox for enhancing safety in humanoid robot autonomy and teleoperation through configurable safety algorithms and simulation benchmarks, validated via experiments and case studies.

Authors:Wen Yan, Qianye Yang, Shiqi Huang, Yipei Wang, Shonit Punwani, Mark Emberton, Vasilis Stavrinides, Yipeng Hu, Dean Barratt
Title: Tell2Reg: Establishing spatial correspondence between images by the same language prompts
Abstract:
Spatial correspondence can be represented by pairs of segmented regions, such that the image registration networks aim to segment corresponding regions rather than predicting displacement fields or transformation parameters. In this work, we show that such a corresponding region pair can be predicted by the same language prompt on two different images using the pre-trained large multimodal models based on GroundingDINO and SAM. This enables a fully automated and training-free registration algorithm, potentially generalisable to a wide range of image registration tasks. In this paper, we present experimental results using one of the challenging tasks, registering inter-subject prostate MR images, which involves both highly variable intensity and morphology between patients. Tell2Reg is training-free, eliminating the need for costly and time-consuming data curation and labelling that was previously required for this registration task. This approach outperforms unsupervised learning-based registration methods tested, and has a performance comparable to weakly-supervised methods. Additional qualitative results are also presented to suggest that, for the first time, there is a potential correlation between language semantics and spatial correspondence, including the spatial invariance in language-prompted regions and the difference in language prompts between the obtained local and global correspondences. Code is available at https://github.com/yanwenCi/Tell2Reg.git.
中文摘要:本文提出Tell2Reg这一无需训练的图像配准方法,通过语言提示与预训练多模态模型自动分割不同图像中的对应区域,在无需数据标注的情况下实现了与弱监督方法相当的性能。
English Summary: This paper introduces Tell2Reg, a training-free image registration method that uses language prompts with pre-trained multimodal models to automatically segment corresponding regions in different images, achieving performance comparable to weakly-supervised methods without requiring data labeling.

Authors:Dan MacKinlay
Title: The Ensemble Kalman Update is an Empirical Matheron Update
Abstract:
The Ensemble Kalman Filter (EnKF) is a widely used method for data assimilation in high-dimensional systems, with an ensemble update step equivalent to an empirical version of the Matheron update popular in Gaussian process regression -- a connection that links half a century of data-assimilation engineering to modern path-wise GP sampling. This paper provides a compact introduction to this simple but under-exploited connection, with necessary definitions accessible to all fields involved. Source code is available at https://github.com/danmackinlay/paper_matheron_equals_enkf .
中文: 集合卡尔曼滤波的集合更新步骤等同于高斯过程回归中的马瑟隆更新,这一联系将半个世纪的数据同化工程与现代路径式GP采样技术联系起来。
English: The Ensemble Kalman Filter's ensemble update is equivalent to the Matheron update in Gaussian process regression, connecting decades of data assimilation with modern GP sampling techniques.

Authors:Yufei Ye, Wei Guo, Jin Yao Chin, Hao Wang, Hong Zhu, Xi Lin, Yuyang Ye, Yong Liu, Ruiming Tang, Defu Lian, Enhong Chen
Title: FuXi-$α$: Scaling Recommendation Model with Feature Interaction Enhanced Transformer
Abstract:
Inspired by scaling laws and large language models, research on large-scale recommendation models has gained significant attention. Recent advancements have shown that expanding sequential recommendation models to large-scale recommendation models can be an effective strategy. Current state-of-the-art sequential recommendation models primarily use self-attention mechanisms for explicit feature interactions among items, while implicit interactions are managed through Feed-Forward Networks (FFNs). However, these models often inadequately integrate temporal and positional information, either by adding them to attention weights or by blending them with latent representations, which limits their expressive power. A recent model, HSTU, further reduces the focus on implicit feature interactions, constraining its performance. We propose a new model called FuXi-$α$ to address these issues. This model introduces an Adaptive Multi-channel Self-attention mechanism that distinctly models temporal, positional, and semantic features, along with a Multi-stage FFN to enhance implicit feature interactions. Our offline experiments demonstrate that our model outperforms existing models, with its performance continuously improving as the model size increases. Additionally, we conducted an online A/B test within the Huawei Music app, which showed a $4.76\%$ increase in the average number of songs played per user and a $5.10\%$ increase in the average listening duration per user. Our code has been released at https://github.com/USTC-StarTeam/FuXi-alpha.
中文摘要:FuXi-α模型通过自适应多通道自注意力机制分别建模时间、位置和语义特征,并采用多阶段前馈网络增强隐式交互,在离线实验中表现优异,在线A/B测试显著提升了用户平均播放歌曲数量和收听时长。
English Summary: The FuXi-α model enhances sequential recommendations by introducing an Adaptive Multi-channel Self-attention mechanism for distinct temporal, positional, and semantic feature modeling and a Multi-stage FFN to improve implicit interactions, achieving superior offline performance and significant online gains in user engagement metrics.

Authors:Mohannad Takrouri, Nicolás M. Cuadrado, Martin Takáč
Title: Knowledge Distillation from Large Language Models for Household Energy Modeling
Abstract:
Machine learning (ML) is increasingly vital for smart-grid research, yet restricted access to realistic, diverse data - often due to privacy concerns - slows progress and fuels doubts within the energy sector about adopting ML-based strategies. We propose integrating Large Language Models (LLMs) in energy modeling to generate realistic, culturally sensitive, and behavior-specific data for household energy usage across diverse geographies. In this study, we employ and compare five different LLMs to systematically produce family structures, weather patterns, and daily consumption profiles for households in six distinct countries. A four-stage methodology synthesizes contextual daily data, including culturally nuanced activities, realistic weather ranges, HVAC operations, and distinct `energy signatures' that capture unique consumption footprints. Additionally, we explore an alternative strategy where external weather datasets can be directly integrated, bypassing intermediate weather modeling stages while ensuring physically consistent data inputs. The resulting dataset provides insights into how cultural, climatic, and behavioral factors converge to shape carbon emissions, offering a cost-effective avenue for scenario-based energy optimization. This approach underscores how prompt engineering, combined with knowledge distillation, can advance sustainable energy research and climate mitigation efforts. Source code is available at https://github.com/Singularity-AI-Lab/LLM-Energy-Knowledge-Distillation .
中文摘要:本研究提出利用大型语言模型生成真实且具有文化敏感性的家庭能源数据,解决了智能电网研究中数据稀缺的问题,并能够对文化、气候和行为因素共同影响的碳排放进行成本效益分析。
English Summary: This study introduces a method using Large Language Models to generate realistic and culturally sensitive household energy data, addressing data scarcity in smart-grid research and enabling cost-effective analysis of carbon emissions influenced by cultural, climatic, and behavioral factors.

Authors:Hao Zeng, Kangdao Liu, Bingyi Jing, Hongxin Wei
Title: Parametric Scaling Law of Tuning Bias in Conformal Prediction
Abstract:
Conformal prediction is a popular framework of uncertainty quantification that constructs prediction sets with coverage guarantees. To uphold the exchangeability assumption, many conformal prediction methods necessitate an additional holdout set for parameter tuning. Yet, the impact of violating this principle on coverage remains underexplored, making it ambiguous in practical applications. In this work, we empirically find that the tuning bias - the coverage gap introduced by leveraging the same dataset for tuning and calibration, is negligible for simple parameter tuning in many conformal prediction methods. In particular, we observe the scaling law of the tuning bias: this bias increases with parameter space complexity and decreases with calibration set size. Formally, we establish a theoretical framework to quantify the tuning bias and provide rigorous proof for the scaling law of the tuning bias by deriving its upper bound. In the end, we discuss how to reduce the tuning bias, guided by the theories we developed.
中文: 本研究证明,在共形预测中因使用同一数据集进行参数调优和校准而产生的调优偏差对于简单参数调优可忽略不计,且该偏差遵循参数复杂度增加而增大、校准集规模扩大而减小的缩放规律,并通过理论分析和提出的缓解策略加以验证。
English: This study demonstrates that the tuning bias in conformal prediction, arising from using the same dataset for parameter tuning and calibration, is minimal for simple parameter tuning and follows a scaling law where bias increases with parameter complexity but decreases with calibration set size, supported by theoretical analysis and proposed mitigation strategies.

Authors:Berné L. Nortier, Simon Dobson, Federico Battiston
Title: Higher-order shortest paths in hypergraphs
Abstract:
One of the defining features of complex networks is the connectivity properties that we observe emerging from local interactions. Recently, hypergraphs have emerged as a versatile tool to model networks with non-dyadic, higher-order interactions. Nevertheless, the connectivity properties of real-world hypergraphs remain largely understudied. In this work we introduce path size as a measure to characterise higher-order connectivity and quantify the relevance of non-dyadic ties for efficient shortest paths in a diverse set of empirical networks with and without temporal information. By comparing our results with simple randomised null models, our analysis presents a nuanced picture, suggesting that non-dyadic ties are often central and are vital for system connectivity, while dyadic edges remain essential to connect more peripheral nodes, an effect which is particularly pronounced for time-varying systems. Our work contributes to a better understanding of the structural organisation of systems with higher-order interactions.
Chinese: 本研究引入路径大小来评估超图中的高阶连通性,发现非二元联系对系统连通性至关重要,而二元边则连接外围节点,这一效应在时变网络中尤为显著。
English: This study introduces path size to assess higher-order connectivity in hypergraphs, revealing that non-dyadic ties are crucial for system connectivity while dyadic edges link peripheral nodes, especially in time-varying networks.

Authors:Seng Pei Liew, Takuya Kato, Sho Takase
Title: Scaling Laws for Upcycling Mixture-of-Experts Language Models
Abstract:
Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches of mitigating such computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models like mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs to MoE models, of which the scaling behavior remains underexplored. Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. Particularly, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training dataset that limits the efficiency of upcycling at large computational budgets. Based on these findings, we provide guidance to scale upcycling, and establish conditions under which upcycling outperforms from-scratch trainings within budget constraints.
中文: 本研究探索了将大型语言模型升级为专家混合架构的扩展规律,通过实证发现性能随数据集规模和模型配置扩展而提升,但识别出密集与升级数据集间的限制性交互作用,最终为预算内高效升级提供了策略指导。
English: This study explores the scaling behavior of upcycling large language models into mixture-of-experts architectures, revealing empirical laws that show performance gains from scaling dataset size and model configuration but identify a limiting interaction between dense and upcycled datasets, ultimately providing guidance for cost-effective upcycling strategies.

Authors:Yang Li, Jinpei Guo, Runzhong Wang, Hongyuan Zha, Junchi Yan
Title: Fast T2T: Optimization Consistency Speeds Up Diffusion-Based Training-to-Testing Solving for Combinatorial Optimization
Abstract:
Diffusion models have recently advanced Combinatorial Optimization (CO) as a powerful backbone for neural solvers. However, their iterative sampling process requiring denoising across multiple noise levels incurs substantial overhead. We propose to learn direct mappings from different noise levels to the optimal solution for a given instance, facilitating high-quality generation with minimal shots. This is achieved through an optimization consistency training protocol, which, for a given instance, minimizes the difference among samples originating from varying generative trajectories and time steps relative to the optimal solution. The proposed model enables fast single-step solution generation while retaining the option of multi-step sampling to trade for sampling quality, which offers a more effective and efficient alternative backbone for neural solvers. In addition, within the training-to-testing (T2T) framework, to bridge the gap between training on historical instances and solving new instances, we introduce a novel consistency-based gradient search scheme during the test stage, enabling more effective exploration of the solution space learned during training. It is achieved by updating the latent solution probabilities under objective gradient guidance during the alternation of noise injection and denoising steps. We refer to this model as Fast T2T. Extensive experiments on two popular tasks, the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), demonstrate the superiority of Fast T2T regarding both solution quality and efficiency, even outperforming LKH given limited time budgets. Notably, Fast T2T with merely one-step generation and one-step gradient search can mostly outperform the SOTA diffusion-based counterparts that require hundreds of steps, while achieving tens of times speedup.
Chinese: 提出的Fast T2T模型通过学习从噪声水平到最优解的直接映射,实现了组合优化的高效单步求解,同时结合基于一致性的梯度搜索来提升解的质量和速度,在旅行商问题和最大独立集任务上展现出优越性能。
English: The proposed Fast T2T model enables efficient single-step solution generation for combinatorial optimization by learning direct mappings from noise levels to optimal solutions, while incorporating a consistency-based gradient search to enhance solution quality and speed.

Authors:T. Chay-intr, Y. Chen, K. Viriyayudhakorn, T. Theeramunkong
Title: LLaVAC: Fine-tuning LLaVA as a Multimodal Sentiment Classifier
Abstract:
We present LLaVAC, a method for constructing a classifier for multimodal sentiment analysis. This method leverages fine-tuning of the Large Language and Vision Assistant (LLaVA) to predict sentiment labels across both image and text modalities. Our approach involves designing a structured prompt that incorporates both unimodal and multimodal labels to fine-tune LLaVA, enabling it to perform sentiment classification effectively. Experiments on the MVSA-Single dataset demonstrate that LLaVAC outperforms existing methods in multimodal sentiment analysis across three data processing procedures. The implementation of LLaVAC is publicly available at https://github.com/tchayintr/llavac.
中文: LLaVAC方法通过设计结构化提示对LLaVA模型进行微调,在MVSA-Single数据集上的多模态情感分析任务中表现优于现有方法。
English: LLaVAC is a method that fine-tunes the LLaVA model with structured prompts for multimodal sentiment analysis, achieving superior performance on the MVSA-Single dataset compared to existing approaches.

Authors:Yuan Tian, Wenqi Zhou, Michele Viscione, Hao Dong, David Kammer, Olga Fink
Title: Interactive Symbolic Regression through Offline Reinforcement Learning: A Co-Design Framework
Abstract:
Symbolic Regression (SR) holds great potential for uncovering underlying mathematical and physical relationships from observed data. However, the vast combinatorial space of possible expressions poses significant challenges for both online search methods and pre-trained transformer models. Additionally, current state-of-the-art approaches typically do not consider the integration of domain experts' prior knowledge and do not support iterative interactions with the model during the equation discovery process. To address these challenges, we propose the Symbolic Q-network (Sym-Q), an advanced interactive framework for large-scale symbolic regression. Unlike previous large-scale transformer-based SR approaches, Sym-Q leverages reinforcement learning without relying on a transformer-based decoder. This formulation allows the agent to learn through offline reinforcement learning using any type of tree encoder, enabling more efficient training and inference. Furthermore, we propose a co-design mechanism, where the reinforcement learning-based Sym-Q facilitates effective interaction with domain experts at any stage of the equation discovery process. Users can dynamically modify generated nodes of the expression, collaborating with the agent to tailor the mathematical expression to best fit the problem and align with the assumed physical laws, particularly when there is prior partial knowledge of the expected behavior. Our experiments demonstrate that the pre-trained Sym-Q surpasses existing SR algorithms on the challenging SSDNC benchmark. Moreover, we experimentally show on real-world cases that its performance can be further enhanced by the interactive co-design mechanism, with Sym-Q achieving greater performance gains than other state-of-the-art models. Our reproducible code is available at https://github.com/EPFL-IMOS/Sym-Q.
中文摘要:符号Q网络(Sym-Q)是一种基于强化学习的交互式框架,通过离线强化学习和树编码器实现高效训练推理,并允许领域专家在方程发现过程中动态修改表达式节点,在基准测试和实际案例中均超越现有方法。
English Summary: The Symbolic Q-network (Sym-Q) is a reinforcement learning-based interactive framework that overcomes limitations of traditional symbolic regression methods by enabling efficient training, inference, and dynamic collaboration with domain experts to refine mathematical expressions.

Authors:Bradley P. Allen, Paul T. Groth
Title: A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs
Abstract:
Evaluating large language models (LLMs) for tasks like fact extraction in support of knowledge graph construction frequently involves computing accuracy metrics using a ground truth benchmark based on a knowledge graph (KG). These evaluations assume that errors represent factual disagreements. However, human discourse frequently features metalinguistic disagreement, where agents differ not on facts but on the meaning of the language used to express them. Given the complexity of natural language processing and generation using LLMs, we ask: do metalinguistic disagreements occur between LLMs and KGs? Based on an investigation using the T-REx knowledge alignment dataset, we hypothesize that metalinguistic disagreement does in fact occur between LLMs and KGs, with potential relevance for the practice of knowledge graph engineering. We propose a benchmark for evaluating the detection of factual and metalinguistic disagreements between LLMs and KGs. An initial proof of concept of such a benchmark is available on Github.
中文: 本研究探讨了大型语言模型与知识图谱之间是否存在元语言分歧,并基于T-REx数据集提出了一个检测事实性和元语言分歧的基准。
English: This study investigates whether metalinguistic disagreements occur between large language models and knowledge graphs, proposing a benchmark for detecting both factual and metalinguistic discrepancies based on the T-REx dataset.

Authors:Baoyao Yang, Junxiang Chen, Wanyun Li, Wenbin Yao, Yang Zhou
Title: Expertized Caption Auto-Enhancement for Video-Text Retrieval
Abstract:
Video-text retrieval has been stuck in the information mismatch caused by personalized and inadequate textual descriptions of videos. The substantial information gap between the two modalities hinders an effective cross-modal representation alignment, resulting in ambiguous retrieval results. Although text rewriting methods have been proposed to broaden text expressions, the modality gap remains significant, as the text representation space is hardly expanded with insufficient semantic enrichment.Instead, this paper turns to enhancing visual presentation, bridging video expression closer to textual representation via caption generation and thereby facilitating video-text matching.While multimodal large language models (mLLM) have shown a powerful capability to convert video content into text, carefully crafted prompts are essential to ensure the reasonableness and completeness of the generated captions. Therefore, this paper proposes an automatic caption enhancement method that improves expression quality and mitigates empiricism in augmented captions through self-learning.Additionally, an expertized caption selection mechanism is designed and introduced to customize augmented captions for each video, further exploring the utilization potential of caption augmentation.Our method is entirely data-driven, which not only dispenses with heavy data collection and computation workload but also improves self-adaptability by circumventing lexicon dependence and introducing personalized matching. The superiority of our method is validated by state-of-the-art results on various benchmarks, specifically achieving Top-1 recall accuracy of 68.5% on MSR-VTT, 68.1% on MSVD, and 62.0% on DiDeMo. Our code is publicly available at https://github.com/CaryXiang/ECA4VTR.
Chinese: 本文通过自动字幕增强方法和专业化字幕选择机制改进视觉表达以弥合视频与文本间的信息鸿沟,无需大量数据收集即在多个基准测试中取得领先性能。
English: This paper addresses the video-text retrieval mismatch by enhancing visual presentation through an automatic caption enhancement method and expertized caption selection, achieving state-of-the-art results on benchmarks without heavy data collection.

Authors:Xiaofan Yu, Lanxiang Hu, Benjamin Reichman, Dylan Chu, Rushil Chandrupatla, Xiyuan Zhang, Larry Heck, Tajana Rosing
Title: SensorChat: Answering Qualitative and Quantitative Questions during Long-Term Multimodal Sensor Interactions
Abstract:
Natural language interaction with sensing systems is crucial for addressing users' personal concerns and providing health-related insights into their daily lives. When a user asks a question, the system automatically analyzes the full history of sensor data, extracts relevant information, and generates an appropriate response. However, existing systems are limited to short-duration (e.g., one minute) or low-frequency (e.g., daily step count) sensor data. In addition, they struggle with quantitative questions that require precise numerical answers. In this work, we introduce SensorChat, the first end-to-end QA system designed for daily life monitoring using long-duration, high-frequency time series data. Given raw sensor signals spanning multiple days and a user-defined natural language question, SensorChat generates semantically meaningful responses that directly address user concerns. SensorChat effectively handles both quantitative questions that require numerical precision and qualitative questions that require high-level reasoning to infer subjective insights. To achieve this, SensorChat uses an innovative three-stage pipeline including question decomposition, sensor data query, and answer assembly. The first and third stages leverage Large Language Models (LLMs) to interpret human queries and generate responses. The intermediate querying stage extracts relevant information from the complete sensor data history. Real-world implementations demonstrate SensorChat's capability for real-time interactions on a cloud server while also being able to run entirely on edge platforms after quantization. Comprehensive QA evaluations show that SensorChat achieves 93% higher answer accuracy than the best performing state-of-the-art systems on quantitative questions. Furthermore, a user study with eight volunteers highlights SensorChat's effectiveness in answering qualitative questions.
中文: SensorChat是首个端到端问答系统,通过利用大型语言模型的三阶段流程处理长期高频传感器数据,既能精确回答定量问题,又能推理定性问题的主观洞察。
English: SensorChat is the first end-to-end QA system that processes long-duration, high-frequency sensor data to generate precise numerical answers for quantitative questions and infer subjective insights for qualitative questions through a three-stage pipeline leveraging LLMs.

Authors:Jiaqing Zhang, Mingjia Yin, Hao Wang, Yawen Li, Yuyang Ye, Xingyu Lou, Junping Du, Enhong Chen
Title: TD3: Tucker Decomposition Based Dataset Distillation Method for Sequential Recommendation
Abstract:
In the era of data-centric AI, the focus of recommender systems has shifted from model-centric innovations to data-centric approaches. The success of modern AI models is built on large-scale datasets, but this also results in significant training costs. Dataset distillation has emerged as a key solution, condensing large datasets to accelerate model training while preserving model performance. However, condensing discrete and sequentially correlated user-item interactions, particularly with extensive item sets, presents considerable challenges. This paper introduces \textbf{TD3}, a novel \textbf{T}ucker \textbf{D}ecomposition based \textbf{D}ataset \textbf{D}istillation method within a meta-learning framework, designed for sequential recommendation. TD3 distills a fully expressive \emph{synthetic sequence summary} from original data. To efficiently reduce computational complexity and extract refined latent patterns, Tucker decomposition decouples the summary into four factors: \emph{synthetic user latent factor}, \emph{temporal dynamics latent factor}, \emph{shared item latent factor}, and a \emph{relation core} that models their interconnections. Additionally, a surrogate objective in bi-level optimization is proposed to align feature spaces extracted from models trained on both original data and synthetic sequence summary beyond the naïve performance matching approach. In the \emph{inner-loop}, an augmentation technique allows the learner to closely fit the synthetic summary, ensuring an accurate update of it in the \emph{outer-loop}. To accelerate the optimization process and address long dependencies, RaT-BPTT is employed for bi-level optimization. Experiments and analyses on multiple public datasets have confirmed the superiority and cross-architecture generalizability of the proposed designs. Codes are released at https://github.com/USTC-StarTeam/TD3.
中文摘要:本文提出TD3,一种基于Tucker分解的数据集蒸馏方法,通过元学习和双层优化将用户-物品交互高效压缩为合成序列摘要,在保持模型性能的同时显著提升训练效率。
English Summary: This paper introduces TD3, a Tucker decomposition-based dataset distillation method for sequential recommendation that efficiently condenses user-item interactions into synthetic sequence summaries while preserving model performance through meta-learning and bi-level optimization.

Authors:Sunwoo Lee, Jaebak Hwang, Yonghyeon Jo, Seungyul Han
Title: Wolfpack Adversarial Attack for Robust Multi-Agent Reinforcement Learning
Abstract:
Traditional robust methods in multi-agent reinforcement learning (MARL) often struggle against coordinated adversarial attacks in cooperative scenarios. To address this limitation, we propose the Wolfpack Adversarial Attack framework, inspired by wolf hunting strategies, which targets an initial agent and its assisting agents to disrupt cooperation. Additionally, we introduce the Wolfpack-Adversarial Learning for MARL (WALL) framework, which trains robust MARL policies to defend against the proposed Wolfpack attack by fostering systemwide collaboration. Experimental results underscore the devastating impact of the Wolfpack attack and the significant robustness improvements achieved by WALL. Our code is available at https://github.com/sunwoolee0504/WALL.
中文: Wolfpack对抗攻击框架通过针对关键智能体破坏多智能体协作,而WALL框架则通过强化系统协作来训练具备抵御此类攻击的鲁棒策略。
English: The Wolfpack Adversarial Attack framework disrupts multi-agent cooperation by targeting key agents, while the WALL framework trains robust policies to defend against such attacks through enhanced collaboration.

Authors:Jeongmo Kim, Yisak Park, Minung Kim, Seungyul Han
Title: Task-Aware Virtual Training: Enhancing Generalization in Meta-Reinforcement Learning for Out-of-Distribution Tasks
Abstract:
Meta reinforcement learning aims to develop policies that generalize to unseen tasks sampled from a task distribution. While context-based meta-RL methods improve task representation using task latents, they often struggle with out-of-distribution (OOD) tasks. To address this, we propose Task-Aware Virtual Training (TAVT), a novel algorithm that accurately captures task characteristics for both training and OOD scenarios using metric-based representation learning. Our method successfully preserves task characteristics in virtual tasks and employs a state regularization technique to mitigate overestimation errors in state-varying environments. Numerical results demonstrate that TAVT significantly enhances generalization to OOD tasks across various MuJoCo and MetaWorld environments. Our code is available at https://github.com/JM-Kim-94/tavt.git.
Chinese: 提出的任务感知虚拟训练(TAVT)算法通过基于度量的表示学习和状态正则化技术,精确捕捉任务特征,显著提升了在多种环境中对分布外任务的泛化能力。
English: The proposed Task-Aware Virtual Training (TAVT) algorithm enhances meta reinforcement learning by accurately capturing task characteristics through metric-based representation learning and state regularization, significantly improving generalization to out-of-distribution tasks in various environments.

Authors:Sunny Sanyal, Hayden Prairie, Rudrajit Das, Ali Kavis, Sujay Sanghavi
Title: Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting
Abstract:
Fine-tuning a pre-trained model on a downstream task often degrades its original capabilities, a phenomenon known as "catastrophic forgetting". This is especially an issue when one does not have access to the data and recipe used to develop the pre-trained model. Under this constraint, most existing methods for mitigating forgetting are inapplicable. To address this challenge, we propose a sample weighting scheme for the fine-tuning data solely based on the pre-trained model's losses. Specifically, we upweight the easy samples on which the pre-trained model's loss is low and vice versa to limit the drift from the pre-trained model. Our approach is orthogonal and yet complementary to existing methods; while such methods mostly operate on parameter or gradient space, we concentrate on the sample space. We theoretically analyze the impact of fine-tuning with our method in a linear setting, showing that it stalls learning in a certain subspace which inhibits overfitting to the target task. We empirically demonstrate the efficacy of our method on both language and vision tasks. As an example, when fine-tuning Gemma 2 2B on MetaMathQA, our method results in only a $0.8\%$ drop in accuracy on GSM8K (another math dataset) compared to standard fine-tuning, while preserving $5.4\%$ more accuracy on the pre-training datasets. Our code is publicly available at https://github.com/sanyalsunny111/FLOW_finetuning .
中文: 本文提出一种基于预训练模型损失的样本加权方法,通过侧重易学样本限制模型偏离,在适应新任务的同时有效缓解灾难性遗忘问题。
English: This paper introduces a sample weighting method that mitigates catastrophic forgetting during fine-tuning by prioritizing easy samples based on the pre-trained model's loss, preserving original capabilities while adapting to new tasks.

Authors:Calvin Yeung, Kenjiro Ide, Taiga Someya, Keisuke Fujii
Title: OpenSTARLab: Open Approach for Spatio-Temporal Agent Data Analysis in Soccer
Abstract:
Sports analytics has become both more professional and sophisticated, driven by the growing availability of detailed performance data. This progress enables applications such as match outcome prediction, player scouting, and tactical analysis. In soccer, the effective utilization of event and tracking data is fundamental for capturing and analyzing the dynamics of the game. However, there are two primary challenges: the limited availability of event data, primarily restricted to top-tier teams and leagues, and the scarcity and high cost of tracking data, which complicates its integration with event data for comprehensive analysis. Here we propose OpenSTARLab, an open-source framework designed to democratize spatio-temporal agent data analysis in sports by addressing these key challenges. OpenSTARLab includes the Pre-processing Package that standardizes event and tracking data through Unified and Integrated Event Data and State-Action-Reward formats, the Event Modeling Package that implements deep learning-based event prediction, alongside the RLearn Package for reinforcement learning tasks. These technical components facilitate the handling of diverse data sources and support advanced analytical tasks, thereby enhancing the overall functionality and usability of the framework. To assess OpenSTARLab's effectiveness, we conducted several experimental evaluations. These demonstrate the superior performance of the specific event prediction model in terms of action and time prediction accuracies and maintained its robust event simulation performance. Furthermore, reinforcement learning experiments reveal a trade-off between action accuracy and temporal difference loss and show comprehensive visualization. Overall, OpenSTARLab serves as a robust platform for researchers and practitioners, enhancing innovation and collaboration in the field of soccer data analytics.
中文: 体育分析在数据驱动应用中不断进步,但面临数据可获取性和整合的挑战;OpenSTARLab作为开源框架,通过标准化和解析时空数据,凭借强大的事件预测和强化学习功能,提升了足球数据分析的效能。
English: Sports analytics is advancing with data-driven applications, yet faces challenges in data accessibility and integration, which OpenSTARLab addresses as an open-source framework to standardize and analyze spatio-temporal data, enhancing soccer analytics through robust event prediction and reinforcement learning capabilities.

Authors:Obed Korshie Dzikunu, Shadab Ahamed, Amirhossein Toosi, Xiaoxiao Li, Arman Rahmim
Title: Adaptive Voxel-Weighted Loss Using L1 Norms in Deep Neural Networks for Detection and Segmentation of Prostate Cancer Lesions in PET/CT Images
Abstract:
This study proposes a new loss function for deep neural networks, L1-weighted Dice Focal Loss (L1DFL), that leverages L1 norms for adaptive weighting of voxels based on their classification difficulty, towards automated detection and segmentation of metastatic prostate cancer lesions in PET/CT scans. We obtained 380 PSMA [18-F] DCFPyL PET/CT scans of patients diagnosed with biochemical recurrence metastatic prostate cancer. We trained two 3D convolutional neural networks, Attention U-Net and SegResNet, and concatenated the PET and CT volumes channel-wise as input. The performance of our custom loss function was evaluated against the Dice and Dice Focal Loss functions. For clinical significance, we considered a detected region of interest (ROI) as a true positive if at least the voxel with the maximum standardized uptake value falls within the ROI. We assessed the models' performance based on the number of lesions in an image, tumour volume, activity, and extent of spread. The L1DFL outperformed the comparative loss functions by at least 13% on the test set. In addition, the F1 scores of the Dice Loss and the Dice Focal Loss were lower than that of L1DFL by at least 6% and 34%, respectively. The Dice Focal Loss yielded more false positives, whereas the Dice Loss was more sensitive to smaller volumes and struggled to segment larger lesions accurately. They also exhibited network-specific variations and yielded declines in segmentation accuracy with increased tumour spread. Our results demonstrate the potential of L1DFL to yield robust segmentation of metastatic prostate cancer lesions in PSMA PET/CT images. The results further highlight potential complexities arising from the variations in lesion characteristics that may influence automated prostate cancer tumour detection and segmentation. The code is publicly available at: https://github.com/ObedDzik/pca_segment.git.
本研究提出L1DFL新型损失函数,通过基于分类难度自适应加权体素来改进PET/CT扫描中转移性前列腺癌的自动分割,在测试集上相比现有方法性能提升至少13%。
This study introduces L1DFL, a novel loss function that enhances automated segmentation of metastatic prostate cancer in PET/CT scans by adaptively weighting voxels based on classification difficulty, achieving superior performance over existing methods with at least 13% improvement on test sets.

Authors:Hongwei Li, Yuheng Tang, Shiqi Wang, Wenbo Guo
Title: PatchPilot: A Cost-Efficient Software Engineering Agent with Early Attempts on Formal Verification
Abstract:
Recent research builds various patching agents that combine large language models (LLMs) with non-ML tools and achieve promising results on the state-of-the-art (SOTA) software patching benchmark, SWE-bench. Based on how to determine the patching workflows, existing patching agents can be categorized as agent-based planning methods, which rely on LLMs for planning, and rule-based planning methods, which follow a pre-defined workflow. At a high level, agent-based planning methods achieve high patching performance but with a high cost and limited stability. Rule-based planning methods, on the other hand, are more stable and efficient but have key workflow limitations that compromise their patching performance. In this paper, we propose PatchPilot, an agentic patcher that strikes a balance between patching efficacy, stability, and cost-efficiency. PatchPilot proposes a novel rule-based planning workflow with five components: reproduction, localization, generation, validation, and refinement (where refinement is unique to PatchPilot). We introduce novel and customized designs to each component to optimize their effectiveness and efficiency. Through extensive experiments on the SWE-bench benchmarks, PatchPilot shows a superior performance than existing open-source methods while maintaining low cost (less than 1$ per instance) and ensuring higher stability. We also conduct a detailed ablation study to validate the key designs in each component. Our code is available at https://github.com/ucsb-mlsec/PatchPilot.
Chinese: 近期研究将大语言模型与非机器学习工具结合,开发出基于代理和基于规则的软件补丁规划方法,而提出的PatchPilot在SWE-bench基准测试中实现了效能、稳定性和成本效益的卓越平衡。
English: Recent research combines large language models with non-ML tools to create software patching agents, categorized into agent-based and rule-based planning methods, with the proposed PatchPilot achieving a superior balance of efficacy, stability, and cost-efficiency on the SWE-bench benchmark.

Authors:Adibvafa Fallahpour, Jun Ma, Alif Munim, Hongwei Lyu, Bo Wang
Title: MedRAX: Medical Reasoning Agent for Chest X-ray
Abstract:
Chest X-rays (CXRs) play an integral role in driving critical decisions in disease management and patient care. While recent innovations have led to specialized models for various CXR interpretation tasks, these solutions often operate in isolation, limiting their practical utility in clinical practice. We present MedRAX, the first versatile AI agent that seamlessly integrates state-of-the-art CXR analysis tools and multimodal large language models into a unified framework. MedRAX dynamically leverages these models to address complex medical queries without requiring additional training. To rigorously evaluate its capabilities, we introduce ChestAgentBench, a comprehensive benchmark containing 2,500 complex medical queries across 7 diverse categories. Our experiments demonstrate that MedRAX achieves state-of-the-art performance compared to both open-source and proprietary models, representing a significant step toward the practical deployment of automated CXR interpretation systems. Data and code have been publicly available at https://github.com/bowang-lab/MedRAX
Chinese: MedRAX是首个将先进胸片分析工具与多模态大语言模型整合为一体的多功能AI代理,无需额外训练即可在复杂医疗查询中实现最优性能。
English: MedRAX is the first versatile AI agent that integrates advanced chest X-ray analysis tools and multimodal large language models into a unified framework, achieving state-of-the-art performance in complex medical queries without requiring additional training.

Authors:Mayuka Jayawardhana, Renbo, Samuel Dooley, Valeriia Cherepanova, Andrew Gordon Wilson, Frank Hutter, Colin White, Tom Goldstein, Micah Goldblum
Title: Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes
Abstract:
Large language models (LLMs) perform remarkably well on tabular datasets in zero- and few-shot settings, since they can extract meaning from natural language column headers that describe features and labels. Similarly, TabPFN, a recent non-LLM transformer pretrained on numerous tables for in-context learning, has demonstrated excellent performance for dataset sizes up to a thousand samples. In contrast, gradient-boosted decision trees (GBDTs) are typically trained from scratch on each dataset without benefiting from pretraining data and must learn the relationships between columns from their entries alone since they lack natural language understanding. LLMs and TabPFN excel on small tabular datasets where a strong prior is essential, yet they are not competitive with GBDTs on medium or large datasets, since their context lengths are limited. In this paper, we propose a simple and lightweight approach for fusing large language models and TabPFN with gradient-boosted decision trees, which allows scalable GBDTs to benefit from the natural language capabilities and pretraining of transformers. We name our fusion methods LLM-Boost and PFN-Boost, respectively. While matching or surpassing the performance of the transformer at sufficiently small dataset sizes and GBDTs at sufficiently large sizes, LLM-Boost and PFN-Boost outperform both standalone components on a wide range of dataset sizes in between. We demonstrate state-of-the-art performance against numerous baselines and ensembling algorithms. We find that PFN-Boost achieves the best average performance among all methods we test for all but very small dataset sizes. We release our code at http://github.com/MayukaJ/LLM-Boost .
中文: 大语言模型和TabPFN在小规模表格数据上表现出色,而梯度提升决策树在大规模数据上更优,因此作者提出LLM-Boost和PFN-Boost融合方法,结合两者优势在不同规模数据集上均实现最优性能。
English: Large language models and TabPFN excel on small tabular datasets but are outperformed by gradient-boosted decision trees on larger ones, so the authors propose LLM-Boost and PFN-Boost fusion methods that combine their strengths to achieve state-of-the-art performance across various dataset sizes.

Authors:Yan Li, Tianyi Zhang, Zechuan Li, Soyeon Caren Han
Title: A Training-Free Length Extrapolation Approach for LLMs: Greedy Attention Logit Interpolation (GALI)
Abstract:
Transformer-based Large Language Models (LLMs) struggle with inputs exceeding their training context window due to positional out-of-distribution (O.O.D.) issues that disrupt attention. Existing solutions, including fine-tuning and training-free methods, face challenges like inefficiency, redundant interpolation, logit outliers, or loss of local positional information. We propose Greedy Attention Logit Interpolation (GALI), a training-free method that improves length extrapolation by greedily reusing pretrained positional intervals and interpolating attention logit to eliminate outliers. GALI achieves stable and superior performance across a wide range of long-context tasks without requiring input-length-specific tuning. Our analysis further reveals that LLMs interpret positional intervals unevenly and that restricting interpolation to narrower ranges improves performance, even on short-context tasks. GALI represents a step toward more robust and generalizable long-text processing in LLMs. Our implementation of GALI, along with the experiments from our paper, is open-sourced at https://github.com/adlnlp/Gali.
中文: GALI是一种无需训练的方法,通过重用位置区间和插值注意力对数来提升大语言模型的长度外推能力,无需特定调优即可在长文本任务中实现稳定优越的性能。
English: GALI is a training-free method that enhances length extrapolation in LLMs by reusing positional intervals and interpolating attention logits, achieving stable performance across long-context tasks without specific tuning.

Authors:Yu-An Huang, Yao Hu, Yue-Chao Li, Xiyue Cao, Xinyuan Li, Kay Chen Tan, Zhu-Hong You, Zhi-An Huang
Title: scBIT: Integrating Single-cell Transcriptomic Data into fMRI-based Prediction for Alzheimer's Disease Diagnosis
Abstract:
Functional MRI (fMRI) and single-cell transcriptomics are pivotal in Alzheimer's disease (AD) research, each providing unique insights into neural function and molecular mechanisms. However, integrating these complementary modalities remains largely unexplored. Here, we introduce scBIT, a novel method for enhancing AD prediction by combining fMRI with single-nucleus RNA (snRNA). scBIT leverages snRNA as an auxiliary modality, significantly improving fMRI-based prediction models and providing comprehensive interpretability. It employs a sampling strategy to segment snRNA data into cell-type-specific gene networks and utilizes a self-explainable graph neural network to extract critical subgraphs. Additionally, we use demographic and genetic similarities to pair snRNA and fMRI data across individuals, enabling robust cross-modal learning. Extensive experiments validate scBIT's effectiveness in revealing intricate brain region-gene associations and enhancing diagnostic prediction accuracy. By advancing brain imaging transcriptomics to the single-cell level, scBIT sheds new light on biomarker discovery in AD research. Experimental results show that incorporating snRNA data into the scBIT model significantly boosts accuracy, improving binary classification by 3.39% and five-class classification by 26.59%. The codes were implemented in Python and have been released on GitHub (https://github.com/77YQ77/scBIT) and Zenodo (https://zenodo.org/records/11599030) with detailed instructions.
中文: scBIT方法通过整合功能性磁共振成像与单核RNA数据,显著提高了阿尔茨海默病的预测准确性,并为大脑区域与基因关联提供了可解释的洞见。
English: The scBIT method integrates fMRI with single-nucleus RNA data to significantly enhance Alzheimer's disease prediction accuracy and provide interpretable insights into brain region-gene associations.

Authors:Yu-An Huang, Yue-Chao Li, Hai-Ru You, Jie Pan, Xiyue Cao, Xinyuan Li, Zhi-An Huang, Zhu-Hong You
Title: Graph Structure Learning for Tumor Microenvironment with Cell Type Annotation from non-spatial scRNA-seq data
Abstract:
The exploration of cellular heterogeneity within the tumor microenvironment (TME) via single-cell RNA sequencing (scRNA-seq) is essential for understanding cancer progression and response to therapy. Current scRNA-seq approaches, however, lack spatial context and rely on incomplete datasets of ligand-receptor interactions (LRIs), limiting accurate cell type annotation and cell-cell communication (CCC) inference. This study addresses these challenges using a novel graph neural network (GNN) model that enhances cell type prediction and cell interaction analysis. Our study utilized a dataset consisting of 49,020 cells from 19 patients across three cancer types: Leukemia, Breast Invasive Carcinoma, and Colorectal Cancer. The proposed scGSL model demonstrated robust performance, achieving an average accuracy of 84.83%, precision of 86.23%, recall of 81.51%, and an F1 score of 80.92% across all datasets. These metrics represent a significant enhancement over existing methods, which typically exhibit lower performance metrics. Additionally, by reviewing existing literature on gene interactions within the TME, the scGSL model proves to robustly identify biologically meaningful gene interactions in an unsupervised manner, validated by significant expression differences in key gene pairs across various cancers. The source code and data used in this paper can be found in https://github.com/LiYuechao1998/scGSL.
中文: 本研究提出的scGSL模型采用图神经网络,通过单细胞RNA测序数据改进了肿瘤微环境中的细胞类型注释和细胞间通讯分析,在多种癌症类型中实现了高准确率和稳健性能。
English: This study introduces the scGSL model, a graph neural network that improves cell type annotation and cell-cell communication analysis in the tumor microenvironment using single-cell RNA sequencing data, achieving high accuracy and robust performance across multiple cancer types.

Authors:Philipp Hoellmer, Thomas Egg, Maya M. Martirossyan, Eric Fuemmeler, Zeren Shui, Amit Gupta, Pawan Prakash, Adrian Roitberg, Mingjie Liu, George Karypis, Mark Transtrum, Richard G. Hennig, Ellad B. Tadmor, Stefano Martiniani
Title: Open Materials Generation with Stochastic Interpolants
Abstract:
The discovery of new materials is essential for enabling technological advancements. Computational approaches for predicting novel materials must effectively learn the manifold of stable crystal structures within an infinite design space. We introduce Open Materials Generation (OMatG), a unifying framework for the generative design and discovery of inorganic crystalline materials. OMatG employs stochastic interpolants (SI) to bridge an arbitrary base distribution to the target distribution of inorganic crystals via a broad class of tunable stochastic processes, encompassing both diffusion models and flow matching as special cases. In this work, we adapt the SI framework by integrating an equivariant graph representation of crystal structures and extending it to account for periodic boundary conditions in unit cell representations. Additionally, we couple the SI flow over spatial coordinates and lattice vectors with discrete flow matching for atomic species. We benchmark OMatG's performance on two tasks: Crystal Structure Prediction (CSP) for specified compositions, and 'de novo' generation (DNG) aimed at discovering stable, novel, and unique structures. In our ground-up implementation of OMatG, we refine and extend both CSP and DNG metrics compared to previous works. OMatG establishes a new state of the art in generative modeling for materials discovery, outperforming purely flow-based and diffusion-based implementations. These results underscore the importance of designing flexible deep learning frameworks to accelerate progress in materials science. The OMatG code is available at https://github.com/FERMat-ML/OMatG.
中文摘要:OMatG是一种创新的生成框架,通过结合随机插值和等变图表示来推动材料发现,在预测和生成稳定无机晶体方面设立了新标杆。
English Summary: OMatG is a novel generative framework that combines stochastic interpolants with equivariant graph representations to advance materials discovery, setting a new benchmark in predicting and generating stable inorganic crystals.

Authors:Alex Flückiger, Chantal Amrhein, Tim Graf, Frédéric Odermatt, Martin Pömsl, Philippe Schläpfer, Florian Schottmann, Samuel Läubli
Title: A comparison of translation performance between DeepL and Supertext
Abstract:
As strong machine translation (MT) systems are increasingly based on large language models (LLMs), reliable quality benchmarking requires methods that capture their ability to leverage extended context. This study compares two commercial MT systems -- DeepL and Supertext -- by assessing their performance on unsegmented texts. We evaluate translation quality across four language directions with professional translators assessing segments with full document-level context. While segment-level assessments indicate no strong preference between the systems in most cases, document-level analysis reveals a preference for Supertext in three out of four language directions, suggesting superior consistency across longer texts. We advocate for more context-sensitive evaluation methodologies to ensure that MT quality assessments reflect real-world usability. We release all evaluation data and scripts for further analysis and reproduction at https://github.com/supertext/evaluation_deepl_supertext.
中文: 本研究通过上下文感知评估比较DeepL和Supertext机器翻译系统,发现Supertext在长文本中表现更稳定,并倡导采用更多文档级评估方法。
English: This study compares DeepL and Supertext machine translation systems using context-aware evaluation, revealing Supertext's superior consistency in longer texts and advocating for more document-level assessment methods.

Authors:Divya Bharti, Sriprabha Ramanarayanan, Sadhana S, Kishore Kumar M, Keerthi Ram, Harsh Agarwal, Ramesh Venkatesan, Mohanasankar Sivaprakasam
Title: AAD-DCE: An Aggregated Multimodal Attention Mechanism for Early and Late Dynamic Contrast Enhanced Prostate MRI Synthesis
Abstract:
Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DCE-MRI) is a medical imaging technique that plays a crucial role in the detailed visualization and identification of tissue perfusion in abnormal lesions and radiological suggestions for biopsy. However, DCE-MRI involves the administration of a Gadolinium based (Gad) contrast agent, which is associated with a risk of toxicity in the body. Previous deep learning approaches that synthesize DCE-MR images employ unimodal non-contrast or low-dose contrast MRI images lacking focus on the local perfusion information within the anatomy of interest. We propose AAD-DCE, a generative adversarial network (GAN) with an aggregated attention discriminator module consisting of global and local discriminators. The discriminators provide a spatial embedded attention map to drive the generator to synthesize early and late response DCE-MRI images. Our method employs multimodal inputs - T2 weighted (T2W), Apparent Diffusion Coefficient (ADC), and T1 pre-contrast for image synthesis. Extensive comparative and ablation studies on the ProstateX dataset show that our model (i) is agnostic to various generator benchmarks and (ii) outperforms other DCE-MRI synthesis approaches with improvement margins of +0.64 dB PSNR, +0.0518 SSIM, -0.015 MAE for early response and +0.1 dB PSNR, +0.0424 SSIM, -0.021 MAE for late response, and (ii) emphasize the importance of attention ensembling. Our code is available at https://github.com/bhartidivya/AAD-DCE.
中文摘要:AAD-DCE模型采用带聚合注意力判别器的生成对抗网络,通过多模态MRI输入合成DCE-MRI图像,在精度上超越现有方法并突显了注意力集成的重要性。
English Summary: The AAD-DCE model uses a GAN with aggregated attention discriminators to synthesize DCE-MRI images from multimodal MRI inputs, outperforming existing methods in accuracy and emphasizing attention ensembling's importance.

Authors:Jian Liu, Wei Sun, Hui Yang, Pengchao Deng, Chongpei Liu, Nicu Sebe, Hossein Rahmani, Ajmal Mian
Title: Diff9D: Diffusion-Based Domain-Generalized Category-Level 9-DoF Object Pose Estimation
Abstract:
Nine-degrees-of-freedom (9-DoF) object pose and size estimation is crucial for enabling augmented reality and robotic manipulation. Category-level methods have received extensive research attention due to their potential for generalization to intra-class unknown objects. However, these methods require manual collection and labeling of large-scale real-world training data. To address this problem, we introduce a diffusion-based paradigm for domain-generalized category-level 9-DoF object pose estimation. Our motivation is to leverage the latent generalization ability of the diffusion model to address the domain generalization challenge in object pose estimation. This entails training the model exclusively on rendered synthetic data to achieve generalization to real-world scenes. We propose an effective diffusion model to redefine 9-DoF object pose estimation from a generative perspective. Our model does not require any 3D shape priors during training or inference. By employing the Denoising Diffusion Implicit Model, we demonstrate that the reverse diffusion process can be executed in as few as 3 steps, achieving near real-time performance. Finally, we design a robotic grasping system comprising both hardware and software components. Through comprehensive experiments on two benchmark datasets and the real-world robotic system, we show that our method achieves state-of-the-art domain generalization performance. Our code will be made public at https://github.com/CNJianLiu/Diff9D.
中文摘要:本文提出一种基于扩散模型的九自由度物体姿态估计方法,仅使用合成数据训练即可通过三步去噪实现卓越的跨领域泛化性能,并在真实场景中达到领先水平。
English Summary: This paper introduces a diffusion-based method for category-level 9-DoF object pose estimation that trains solely on synthetic data yet achieves state-of-the-art generalization to real-world scenes through efficient 3-step denoising.

Authors:Antoni Kowalczuk, Jan Dubiński, Franziska Boenisch, Adam Dziedzic
Title: Privacy Attacks on Image AutoRegressive Models
Abstract:
Image AutoRegressive generation has emerged as a new powerful paradigm with image autoregressive models (IARs) matching state-of-the-art diffusion models (DMs) in image quality (FID: 1.48 vs. 1.58) while allowing for a higher generation speed. However, the privacy risks associated with IARs remain unexplored, raising concerns regarding their responsible deployment. To address this gap, we conduct a comprehensive privacy analysis of IARs, comparing their privacy risks to the ones of DMs as reference points. Concretely, we develop a novel membership inference attack (MIA) that achieves a remarkably high success rate in detecting training images (with a True Positive Rate at False Positive Rate = 1% of 86.38% vs. 6.38% for DMs with comparable attacks). We leverage our novel MIA to provide dataset inference (DI) for IARs, and show that it requires as few as 6 samples to detect dataset membership (compared to 200 for DI in DMs), confirming a higher information leakage in IARs. Finally, we are able to extract hundreds of training data points from an IAR (e.g., 698 from VAR-d30). Our results suggest a fundamental privacy-utility trade-off: while IARs excel in image generation quality and speed, they are empirically significantly more vulnerable to privacy attacks compared to DMs that achieve similar performance. We release the code at https://github.com/sprintml/privacy_attacks_against_iars for reproducibility.
中文: 图像自回归模型在图像质量和生成速度上媲美扩散模型,但隐私风险更高,新型成员推理攻击成功率显著且能提取大量训练数据,揭示了其严重的隐私泄露问题。
English: Image autoregressive models (IARs) match diffusion models in image quality with faster generation but are significantly more vulnerable to privacy attacks, as demonstrated by a novel membership inference attack achieving high success rates and data extraction.

Authors:Mengting Wei, Tuomas Varanka, Yante Li, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao
Title: Towards Consistent and Controllable Image Synthesis for Face Editing
Abstract:
Face editing methods, essential for tasks like virtual avatars, digital human synthesis and identity preservation, have traditionally been built upon GAN-based techniques, while recent focus has shifted to diffusion-based models due to their success in image reconstruction. However, diffusion models still face challenges in controlling specific attributes and preserving the consistency of other unchanged attributes especially the identity characteristics. To address these issues and facilitate more convenient editing of face images, we propose a novel approach that leverages the power of Stable-Diffusion (SD) models and crude 3D face models to control the lighting, facial expression and head pose of a portrait photo. We observe that this task essentially involves the combinations of target background, identity and face attributes aimed to edit. We strive to sufficiently disentangle the control of these factors to enable consistency of face editing. Specifically, our method, coined as RigFace, contains: 1) A Spatial Attribute Encoder that provides presise and decoupled conditions of background, pose, expression and lighting; 2) A high-consistency FaceFusion method that transfers identity features from the Identity Encoder to the denoising UNet of a pre-trained SD model; 3) An Attribute Rigger that injects those conditions into the denoising UNet. Our model achieves comparable or even superior performance in both identity preservation and photorealism compared to existing face editing models. Code is publicly available at https://github.com/weimengting/RigFace.
中文: 本文提出RigFace方法,结合稳定扩散模型和粗略三维人脸模型,在精确控制光照、表情和头部姿态的同时,有效保持身份特征一致性和图像真实感。
English: This paper introduces RigFace, a novel face editing approach that utilizes Stable-Diffusion models and 3D face models to precisely control lighting, expressions, and poses while maintaining identity consistency and photorealism.

Authors:Abdelrahman Abdallah, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, Adam Jatowt
Title: Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation
Abstract:
Retrieval, re-ranking, and retrieval-augmented generation (RAG) are critical components of modern applications in information retrieval, question answering, or knowledge-based text generation. However, existing solutions are often fragmented, lacking a unified framework that easily integrates these essential processes. The absence of a standardized implementation, coupled with the complexity of retrieval and re-ranking workflows, makes it challenging for researchers to compare and evaluate different approaches in a consistent environment. While existing toolkits such as Rerankers and RankLLM provide general-purpose reranking pipelines, they often lack the flexibility required for fine-grained experimentation and benchmarking. In response to these challenges, we introduce Rankify, a powerful and modular open-source toolkit designed to unify retrieval, re-ranking, and RAG within a cohesive framework. Rankify supports a wide range of retrieval techniques, including dense and sparse retrievers, while incorporating state-of-the-art re-ranking models to enhance retrieval quality. Additionally, Rankify includes a collection of pre-retrieved datasets to facilitate benchmarking, available at Huggingface (https://huggingface.co/datasets/abdoelsayed/reranking-datasets-light). To encourage adoption and ease of integration, we provide comprehensive documentation (http://rankify.readthedocs.io/), an open-source implementation on GitHub (https://github.com/DataScienceUIBK/rankify), and a PyPI package for easy installation (https://pypi.org/project/rankify/). As a unified and lightweight framework, Rankify allows researchers and practitioners to advance retrieval and re-ranking methodologies while ensuring consistency, scalability, and ease of use.
中文: Rankify是一个模块化的开源工具包,将检索、重排序和检索增强生成统一在一个集成框架中,为研究人员和从业者提供全面的工具和数据集,以促进一致的实验和基准测试。
English: Rankify is a modular open-source toolkit that unifies retrieval, re-ranking, and retrieval-augmented generation in a cohesive framework, offering comprehensive tools and datasets to facilitate consistent experimentation and benchmarking for researchers and practitioners.

Authors:Qianhao Yuan, Yanjiang Liu, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
Title: SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency
Abstract:
Multimodal Large Language Models (MLLMs) mainly fall into two architectures, each involving a trade-off between training and inference efficiency: embedding space alignment (e.g., LLaVA-1.5) is inefficient during inference, while cross-attention space alignment (e.g., Flamingo) is inefficient in training. In this paper, we compare these two architectures and identify the key factors for building efficient MLLMs. A primary difference between them lies in how attention is applied to visual tokens, particularly in their interactions with each other. To investigate whether attention among visual tokens is necessary, we propose a new self-attention mechanism, NAAViT (\textbf{N}o \textbf{A}ttention \textbf{A}mong \textbf{Vi}sual \textbf{T}okens), which eliminates this type of attention. Our pilot experiment on LLaVA-1.5 shows that attention among visual tokens is highly redundant. Based on these insights, we introduce SAISA (\textbf{S}elf-\textbf{A}ttention \textbf{I}nput \textbf{S}pace \textbf{A}lignment), a novel architecture that enhance both training and inference efficiency. SAISA directly aligns visual features with the input spaces of NAAViT self-attention blocks, reducing computational overhead in both self-attention blocks and feed-forward networks (FFNs). Using the same configuration as LLaVA-1.5, SAISA reduces inference FLOPs by 66\% and training budget by 26\%, while achieving superior performance in terms of accuracy. Comprehensive ablation studies further validate the effectiveness of SAISA across various LLMs and visual encoders. The code and model will be publicly available at https://github.com/icip-cas/SAISA.
中文: 本文提出SAISA这一新型多模态大语言模型架构,通过消除视觉标记间的注意力机制,在提升模型精度的同时显著降低了训练与推理的计算成本。
English: This paper introduces SAISA, a novel multimodal large language model architecture that eliminates attention among visual tokens to significantly enhance both training and inference efficiency while achieving superior accuracy compared to existing models.

Authors:Depen Morwani, Nikhil Vyas, Hanlin Zhang, Sham Kakade
Title: Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants
Abstract:
Recent advancements in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS and Lion which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient's weight. In this paper, we establish explicit connections between these two lines of work. We substantiate our theoretical findings with preliminary experiments on a 150m language modeling task. We find that AdEMAMix, which most closely resembles accelerated versions of stochastic gradient descent, exhibits superior performance. Building on these insights, we introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance as AdEMAMix across both large and small batch-size settings while eliminating the need for two different momentum terms. The code for Simplified-AdEMAMix is available on the repository: https://github.com/DepenM/Simplified-AdEMAMix/.
中文摘要:本文将新型深度学习优化器与随机梯度下降理论加速相联系,验证了AdEMAMix的优越性能,并提出简化版AdEMAMix,在保持性能的同时精简了动量项设计。
English Summary: This paper connects recent deep learning optimizers with theoretical SGD acceleration, demonstrating AdEMAMix's superior performance and introducing Simplified-AdEMAMix which maintains performance while simplifying momentum terms.

Authors:Chenhui Zhao, Yan Jiang, Todd C. Hollon
Title: Extending SEEDS to a Supervoxel Algorithm for Medical Image Analysis
Abstract:
In this work, we extend the SEEDS superpixel algorithm from 2D images to 3D volumes, resulting in 3D SEEDS, a faster, better, and open-source supervoxel algorithm for medical image analysis. We compare 3D SEEDS with the widely used supervoxel algorithm SLIC on 13 segmentation tasks across 10 organs. 3D SEEDS accelerates supervoxel generation by a factor of 10, improves the achievable Dice score by +6.5%, and reduces the under-segmentation error by -0.16%. The code is available at https://github.com/Zch0414/3d_seeds
本研究提出3D SEEDS算法,这一开源超体素方法在医学图像分析中比SLIC提速十倍,并将分割精度提升6.5%,同时代码已公开共享。
This study introduces 3D SEEDS, an enhanced open-source supervoxel algorithm that accelerates processing tenfold and improves segmentation accuracy by 6.5% compared to SLIC across multiple medical imaging tasks.

Authors:Ibrahim Bouabdallaoui, Fatima Guerouate, Samya Bouhaddour, Chaimae Saadi, Mohammed Sbihi
Title: FewTopNER: Integrating Few-Shot Learning with Topic Modeling and Named Entity Recognition in a Multilingual Framework
Abstract:
We introduce FewTopNER, a novel framework that integrates few-shot named entity recognition (NER) with topic-aware contextual modeling to address the challenges of cross-lingual and low-resource scenarios. FewTopNER leverages a shared multilingual encoder based on XLM-RoBERTa, augmented with language-specific calibration mechanisms, to generate robust contextual embeddings. The architecture comprises a prototype-based entity recognition branch, employing BiLSTM and Conditional Random Fields for sequence labeling, and a topic modeling branch that extracts document-level semantic features through hybrid probabilistic and neural methods. A cross-task bridge facilitates dynamic bidirectional attention and feature fusion between entity and topic representations, thereby enhancing entity disambiguation by incorporating global semantic context. Empirical evaluations on multilingual benchmarks across English, French, Spanish, German, and Italian demonstrate that FewTopNER significantly outperforms existing state-of-the-art few-shot NER models. In particular, the framework achieves improvements of 2.5-4.0 percentage points in F1 score and exhibits enhanced topic coherence, as measured by normalized pointwise mutual information. Ablation studies further confirm the critical contributions of the shared encoder and cross-task integration mechanisms to the overall performance. These results underscore the efficacy of incorporating topic-aware context into few-shot NER and highlight the potential of FewTopNER for robust cross-lingual applications in low-resource settings.
中文摘要:FewTopNER是一种创新框架,通过融合主题感知上下文建模和跨任务特征交互,显著提升了低资源场景下多语言小样本命名实体识别的性能表现。
English Summary: FewTopNER is a novel framework that enhances few-shot named entity recognition by integrating topic-aware contextual modeling and cross-task feature fusion, achieving superior performance across multiple languages in low-resource scenarios.

Authors:Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, Jun Zhu
Title: STAIR: Improving Safety Alignment with Introspective Reasoning
Abstract:
Ensuring the safety and harmlessness of Large Language Models (LLMs) has become equally critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and the susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Itrospective Reasoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks. Relevant resources in this work are available at https://github.com/thu-ml/STAIR.
中文摘要:STAIR是一种新颖框架,通过结合自省推理和迭代优化来增强大语言模型的安全性,相比传统对齐方法,在减少有害输出的同时更好地保持了实用性。
English Summary: STAIR is a novel framework that enhances LLM safety by integrating introspective reasoning and iterative optimization, effectively reducing harmful outputs while maintaining helpfulness compared to traditional alignment methods.

Authors:Alexander Kolesov, Manukhov Stepan, Vladimir V. Palyulin, Alexander Korotin
Title: Field Matching: an Electrostatic Paradigm to Generate and Transfer Data
Abstract:
We propose Electrostatic Field Matching (EFM), a novel method that is suitable for both generative modeling and distribution transfer tasks. Our approach is inspired by the physics of an electrical capacitor. We place source and target distributions on the capacitor plates and assign them positive and negative charges, respectively. Then we learn the electrostatic field of the capacitor using a neural network approximator. To map the distributions to each other, we start at one plate of the capacitor and move the samples along the learned electrostatic field lines until they reach the other plate. We theoretically justify that this approach provably yields the distribution transfer. In practice, we demonstrate the performance of our EFM in toy and image data experiments. Our code is available at https://github.com/justkolesov/FieldMatching
中文摘要:静电匹配场(EFM)是一种受电容器物理启发的新型分布转移方法,通过神经网络学习静电场,使样本沿电场线移动,从而可靠地实现分布间的相互映射。
English Summary: Electrostatic Field Matching (EFM) is a novel distribution transfer method inspired by capacitor physics, using neural networks to learn electrostatic fields that provably map source and target distributions by moving samples along field lines.

Authors:Josua Faller, Jörg Martin
Title: Optimal Subspace Inference for the Laplace Approximation of Bayesian Neural Networks
Abstract:
Subspace inference for neural networks assumes that a subspace of their parameter space suffices to produce a reliable uncertainty quantification. In this work, we mathematically derive the optimal subspace model to a Bayesian inference scenario based on the Laplace approximation. We demonstrate empirically that, in the optimal case, often a fraction of parameters less than 1% is sufficient to obtain a reliable estimate of the full Laplace approximation. Since the optimal solution is derived, we can evaluate all other subspace models against a baseline. In addition, we give an approximation of our method that is applicable to larger problem settings, in which the optimal solution is not computable, and compare it to existing subspace models from the literature. In general, our approximation scheme outperforms previous work. Furthermore, we present a metric to qualitatively compare different subspace models even if the exact Laplace approximation is unknown.
中文: 本研究通过拉普拉斯近似数学推导出贝叶斯神经网络推理的最优子空间模型,证明仅需不到1%参数即可可靠估计全模型不确定性,且性能优于现有方法。
English: This study mathematically derives the optimal subspace model for Bayesian neural network inference using Laplace approximation, demonstrating that under 1% of parameters can reliably estimate full-model uncertainty while outperforming existing methods.

Authors:Shangwei Guo, Hao Shi, Song Wang, Xiaoting Yin, Kailun Yang, Kaiwei Wang
Title: Event-aided Semantic Scene Completion
Abstract:
Autonomous driving systems rely on robust 3D scene understanding. Recent advances in Semantic Scene Completion (SSC) for autonomous driving underscore the limitations of RGB-based approaches, which struggle under motion blur, poor lighting, and adverse weather. Event cameras, offering high dynamic range and low latency, address these challenges by providing asynchronous data that complements RGB inputs. We present DSEC-SSC, the first real-world benchmark specifically designed for event-aided SSC, which includes a novel 4D labeling pipeline for generating dense, visibility-aware labels that adapt dynamically to object motion. Our proposed RGB-Event fusion framework, EvSSC, introduces an Event-aided Lifting Module (ELM) that effectively bridges 2D RGB-Event features to 3D space, enhancing view transformation and the robustness of 3D volume construction across SSC models. Extensive experiments on DSEC-SSC and simulated SemanticKITTI-E demonstrate that EvSSC is adaptable to both transformer-based and LSS-based SSC architectures. Notably, evaluations on SemanticKITTI-C demonstrate that EvSSC achieves consistently improved prediction accuracy across five degradation modes and both In-domain and Out-of-domain settings, achieving up to a 52.5% relative improvement in mIoU when the image sensor partially fails. Additionally, we quantitatively and qualitatively validate the superiority of EvSSC under motion blur and extreme weather conditions, where autonomous driving is challenged. The established datasets and our codebase will be made publicly at https://github.com/Pandapan01/EvSSC.
中文摘要:DSEC-SSC基准与EvSSC框架通过融合事件相机数据与RGB输入,显著提升了自动驾驶系统在传感器故障、运动模糊及恶劣天气下的三维场景理解鲁棒性。
English Summary: The DSEC-SSC benchmark and EvSSC framework enhance autonomous driving's 3D scene understanding by fusing event camera data with RGB inputs, significantly improving robustness against sensor degradation, motion blur, and adverse weather conditions.

Authors:Hsin-Cheng Lu, Chung-Yi Lin, Winston H. Hsu
Title: Improving Generalization Ability for 3D Object Detection by Learning Sparsity-invariant Features
Abstract:
In autonomous driving, 3D object detection is essential for accurately identifying and tracking objects. Despite the continuous development of various technologies for this task, a significant drawback is observed in most of them-they experience substantial performance degradation when detecting objects in unseen domains. In this paper, we propose a method to improve the generalization ability for 3D object detection on a single domain. We primarily focus on generalizing from a single source domain to target domains with distinct sensor configurations and scene distributions. To learn sparsity-invariant features from a single source domain, we selectively subsample the source data to a specific beam, using confidence scores determined by the current detector to identify the density that holds utmost importance for the detector. Subsequently, we employ the teacher-student framework to align the Bird's Eye View (BEV) features for different point clouds densities. We also utilize feature content alignment (FCA) and graph-based embedding relationship alignment (GERA) to instruct the detector to be domain-agnostic. Extensive experiments demonstrate that our method exhibits superior generalization capabilities compared to other baselines. Furthermore, our approach even outperforms certain domain adaptation methods that can access to the target domain data.
Chinese Summary: 本文提出了一种方法,通过从单一源域学习稀疏不变特征,并采用师生框架和特征对齐技术,提升自动驾驶中3D物体检测在未见领域的泛化能力。
English Summary: The paper introduces a method to enhance the generalization of 3D object detection in autonomous driving by learning sparsity-invariant features from a single source domain and employing techniques like teacher-student framework and feature alignment to improve performance across unseen domains.

Authors:Jiawei Qin, Xucong Zhang, Yusuke Sugano
Title: UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training
Abstract:
Despite decades of research on data collection and model architectures, current gaze estimation models encounter significant challenges in generalizing across diverse data domains. Recent advances in self-supervised pre-training have shown remarkable performances in generalization across various vision tasks. However, their effectiveness in gaze estimation remains unexplored. We propose UniGaze, for the first time, leveraging large-scale in-the-wild facial datasets for gaze estimation through self-supervised pre-training. Through systematic investigation, we clarify critical factors that are essential for effective pretraining in gaze estimation. Our experiments reveal that self-supervised approaches designed for semantic tasks fail when applied to gaze estimation, while our carefully designed pre-training pipeline consistently improves cross-domain performance. Through comprehensive experiments of challenging cross-dataset evaluation and novel protocols including leave-one-dataset-out and joint-dataset settings, we demonstrate that UniGaze significantly improves generalization across multiple data domains while minimizing reliance on costly labeled data. source code and model are available at https://github.com/ut-vision/UniGaze.
中文: UniGaze首次通过自监督预训练利用大规模野外面部数据集进行视线估计,显著提升了跨领域泛化能力并降低了对标注数据的依赖。
English: UniGaze introduces a novel self-supervised pre-training approach using large-scale facial datasets to significantly enhance gaze estimation's cross-domain generalization while reducing dependence on labeled data.

Authors:Tao Zhang, Jinyong Wen, Zhen Chen, Kun Ding, Shiming Xiang, Chunhong Pan
Title: UNIP: Rethinking Pre-trained Attention Patterns for Infrared Semantic Segmentation
Abstract:
Pre-training techniques significantly enhance the performance of semantic segmentation tasks with limited training data. However, the efficacy under a large domain gap between pre-training (e.g. RGB) and fine-tuning (e.g. infrared) remains underexplored. In this study, we first benchmark the infrared semantic segmentation performance of various pre-training methods and reveal several phenomena distinct from the RGB domain. Next, our layerwise analysis of pre-trained attention maps uncovers that: (1) There are three typical attention patterns (local, hybrid, and global); (2) Pre-training tasks notably influence the pattern distribution across layers; (3) The hybrid pattern is crucial for semantic segmentation as it attends to both nearby and foreground elements; (4) The texture bias impedes model generalization in infrared tasks. Building on these insights, we propose UNIP, a UNified Infrared Pre-training framework, to enhance the pre-trained model performance. This framework uses the hybrid-attention distillation NMI-HAD as the pre-training target, a large-scale mixed dataset InfMix for pre-training, and a last-layer feature pyramid network LL-FPN for fine-tuning. Experimental results show that UNIP outperforms various pre-training methods by up to 13.5\% in average mIoU on three infrared segmentation tasks, evaluated using fine-tuning and linear probing metrics. UNIP-S achieves performance on par with MAE-L while requiring only 1/10 of the computational cost. Furthermore, UNIP significantly surpasses state-of-the-art (SOTA) infrared or RGB segmentation methods and demonstrates broad potential for application in other modalities, such as RGB and depth. Our code is available at https://github.com/casiatao/UNIP.
Chinese: 预训练技术能提升有限数据下的语义分割性能,但在RGB与红外等大领域差异下的效果尚不明确,为此提出的统一红外预训练框架UNIP显著超越了现有方法,在性能和效率上均表现出色。
English: Pre-training boosts semantic segmentation with limited data, but its effectiveness across large domain gaps like RGB to infrared remains unclear, leading to the development of UNIP, a unified infrared pre-training framework that significantly outperforms existing methods in performance and efficiency.

Authors:Chenhao Zhai, Chang Meng, Yu Yang, Kexin Zhang, Xuhao Zhao, Xiu Li
Title: Combinatorial Optimization Perspective based Framework for Multi-behavior Recommendation
Abstract:
In real-world recommendation scenarios, users engage with items through various types of behaviors. Leveraging diversified user behavior information for learning can enhance the recommendation of target behaviors (e.g., buy), as demonstrated by recent multi-behavior methods. The mainstream multi-behavior recommendation framework consists of two steps: fusion and prediction. Recent approaches utilize graph neural networks for multi-behavior fusion and employ multi-task learning paradigms for joint optimization in the prediction step, achieving significant success. However, these methods have limited perspectives on multi-behavior fusion, which leads to inaccurate capture of user behavior patterns in the fusion step. Moreover, when using multi-task learning for prediction, the relationship between the target task and auxiliary tasks is not sufficiently coordinated, resulting in negative information transfer. To address these problems, we propose a novel multi-behavior recommendation framework based on the combinatorial optimization perspective, named COPF. Specifically, we treat multi-behavior fusion as a combinatorial optimization problem, imposing different constraints at various stages of each behavior to restrict the solution space, thus significantly enhancing fusion efficiency (COGCN). In the prediction step, we improve both forward and backward propagation during the generation and aggregation of multiple experts to mitigate negative transfer caused by differences in both feature and label distributions (DFME). Comprehensive experiments on three real-world datasets indicate the superiority of COPF. Further analyses also validate the effectiveness of the COGCN and DFME modules. Our code is available at https://github.com/1918190/COPF.
中文摘要:提出的COPF框架通过将多行为融合视为组合优化问题并改进多任务学习以防止负迁移,解决了多行为推荐中的现有局限。
English Summary: The proposed COPF framework addresses limitations in multi-behavior recommendation by treating behavior fusion as a combinatorial optimization problem and improving multi-task learning to prevent negative transfer.

Authors:Giovanni Birolo, Ivan Rossi, Flavio Sartori, Cesare Rollo, Tiziana Sanavia, Piero Fariselli
Title: SurvHive: a package to consistently access multiple survival-analysis packages
Abstract:
Survival analysis, a foundational tool for modeling time-to-event data, has seen growing integration with machine learning (ML) approaches to handle the complexities of censored data and time-varying risks. Despite these advances, leveraging state-of-the-art survival models remains a challenge due to the fragmented nature of existing implementations, which lack standardized interfaces and require extensive preprocessing. We introduce SurvHive, a Python-based framework designed to unify survival analysis methods within a coherent and extensible interface modeled on scikit-learn. SurvHive integrates classical statistical models with cutting-edge deep learning approaches, including transformer-based architectures and parametric survival models. Using a consistent API, SurvHive simplifies model training, evaluation, and optimization, significantly reducing the barrier to entry for ML practitioners exploring survival analysis. The package includes enhanced support for hyper-parameter tuning, time-dependent risk evaluation metrics, and cross-validation strategies tailored to censored data. With its extensibility and focus on usability, SurvHive provides a bridge between survival analysis and the broader ML community, facilitating advancements in time-to-event modeling across domains. The SurvHive code and documentation are available freely at https://github.com/compbiomed-unito/survhive.
中文: SurvHive是一个基于Python的框架,通过类scikit-learn的统一接口整合了传统统计方法与深度学习技术,显著降低了生存分析的应用门槛,并提供专门处理删失数据的优化工具。
English: SurvHive is a Python framework that unifies classical and modern survival analysis methods with a scikit-learn-inspired interface, simplifying model development and evaluation while providing specialized tools for handling censored data.

Authors:Dexiong Chen, Markus Krimmel, Karsten Borgwardt
Title: Flatten Graphs as Sequences: Transformers are Scalable Graph Generators
Abstract:
We introduce AutoGraph, a scalable autoregressive model for attributed graph generation using decoder-only transformers. By flattening graphs into random sequences of tokens through a reversible process, AutoGraph enables modeling graphs as sequences without relying on additional node features that are expensive to compute, in contrast to diffusion-based approaches. This results in sampling complexity and sequence lengths that scale optimally linearly with the number of edges, making it scalable and efficient for large, sparse graphs. A key success factor of AutoGraph is that its sequence prefixes represent induced subgraphs, creating a direct link to sub-sentences in language modeling. Empirically, AutoGraph achieves state-of-the-art performance on synthetic and molecular benchmarks, with up to 100x faster generation and 3x faster training than leading diffusion models. It also supports substructure-conditioned generation without fine-tuning and shows promising transferability, bridging language modeling and graph generation to lay the groundwork for graph foundation models. Our code is available at https://github.com/BorgwardtLab/AutoGraph.
中文: AutoGraph是一种可扩展的自回归模型,通过将图表示为令牌序列,利用仅解码器变换器高效生成属性图,在合成和分子基准测试中达到最先进性能,训练和生成速度远超扩散模型。
English: AutoGraph is a scalable autoregressive model that uses decoder-only transformers to generate attributed graphs efficiently by representing them as token sequences, achieving state-of-the-art performance with significantly faster training and generation than diffusion models.

Authors:Peiyan Hu, Xiaowei Qian, Wenhao Deng, Rui Wang, Haodong Feng, Ruiqi Feng, Tao Zhang, Long Wei, Yue Wang, Zhi-Ming Ma, Tailin Wu
Title: From Uncertain to Safe: Conformal Fine-Tuning of Diffusion Models for Safe PDE Control
Abstract:
The application of deep learning for partial differential equation (PDE)-constrained control is gaining increasing attention. However, existing methods rarely consider safety requirements crucial in real-world applications. To address this limitation, we propose Safe Diffusion Models for PDE Control (SafeDiffCon), which introduce the uncertainty quantile as model uncertainty quantification to achieve optimal control under safety constraints through both post-training and inference phases. Firstly, our approach post-trains a pre-trained diffusion model to generate control sequences that better satisfy safety constraints while achieving improved control objectives via a reweighted diffusion loss, which incorporates the uncertainty quantile estimated using conformal prediction. Secondly, during inference, the diffusion model dynamically adjusts both its generation process and parameters through iterative guidance and fine-tuning, conditioned on control targets while simultaneously integrating the estimated uncertainty quantile. We evaluate SafeDiffCon on three control tasks: 1D Burgers' equation, 2D incompressible fluid, and controlled nuclear fusion problem. Results demonstrate that SafeDiffCon is the only method that satisfies all safety constraints, whereas other classical and deep learning baselines fail. Furthermore, while adhering to safety constraints, SafeDiffCon achieves the best control performance. The code can be found at https://github.com/AI4Science-WestlakeU/safediffcon.
中文摘要:提出的SafeDiffCon框架在深度学习求解偏微分方程控制问题中引入基于不确定度分位数的安全约束,通过训练后优化和推理调整,在多种控制任务中实现安全约束下的最优控制性能。
English Summary: The proposed SafeDiffCon framework introduces uncertainty quantile-based safety constraints in deep learning for PDE control, achieving optimal performance while ensuring safety through post-training and inference adjustments across various control tasks.

Authors:Fei Wang, Kun Li, Yiqi Nie, Zhangling Duan, Peng Zou, Zhiliang Wu, Yuwei Wang, Yanyan Wei
Title: Exploiting Ensemble Learning for Cross-View Isolated Sign Language Recognition
Abstract:
In this paper, we present our solution to the Cross-View Isolated Sign Language Recognition (CV-ISLR) challenge held at WWW 2025. CV-ISLR addresses a critical issue in traditional Isolated Sign Language Recognition (ISLR), where existing datasets predominantly capture sign language videos from a frontal perspective, while real-world camera angles often vary. To accurately recognize sign language from different viewpoints, models must be capable of understanding gestures from multiple angles, making cross-view recognition challenging. To address this, we explore the advantages of ensemble learning, which enhances model robustness and generalization across diverse views. Our approach, built on a multi-dimensional Video Swin Transformer model, leverages this ensemble strategy to achieve competitive performance. Finally, our solution ranked 3rd in both the RGB-based ISLR and RGB-D-based ISLR tracks, demonstrating the effectiveness in handling the challenges of cross-view recognition. The code is available at: https://github.com/Jiafei127/CV_ISLR_WWW2025.
中文摘要:我们针对跨视角孤立手语识别挑战提出的解决方案,采用基于视频Swin Transformer的集成学习方法,通过增强模型对不同视角的泛化能力,在两项赛道中均获得第三名。
English Summary: Our solution for the Cross-View Isolated Sign Language Recognition challenge employs an ensemble learning approach with a Video Swin Transformer model, achieving third place in both competition tracks by effectively recognizing signs from varied camera angles.

Authors:Daniel Tamayo, Aitor Gonzalez-Agirre, Javier Hernando, Marta Villegas
Title: Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge
Abstract:
Recent research has explored methods for updating and modifying factual knowledge in large language models, often focusing on specific multi-layer perceptron blocks. This study expands on this work by examining the effectiveness of existing knowledge editing methods across languages and delving into the role of attention mechanisms in this process. Drawing from the insights gained, we propose Mass-Editing Memory with Attention in Transformers (MEMAT), a method that achieves significant improvements in all metrics while requiring minimal parameter modifications. MEMAT delivers a remarkable 10% increase in magnitude metrics, benefits languages not included in the training data and also demonstrates a high degree of portability. Our code and data are at https://github.com/dtamayo-nlp/MEMAT.
中文摘要:本研究提出MEMAT新方法,通过利用注意力机制显著提升大语言模型的知识编辑能力,在仅需少量参数修改的情况下实现指标10%的提升,并能惠及未训练语言。
English Summary: This study introduces MEMAT, a novel method that significantly enhances knowledge editing in large language models by leveraging attention mechanisms, achieving a 10% improvement in metrics while requiring minimal parameter changes and benefiting untrained languages.

Authors:Ruiqi Feng, Chenglei Yu, Wenhao Deng, Peiyan Hu, Tailin Wu
Title: On the Guidance of Flow Matching
Abstract:
Flow matching has shown state-of-the-art performance in various generative tasks, ranging from image generation to decision-making, where generation under energy guidance (abbreviated as guidance in the following) is pivotal. However, the guidance of flow matching is more general than and thus substantially different from that of its predecessor, diffusion models. Therefore, the challenge in guidance for general flow matching remains largely underexplored. In this paper, we propose the first framework of general guidance for flow matching. From this framework, we derive a family of guidance techniques that can be applied to general flow matching. These include a new training-free asymptotically exact guidance, novel training losses for training-based guidance, and two classes of approximate guidance that cover classical gradient guidance methods as special cases. We theoretically investigate these different methods to give a practical guideline for choosing suitable methods in different scenarios. Experiments on synthetic datasets, image inverse problems, and offline reinforcement learning demonstrate the effectiveness of our proposed guidance methods and verify the correctness of our flow matching guidance framework. Code to reproduce the experiments can be found at https://github.com/AI4Science-WestlakeU/flow_guidance.
This paper introduces the first comprehensive framework for general guidance in flow matching, deriving various guidance techniques including training-free exact guidance, novel training-based methods, and approximate approaches, with theoretical analysis and experimental validation across multiple domains.
English Summary:

Authors:Yuan Gao, Mattia Piccinini, Korbinian Moller, Amr Alanwar, Johannes Betz
Title: From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios
Abstract:
Ensuring the safety of autonomous vehicles requires virtual scenario-based testing, which depends on the robust evaluation and generation of safety-critical scenarios. So far, researchers have used scenario-based testing frameworks that rely heavily on handcrafted scenarios as safety metrics. To reduce the effort of human interpretation and overcome the limited scalability of these approaches, we combine Large Language Models (LLMs) with structured scenario parsing and prompt engineering to automatically evaluate and generate safety-critical driving scenarios. We introduce Cartesian and Ego-centric prompt strategies for scenario evaluation, and an adversarial generation module that modifies trajectories of risk-inducing vehicles (ego-attackers) to create critical scenarios. We validate our approach using a 2D simulation framework and multiple pre-trained LLMs. The results show that the evaluation module effectively detects collision scenarios and infers scenario safety. Meanwhile, the new generation module identifies high-risk agents and synthesizes realistic, safety-critical scenarios. We conclude that an LLM equipped with domain-informed prompting techniques can effectively evaluate and generate safety-critical driving scenarios, reducing dependence on handcrafted metrics. We release our open-source code and scenarios at: https://github.com/TUM-AVS/From-Words-to-Collisions.
中文摘要:本研究将大型语言模型与结构化场景解析及提示工程相结合,自动评估和生成安全关键驾驶场景,有效降低对人工方法的依赖,并在碰撞检测和真实场景合成方面展现出卓越性能。
English Summary: This study integrates Large Language Models with structured parsing and prompt engineering to automatically evaluate and generate safety-critical driving scenarios, effectively reducing reliance on manual methods while demonstrating high performance in collision detection and realistic scenario synthesis.

Authors:Alan Oursland
Title: Neural Networks Learn Distance Metrics
Abstract:
Neural networks may naturally favor distance-based representations, where smaller activations indicate closer proximity to learned prototypes. This contrasts with intensity-based approaches, which rely on activation magnitudes. To test this hypothesis, we conducted experiments with six MNIST architectural variants constrained to learn either distance or intensity representations. Our results reveal that the underlying representation affects model performance. We develop a novel geometric framework that explains these findings and introduce OffsetL2, a new architecture based on Mahalanobis distance equations, to further validate this framework. This work highlights the importance of considering distance-based learning in neural network design.
中文摘要:神经网络天然倾向于基于距离的表征而非强度表征,通过MNIST架构变体的实验验证了这一点,并提出了几何框架和OffsetL2新架构来支持这一发现,强调了距离学习在神经网络设计中的重要性。
English Summary: Neural networks inherently favor distance-based representations over intensity-based ones, as demonstrated by experiments with MNIST variants, leading to the development of a geometric framework and the OffsetL2 architecture to validate this approach.

Authors:Zaid Ilyas, Arooba Maqsood, Afsah Saleem, Erchuan Zhang, David Suter, Parminder Raina, Jonathan M. Hodgson, John T. Schousboe, William D. Leslie, Joshua R. Lewis, Syed Zulqarnain Gilani
Title: VerteNet -- A Multi-Context Hybrid CNN Transformer for Accurate Vertebral Landmark Localization in Lateral Spine DXA Images
Abstract:
Lateral Spine Image (LSI) analysis is important for medical diagnosis, treatment planning, and detailed spinal health assessments. Although modalities like Computed Tomography and Digital X-ray Imaging are commonly used, Dual Energy X-ray Absorptiometry (DXA) is often preferred due to lower radiation exposure, seamless capture, and cost-effectiveness. Accurate Vertebral Landmark Localization (VLL) on LSIs is important to detect spinal conditions like kyphosis and lordosis, as well as assessing Abdominal Aortic Calcification (AAC) using Inter-Vertebral Guides (IVGs). Nonetheless, few automated VLL methodologies have concentrated on DXA LSIs. We present VerteNet, a hybrid CNN-Transformer model featuring a novel dual-resolution attention mechanism in self and cross-attention domains, referred to as Dual Resolution Self-Attention (DRSA) and Dual Resolution Cross-Attention (DRCA). These mechanisms capture the diverse frequencies in DXA images by operating at two different feature map resolutions. Additionally, we design a Multi-Context Feature Fusion Block (MCFB) that efficiently integrates the features using DRSA and DRCA. We train VerteNet on 620 DXA LSIs from various machines and achieve superior results compared to existing methods. We also design an algorithm that utilizes VerteNet's predictions in estimating the Region of Interest (ROI) to detect potential abdominal aorta cropping, where inadequate soft tissue hinders calcification assessment. Additionally, we present a small proof-of-concept study to show that IVGs generated from VLL information can improve inter-reader correlation in AAC scoring, addressing two key areas of disagreement in expert AAC-24 scoring: IVG placement and quality control for full abdominal aorta assessment. The code for this work can be found at https://github.com/zaidilyas89/VerteNet.
Chinese: VerteNet是一种混合CNN-Transformer模型,通过双分辨率注意力机制在DXA脊柱侧位图像上实现精准椎骨定位,其性能优于现有方法,并能有效提升腹主动脉钙化评估的准确性。
English: VerteNet is a hybrid CNN-Transformer model that introduces dual-resolution attention mechanisms for accurate vertebral landmark localization on DXA lateral spine images, achieving superior performance and enabling improved abdominal aortic calcification assessment.

Authors:Mokshagna Sai Teja Karanam, Krithika Iyer, Sarang Joshi, Shireen Elhabian
Title: MORPH-LER: Log-Euclidean Regularization for Population-Aware Image Registration
Abstract:
Spatial transformations that capture population-level morphological statistics are critical for medical image analysis. Commonly used smoothness regularizers for image registration fail to integrate population statistics, leading to anatomically inconsistent transformations. Inverse consistency regularizers promote geometric consistency but lack population morphometrics integration. Regularizers that constrain deformation to low-dimensional manifold methods address this. However, they prioritize reconstruction over interpretability and neglect diffeomorphic properties, such as group composition and inverse consistency. We introduce MORPH-LER, a Log-Euclidean regularization framework for population-aware unsupervised image registration. MORPH-LER learns population morphometrics from spatial transformations to guide and regularize registration networks, ensuring anatomically plausible deformations. It features a bottleneck autoencoder that computes the principal logarithm of deformation fields via iterative square-root predictions. It creates a linearized latent space that respects diffeomorphic properties and enforces inverse consistency. By integrating a registration network with a diffeomorphic autoencoder, MORPH-LER produces smooth, meaningful deformation fields. The framework offers two main contributions: (1) a data-driven regularization strategy that incorporates population-level anatomical statistics to enhance transformation validity and (2) a linearized latent space that enables compact and interpretable deformation fields for efficient population morphometrics analysis. We validate MORPH-LER across two families of deep learning-based registration networks, demonstrating its ability to produce anatomically accurate, computationally efficient, and statistically meaningful transformations on the OASIS-1 brain imaging dataset. https://github.com/iyerkrithika21/MORPH_LER
中文摘要:MORPH-LER是一种新颖的对数欧几里得正则化框架,通过引入保持微分同胚特性的线性化潜空间,将群体形态统计融入无监督图像配准,确保产生解剖学合理的形变场。
English Summary: MORPH-LER is a novel Log-Euclidean regularization framework that integrates population morphometrics into unsupervised image registration, ensuring anatomically plausible deformations through a diffeomorphic autoencoder with a linearized latent space.

Authors:Hanlin Wu, Yuxuan Song, Jingjing Gong, Ziyao Cao, Yawen Ouyang, Jianbing Zhang, Hao Zhou, Wei-Ying Ma, Jingjing Liu
Title: A Periodic Bayesian Flow for Material Generation
Abstract:
Generative modeling of crystal data distribution is an important yet challenging task due to the unique periodic physical symmetry of crystals. Diffusion-based methods have shown early promise in modeling crystal distribution. More recently, Bayesian Flow Networks were introduced to aggregate noisy latent variables, resulting in a variance-reduced parameter space that has been shown to be advantageous for modeling Euclidean data distributions with structural constraints (Song et al., 2023). Inspired by this, we seek to unlock its potential for modeling variables located in non-Euclidean manifolds e.g. those within crystal structures, by overcoming challenging theoretical issues. We introduce CrysBFN, a novel crystal generation method by proposing a periodic Bayesian flow, which essentially differs from the original Gaussian-based BFN by exhibiting non-monotonic entropy dynamics. To successfully realize the concept of periodic Bayesian flow, CrysBFN integrates a new entropy conditioning mechanism and empirically demonstrates its significance compared to time-conditioning. Extensive experiments over both crystal ab initio generation and crystal structure prediction tasks demonstrate the superiority of CrysBFN, which consistently achieves new state-of-the-art on all benchmarks. Surprisingly, we found that CrysBFN enjoys a significant improvement in sampling efficiency, e.g., ~100x speedup 10 v.s. 2000 steps network forwards) compared with previous diffusion-based methods on MP-20 dataset. Code is available at https://github.com/wu-han-lin/CrysBFN.
中文: CrysBFN提出了一种新颖的周期性贝叶斯流方法,通过解决非欧几里得流形建模的理论难题,在晶体生成任务中实现了最优性能,并显著提升了采样效率。
English: CrysBFN introduces a novel periodic Bayesian flow for crystal generation, overcoming theoretical challenges to model non-Euclidean manifolds and achieving state-of-the-art performance with significantly improved sampling efficiency.

Authors:Haohan Zou, Jie Feng, Hao Zhao, Yuanyuan Shi
Title: Analytical Lyapunov Function Discovery: An RL-based Generative Approach
Abstract:
Despite advances in learning-based methods, finding valid Lyapunov functions for nonlinear dynamical systems remains challenging. Current neural network approaches face two main issues: challenges in scalable verification and limited interpretability. To address these, we propose an end-to-end framework using transformers to construct analytical Lyapunov functions (local), which simplifies formal verification, enhances interpretability, and provides valuable insights for control engineers. Our framework consists of a transformer-based trainer that generates candidate Lyapunov functions and a falsifier that verifies candidate expressions and refines the model via risk-seeking policy gradient. Unlike Alfarano et al. (2024), which utilizes pre-training and seeks global Lyapunov functions for low-dimensional systems, our model is trained from scratch via reinforcement learning (RL) and succeeds in finding local Lyapunov functions for high-dimensional and non-polynomial systems. Given the analytical nature of the candidates, we employ efficient optimization methods for falsification during training and formal verification tools for the final verification. We demonstrate the efficiency of our approach on a range of nonlinear dynamical systems with up to ten dimensions and show that it can discover Lyapunov functions not previously identified in the control literature. Full implementation is available on \href{https://github.com/JieFeng-cse/Analytical-Lyapunov-Function-Discovery}{Github}
中文: 本文提出了一种基于Transformer的端到端框架,通过强化学习和反证机制为高维非线性系统生成解析李雅普诺夫函数,实现了可扩展的验证并提升了可解释性。
English: This paper introduces an end-to-end transformer-based framework that generates analytical Lyapunov functions for high-dimensional nonlinear systems, enabling scalable verification and improved interpretability through reinforcement learning and falsification mechanisms.

Authors:Jianze Li, Jiezhang Cao, Yong Guo, Wenbo Li, Yulun Zhang
Title: One Diffusion Step to Real-World Super-Resolution via Flow Trajectory Distillation
Abstract:
Diffusion models (DMs) have significantly advanced the development of real-world image super-resolution (Real-ISR), but the computational cost of multi-step diffusion models limits their application. One-step diffusion models generate high-quality images in a one sampling step, greatly reducing computational overhead and inference latency. However, most existing one-step diffusion methods are constrained by the performance of the teacher model, where poor teacher performance results in image artifacts. To address this limitation, we propose FluxSR, a novel one-step diffusion Real-ISR technique based on flow matching models. We use the state-of-the-art diffusion model FLUX.1-dev as both the teacher model and the base model. First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR. Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss and introduce Attention Diversification Loss (ADL) as a regularization term to reduce token similarity in transformer, thereby eliminating high-frequency artifacts. Comprehensive experiments demonstrate that our method outperforms existing one-step diffusion-based Real-ISR methods. The code and model will be released at https://github.com/JianzeLi-114/FluxSR.
Chinese: FluxSR提出了一种基于流匹配模型的新型单步扩散真实图像超分辨率技术,通过引入流轨迹蒸馏和感知损失等方法克服了教师模型限制并消除伪影,在性能上超越了现有单步扩散方法。
English: FluxSR introduces a novel one-step diffusion technique for real-world image super-resolution, utilizing flow matching models and innovative losses to overcome teacher model limitations and eliminate artifacts, achieving superior performance over existing methods.

Authors:Wenhao Zheng, Yixiao Chen, Weitong Zhang, Souvik Kundu, Yun Li, Zhengzhong Liu, Eric P. Xing, Hongyi Wang, Huaxiu Yao
Title: CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing
Abstract:
Large language models have achieved remarkable success in various tasks but suffer from high computational costs during inference, limiting their deployment in resource-constrained applications. To address this issue, we propose a novel Collaborative Inference with Token-lEvel Routing (CITER) framework that enables efficient collaboration between small and large language models (SLMs \& LLMs) through a token-level routing strategy. Specifically, CITER routes non-critical tokens to an SLM for efficiency and routes critical tokens to an LLM for generalization quality. We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation. This allows the router to learn to predict token-level routing scores and make routing decisions based on both the current token and the future impact of its decisions. To further accelerate the reward evaluation process, we introduce a shortcut which significantly reduces the costs of the reward estimation and improving the practicality of our approach. Extensive experiments on five benchmark datasets demonstrate that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications. Our data and code are available at https://github.com/aiming-lab/CITER.
中文摘要:CITER框架通过令牌级路由策略,将非关键令牌分配给小型语言模型以提高效率,关键令牌分配给大型模型保证质量,在降低推理成本的同时保持生成内容的高水准。
English Summary: The CITER framework optimizes inference efficiency by routing non-critical tokens to a small language model for speed and critical tokens to a large model for accuracy, balancing cost and quality through token-level decisions.

Authors:Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, Yang Liu
Title: Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning
Abstract:
Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity. While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens within a sample can vary significantly. After pre-training, even in high-quality samples, patterns or phrases that are not task-related can be redundant, uninformative, or even harmful. Continuing to fine-tune on these patterns may offer limited benefit and even degrade downstream task performance. In this paper, we investigate token quality from a noisy-label perspective and propose a generic token cleaning pipeline for SFT tasks. Our method filters out uninformative tokens while preserving those carrying key task-specific information. Specifically, we first evaluate token quality by examining the influence of model updates on each token, then apply a threshold-based separation. The token influence can be measured in a single pass with a fixed reference model or iteratively with self-evolving reference models. The benefits and limitations of both methods are analyzed theoretically by error upper bounds. Extensive experiments show that our framework consistently improves downstream performance. Code is available at https://github.com/UCSC-REAL/TokenCleaning.
Chinese: 近期研究表明,在大语言模型的监督微调中,数据质量比数量更重要,本文提出了一种基于噪声标签视角的通用令牌清洗流程,通过过滤非信息性令牌来提升下游任务性能。
English: Recent research reveals that in supervised fine-tuning of large language models, token-level data quality is more critical than quantity, leading to the development of a token cleaning pipeline that filters uninformative tokens to enhance downstream task performance.

Authors:Jingjing Liu, Li Zhang, Xiaoyang Zeng, Wanquan Liu, Jianhua Zhang
Title: MATCNN: Infrared and Visible Image Fusion Method Based on Multi-scale CNN with Attention Transformer
Abstract:
While attention-based approaches have shown considerable progress in enhancing image fusion and addressing the challenges posed by long-range feature dependencies, their efficacy in capturing local features is compromised by the lack of diverse receptive field extraction techniques. To overcome the shortcomings of existing fusion methods in extracting multi-scale local features and preserving global features, this paper proposes a novel cross-modal image fusion approach based on a multi-scale convolutional neural network with attention Transformer (MATCNN). MATCNN utilizes the multi-scale fusion module (MSFM) to extract local features at different scales and employs the global feature extraction module (GFEM) to extract global features. Combining the two reduces the loss of detail features and improves the ability of global feature representation. Simultaneously, an information mask is used to label pertinent details within the images, aiming to enhance the proportion of preserving significant information in infrared images and background textures in visible images in fused images. Subsequently, a novel optimization algorithm is developed, leveraging the mask to guide feature extraction through the integration of content, structural similarity index measurement, and global feature loss. Quantitative and qualitative evaluations are conducted across various datasets, revealing that MATCNN effectively highlights infrared salient targets, preserves additional details in visible images, and achieves better fusion results for cross-modal images. The code of MATCNN will be available at https://github.com/zhang3849/MATCNN.git.
中文: 本文提出MATCNN跨模态图像融合方法,通过结合多尺度卷积网络与注意力Transformer,有效提取局部与全局特征,在融合图像中增强细节保留与特征表征能力。
English: This paper introduces MATCNN, a cross-modal image fusion method that combines multi-scale convolutional networks with attention Transformers to effectively capture both local and global features, improving detail preservation and feature representation in fused images.

Authors:Angelina Wang, Michelle Phan, Daniel E. Ho, Sanmi Koyejo
Title: Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs
Abstract:
Algorithmic fairness has conventionally adopted the mathematically convenient perspective of racial color-blindness (i.e., difference unaware treatment). However, we contend that in a range of important settings, group difference awareness matters. For example, differentiating between groups may be necessary in legal contexts (e.g., the U.S. compulsory draft applies to men but not women) and harm assessments (e.g., referring to girls as ``terrorists'' may be less harmful than referring to Muslim people as such). Thus, in contrast to most fairness work, we study fairness through the perspective of treating people differently -- when it is contextually appropriate to. We first introduce an important distinction between descriptive (fact-based), normative (value-based), and correlation (association-based) benchmarks. This distinction is significant because each category requires separate interpretation and mitigation tailored to its specific characteristics. Then, we present a benchmark suite composed of eight different scenarios for a total of 16k questions that enables us to assess difference awareness. Finally, we show results across ten models that demonstrate difference awareness is a distinct dimension to fairness where existing bias mitigation strategies may backfire.
中文摘要:该研究主张算法公平性应关注群体差异而非采用色盲方法,通过引入新基准和广泛测试表明,在这一独特的公平维度上,现有的偏见缓解策略可能适得其反。
English Summary: The study argues that algorithmic fairness should incorporate group difference awareness rather than color-blind approaches, introducing new benchmarks and demonstrating through extensive testing that existing bias mitigation methods can be counterproductive in this distinct dimension of fairness.

Authors:Avery Ma, Yangchen Pan, Amir-massoud Farahmand
Title: PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
Abstract:
Many-shot jailbreaking circumvents the safety alignment of LLMs by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational exchanges between the user and the model. These exchanges are randomly sampled from a pool of unsafe question-answer pairs, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with Positive Affirmations, Negative Demonstrations, and an optimized Adaptive Sampling method tailored to the target prompt's topic. We also introduce ManyHarm, a dataset of harmful question-answer pairs, and demonstrate through extensive experiments that PANDAS significantly outperforms baseline methods in long-context scenarios. Through attention analysis, we provide insights into how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.
Chinese: 本文提出PANDAS混合技术,通过整合积极肯定、负面演示和自适应采样来增强多轮越狱攻击,并在长上下文场景中验证其优于基线方法的性能。
English: The paper introduces PANDAS, a hybrid technique that enhances many-shot jailbreaking by incorporating positive affirmations, negative demonstrations, and adaptive sampling, and demonstrates its superior performance over baseline methods in long-context scenarios.

Authors:Chia-Wen Kuo, Sijie Zhu, Fan Chen, Xiaohui Shen, Longyin Wen
Title: D-Attn: Decomposed Attention for Large Vision-and-Language Models
Abstract:
Large vision-and-language models (LVLMs) have traditionally integrated visual and textual tokens by concatenating them into a single homogeneous input for large language models (LLMs), thereby maximally preserving the pre-trained language capabilities. However, this constrained architecture for visual and textual tokens restricts the design space for processing visual tokens, potentially leading to suboptimal performance and efficiency. In this paper, we propose Decomposed Attention (D-Attn), a more flexible attention architecture for LVLMs, which enables modification of visual token operations without affecting textual-to-textual attention. D-Attn decomposes the 1-D causal self-attention of LVLMs into visual-to-visual, textual-to-visual, and textual-to-textual attentions, and the visual and textual output tokens from the decomposed attentions are merged with a carefully derived weighting strategy, namely $α$-weighting. Taking advantage of the flexibility, we are able to introduce two critical improvements in visual token processing while maintaining the capacity of pre-trained LLMs: 1) We rectify the biased positional encoding in textual-to-visual attention to boost visual understanding performance. 2) We diagonalize visual-to-visual attention to reduce computation complexity from $O(|V|^2)$ to $O(|V|)$ for $|V|$ visual tokens without compromising performance. Extensive experiments and analysis validate the effectiveness of D-Attn, demonstrating significant improvements on multiple image benchmarks while significantly reducing computational costs (\eg, $5\times$ faster). Code will be available at https://github.com/bytedance/DecomposedAttention.
Chinese: 本文提出分解注意力(D-Attn)架构,通过分离视觉与文本标记处理,在保持预训练语言能力的同时显著提升了视觉理解性能并大幅降低了计算复杂度。
English: This paper introduces Decomposed Attention (D-Attn), a flexible architecture for large vision-and-language models that separates visual and textual token processing, enabling key improvements in visual understanding and computational efficiency while preserving pre-trained language capabilities.

Authors:Dazhou Yu, Genpei Zhang, Liang Zhao
Title: PolyhedronNet: Representation Learning for Polyhedra with Surface-attributed Graph
Abstract:
Ubiquitous geometric objects can be precisely and efficiently represented as polyhedra. The transformation of a polyhedron into a vector, known as polyhedra representation learning, is crucial for manipulating these shapes with mathematical and statistical tools for tasks like classification, clustering, and generation. Recent years have witnessed significant strides in this domain, yet most efforts focus on the vertex sequence of a polyhedron, neglecting the complex surface modeling crucial in real-world polyhedral objects. This study proposes \textbf{PolyhedronNet}, a general framework tailored for learning representations of 3D polyhedral objects. We propose the concept of the surface-attributed graph to seamlessly model the vertices, edges, faces, and their geometric interrelationships within a polyhedron. To effectively learn the representation of the entire surface-attributed graph, we first propose to break it down into local rigid representations to effectively learn each local region's relative positions against the remaining regions without geometric information loss. Subsequently, we propose PolyhedronGNN to hierarchically aggregate the local rigid representation via intra-face and inter-face geometric message passing modules, to obtain a global representation that minimizes information loss while maintaining rotation and translation invariance. Our experimental evaluations on four distinct datasets, encompassing both classification and retrieval tasks, substantiate PolyhedronNet's efficacy in capturing comprehensive and informative representations of 3D polyhedral objects. Code and data are available at {https://github.com/dyu62/3D_polyhedron}.
中文: 本研究提出PolyhedronNet框架,通过表面属性图建模多面体结构,并采用几何消息传递机制学习鲁棒的三维表示,在分类和检索任务中展现出卓越性能。
English: This study introduces PolyhedronNet, a framework that models polyhedra as surface-attributed graphs and employs geometric message passing to learn robust 3D representations, demonstrating superior performance in classification and retrieval tasks.

Authors:Haruka Kiyohara, Fan Yao, Sarah Dean
Title: Policy Design for Two-sided Platforms with Participation Dynamics
Abstract:
In two-sided platforms (e.g., video streaming or e-commerce), viewers and providers engage in interactive dynamics: viewers benefit from increases in provider populations, while providers benefit from increases in viewer population. Despite the importance of such "population effects" on long-term platform health, recommendation policies do not generally take the participation dynamics into account. This paper thus studies the dynamics and recommender policy design on two-sided platforms under the population effects for the first time. Our control- and game-theoretic findings warn against the use of the standard "myopic-greedy" policy and shed light on the importance of provider-side considerations (i.e., effectively distributing exposure among provider groups) to improve social welfare via population growth. We also present a simple algorithm to optimize long-term social welfare by taking the population effects into account, and demonstrate its effectiveness in synthetic and real-data experiments. Our experiment code is available at https://github.com/sdean-group/dynamics-two-sided-market.
中文: 本文首次研究双边平台中人口效应下的动态与推荐策略,提出通过优化供应商曝光分配来提升长期社会福利的算法,并在实验中验证其有效性。
English: This paper introduces a novel recommendation policy that accounts for population effects in two-sided platforms, demonstrating through experiments that optimizing provider exposure distribution enhances long-term social welfare.

Authors:Zhengtong Xu, Qiang Qiu, Yu She
Title: VILP: Imitation Learning with Latent Video Planning
Abstract:
In the era of generative AI, integrating video generation models into robotics opens new possibilities for the general-purpose robot agent. This paper introduces imitation learning with latent video planning (VILP). We propose a latent video diffusion model to generate predictive robot videos that adhere to temporal consistency to a good degree. Our method is able to generate highly time-aligned videos from multiple views, which is crucial for robot policy learning. Our video generation model is highly time-efficient. For example, it can generate videos from two distinct perspectives, each consisting of six frames with a resolution of 96x160 pixels, at a rate of 5 Hz. In the experiments, we demonstrate that VILP outperforms the existing video generation robot policy across several metrics: training costs, inference speed, temporal consistency of generated videos, and the performance of the policy. We also compared our method with other imitation learning methods. Our findings indicate that VILP can rely less on extensive high-quality task-specific robot action data while still maintaining robust performance. In addition, VILP possesses robust capabilities in representing multi-modal action distributions. Our paper provides a practical example of how to effectively integrate video generation models into robot policies, potentially offering insights for related fields and directions. For more details, please refer to our open-source repository https://github.com/ZhengtongXu/VILP.
中文: 本文提出VILP方法,通过隐空间视频扩散模型生成时序一致的多视角机器人视频,在降低对高质量动作数据依赖的同时,在训练成本、推理速度和策略性能上均优于现有方法。
English: This paper introduces VILP, a latent video diffusion model for imitation learning that generates temporally consistent, multi-view robot videos efficiently, outperforming existing methods in training cost, speed, and policy performance while reducing reliance on high-quality action data.

Authors:Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, Song Han
Title: Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
Abstract:
Diffusion Transformers (DiTs) dominate video generation but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D Full Attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy to capture the dynamic sparse patterns and predicts the type of attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality. Our code is open-sourced and is available at https://github.com/svg-project/Sparse-VideoGen
中文:提出的Sparse VideoGen(SVG)框架通过动态识别扩散变换器中的空间与时间注意力模式,在保持生成质量的同时将视频生成速度提升超过2倍。
English: The proposed Sparse VideoGen (SVG) framework dynamically identifies spatial and temporal attention patterns in Diffusion Transformers to reduce computational costs, achieving over 2x speedup in video generation while maintaining quality.

Authors:Dmitry Manning-Coe, Jacopo Gliozzi, Alexander G. Stapleton, Edward Hirst, Giuseppe De Tomasi, Barry Bradlyn, David S. Berman
Title: Grokking vs. Learning: Same Features, Different Encodings
Abstract:
Grokking typically achieves similar loss to ordinary, "steady", learning. We ask whether these different learning paths - grokking versus ordinary training - lead to fundamental differences in the learned models. To do so we compare the features, compressibility, and learning dynamics of models trained via each path in two tasks. We find that grokked and steadily trained models learn the same features, but there can be large differences in the efficiency with which these features are encoded. In particular, we find a novel "compressive regime" of steady training in which there emerges a linear trade-off between model loss and compressibility, and which is absent in grokking. In this regime, we can achieve compression factors 25x times the base model, and 5x times the compression achieved in grokking. We then track how model features and compressibility develop through training. We show that model development in grokking is task-dependent, and that peak compressibility is achieved immediately after the grokking plateau. Finally, novel information-geometric measures are introduced which demonstrate that models undergoing grokking follow a straight path in information space.
Chinese: 顿悟式学习与普通训练在损失上相近,但特征编码效率差异显著,其中普通训练特有的压缩机制可实现比顿悟学习高得多的模型压缩率。
English: Grokking and ordinary training yield similar loss but differ in feature encoding efficiency, with steady training exhibiting a unique compressive regime that allows for significantly greater model compressibility than grokking.

Authors:Shilong Hong, Yanzhou Zhou, Weichao Xu
Title: DAGNet: A Dual-View Attention-Guided Network for Efficient X-ray Security Inspection
Abstract:
With the rapid development of modern transportation systems and the exponential growth of logistics volumes, intelligent X-ray-based security inspection systems play a crucial role in public safety. Although single-view X-ray baggage scanner is widely deployed, they struggles to accurately identify contraband in complex stacking scenarios due to strong viewpoint dependency and inadequate feature representation. To address this, we propose a Dual-View Attention-Guided Network for Efficient X-ray Security Inspection (DAGNet). This study builds on a shared-weight backbone network as the foundation and constructs three key modules that work together: (1) Frequency Domain Interaction Module (FDIM) dynamically enhances features by adjusting frequency components based on inter-view relationships; (2) Dual-View Hierarchical Enhancement Module (DVHEM) employs cross-attention to align features between views and capture hierarchical associations; (3) Convolutional Guided Fusion Module (CGFM) fuses features to suppress redundancy while retaining critical discriminative information. Collectively, these modules substantially improve the performance of dual-view X-ray security inspection. Experimental results demonstrate that DAGNet outperforms existing state-of-the-art approaches across multiple backbone architectures. The code is available at:https://github.com/ShilongHong/DAGNet.
中文摘要:针对单视角X光安检仪在复杂堆叠场景下识别违禁品的不足,本文提出DAGNet双视角注意力引导网络,通过频域交互、层次化特征对齐和卷积融合模块,显著提升了双视角X光安检性能。
English Summary: To overcome the limitations of single-view X-ray scanners in detecting contraband within complex baggage, this paper introduces DAGNet, a dual-view attention-guided network that enhances inspection accuracy through frequency domain interaction, hierarchical feature alignment, and convolutional fusion modules.

Authors:Yirui Zeng, Jun Fu, Hadi Amirpour, Huasheng Wang, Guanghui Yue, Hantao Liu, Ying Chen, Wei Zhou
Title: CLIP-DQA: Blindly Evaluating Dehazed Images from Global and Local Perspectives Using CLIP
Abstract:
Blind dehazed image quality assessment (BDQA), which aims to accurately predict the visual quality of dehazed images without any reference information, is essential for the evaluation, comparison, and optimization of image dehazing algorithms. Existing learning-based BDQA methods have achieved remarkable success, while the small scale of DQA datasets limits their performance. To address this issue, in this paper, we propose to adapt Contrastive Language-Image Pre-Training (CLIP), pre-trained on large-scale image-text pairs, to the BDQA task. Specifically, inspired by the fact that the human visual system understands images based on hierarchical features, we take global and local information of the dehazed image as the input of CLIP. To accurately map the input hierarchical information of dehazed images into the quality score, we tune both the vision branch and language branch of CLIP with prompt learning. Experimental results on two authentic DQA datasets demonstrate that our proposed approach, named CLIP-DQA, achieves more accurate quality predictions over existing BDQA methods. The code is available at https://github.com/JunFu1995/CLIP-DQA.
中文: 本文提出CLIP-DQA方法,通过结合分层特征和提示学习将对比语言-图像预训练模型应用于无参考去雾图像质量评估,在真实数据集上取得了优于现有方法的性能。
English: This paper introduces CLIP-DQA, a method that adapts Contrastive Language-Image Pre-Training (CLIP) to blind dehazed image quality assessment by incorporating hierarchical features and prompt learning, achieving superior performance over existing methods on authentic datasets.

Authors:Xianglong Yan, Tianao Zhang, Zhiteng Li, Haotong Qin, Yulun Zhang
Title: Progressive Binarization with Semi-Structured Pruning for LLMs
Abstract:
Large language models (LLMs) have achieved remarkable progress in natural language processing, but their high computational and memory costs hinder deployment on resource-constrained devices. Binarization represents the most extreme form of quantization, yet binarized models still contain redundancy that can be further removed. Pruning provides a natural way to eliminate such redundancy, but naïve combination with binarization often results in severe performance degradation. In this paper, we propose Progressive Binarization with Semi-Structured Pruning (PBS$^2$P), a novel post-training framework that seamlessly integrates binarization and semi-structured pruning. We first propose Stepwise semi-structured Pruning with Binarization Optimization (SPBO), which progressively introduces sparsity while optimizing binarization parameters to jointly reduce pruning and quantization error, yielding more stable and accurate compression. Additionally, we propose a Coarse-to-Fine Search (CFS) that first allocates pruning ratios and then refines element selection, further enhancing overall performance. Extensive experiments across multiple LLM families show that PBS$^2$P consistently outperforms state-of-the-art (SOTA) binary post-training quantization methods in both perplexity and downstream accuracy. The code and models will be available at https://github.com/XIANGLONGYAN/PBS2P.
Chinese: 提出的渐进式二值化与半结构化剪枝(PBS²P)框架将1比特模型压缩与结构化剪枝有效结合,在提升计算效率的同时保持模型性能,在困惑度和下游任务准确率上均优于现有先进方法。
English: The proposed Progressive Binarization with Semi-Structured Pruning (PBS²P) framework effectively combines 1-bit model compression with structured pruning to enhance computational efficiency while maintaining performance, outperforming existing methods in both perplexity and task accuracy.

Authors:Xianglong Yan, Tianao Zhang, Zhiteng Li, Yulun Zhang
Title: Progressive Binarization with Semi-Structured Pruning for LLMs
Abstract:
Large language models (LLMs) have achieved remarkable progress in natural language processing, but their high computational and memory costs hinder deployment on resource-constrained devices. Binarization, which reduces model weights to 1 bit, is a promising solution for efficient inference. However, binarized LLMs still exhibit redundancy that can be further compressed. Semi-structured pruning offers a favorable trade-off between model performance and hardware efficiency, but naively combining it with binarization often leads to severe performance degradation. To address this, we propose Progressive Binarization with Semi-Structured Pruning (PBS$^2$P), a novel post-training compression framework. We propose Stepwise semi-structured Pruning with Binarization Optimization (SPBO) to jointly reduce pruning and binarization error. Additionally, we develop a Coarse-to-Fine Search (CFS) strategy to more effectively select pruning elements. Extensive experiments across multiple LLM families show that PBS$^2$P consistently outperforms state-of-the-art binary post-training quantization methods in both perplexity and downstream accuracy. The code and models will be available at: https://github.com/XIANGLONGYAN/PBS2P.
Chinese: 提出的渐进式二值化与半结构化剪枝(PBS²P)框架将1比特模型压缩与结构化剪枝有效结合,在提升计算效率的同时保持模型性能,在困惑度和下游任务准确率上均优于现有先进方法。
English: The proposed Progressive Binarization with Semi-Structured Pruning (PBS²P) framework effectively combines 1-bit model compression with structured pruning to enhance computational efficiency while maintaining performance, outperforming existing methods in both perplexity and task accuracy.

Authors:Kim Yong Tan, Yueming Lyu, Ivor Tsang, Yew-Soon Ong
Title: Fast Direct: Query-Efficient Online Black-box Guidance for Diffusion-model Target Generation
Abstract:
Guided diffusion-model generation is a promising direction for customizing the generation process of a pre-trained diffusion model to address specific downstream tasks. Existing guided diffusion models either rely on training the guidance model with pre-collected datasets or require the objective functions to be differentiable. However, for most real-world tasks, offline datasets are often unavailable, and their objective functions are often not differentiable, such as image generation with human preferences, molecular generation for drug discovery, and material design. Thus, we need an $\textbf{online}$ algorithm capable of collecting data during runtime and supporting a $\textbf{black-box}$ objective function. Moreover, the $\textbf{query efficiency}$ of the algorithm is also critical because the objective evaluation of the query is often expensive in real-world scenarios. In this work, we propose a novel and simple algorithm, $\textbf{Fast Direct}$, for query-efficient online black-box target generation. Our Fast Direct builds a pseudo-target on the data manifold to update the noise sequence of the diffusion model with a universal direction, which is promising to perform query-efficient guided generation. Extensive experiments on twelve high-resolution ($\small {1024 \times 1024}$) image target generation tasks and six 3D-molecule target generation tasks show $\textbf{6}\times$ up to $\textbf{10}\times$ query efficiency improvement and $\textbf{11}\times$ up to $\textbf{44}\times$ query efficiency improvement, respectively. Our implementation is publicly available at: https://github.com/kimyong95/guide-stable-diffusion/tree/fast-direct
Chinese: 引导扩散模型在实际应用中常因缺乏离线数据集和不可微目标函数而受限,为此提出的Fast Direct算法通过在线黑盒优化实现了查询高效性,在图像和分子生成任务中分别提升了6-10倍和11-44倍的效率。
English: Guided diffusion models face challenges in real-world applications due to the unavailability of offline datasets and non-differentiable objective functions, prompting the development of Fast Direct, a query-efficient online algorithm that significantly improves efficiency in high-resolution image and 3D-molecule generation tasks.

Authors:Jiaxing Xu, Yongqiang Chen, Xia Dong, Mengcheng Lan, Tiancheng Huang, Qingtian Bian, James Cheng, Yiping Ke
Title: BrainOOD: Out-of-distribution Generalizable Brain Network Analysis
Abstract:
In neuroscience, identifying distinct patterns linked to neurological disorders, such as Alzheimer's and Autism, is critical for early diagnosis and effective intervention. Graph Neural Networks (GNNs) have shown promising in analyzing brain networks, but there are two major challenges in using GNNs: (1) distribution shifts in multi-site brain network data, leading to poor Out-of-Distribution (OOD) generalization, and (2) limited interpretability in identifying key brain regions critical to neurological disorders. Existing graph OOD methods, while effective in other domains, struggle with the unique characteristics of brain networks. To bridge these gaps, we introduce BrainOOD, a novel framework tailored for brain networks that enhances GNNs' OOD generalization and interpretability. BrainOOD framework consists of a feature selector and a structure extractor, which incorporates various auxiliary losses including an improved Graph Information Bottleneck (GIB) objective to recover causal subgraphs. By aligning structure selection across brain networks and filtering noisy features, BrainOOD offers reliable interpretations of critical brain regions. Our approach outperforms 16 existing methods and improves generalization to OOD subjects by up to 8.5%. Case studies highlight the scientific validity of the patterns extracted, which aligns with the findings in known neuroscience literature. We also propose the first OOD brain network benchmark, which provides a foundation for future research in this field. Our code is available at https://github.com/AngusMonroe/BrainOOD.
Chinese: BrainOOD框架通过解决多站点脑网络数据分布偏移和提升可解释性,增强了图神经网络的泛化能力,在分布外测试中性能提升高达8.5%,并能识别与神经疾病相关的关键脑区。
English: The BrainOOD framework enhances Graph Neural Networks' generalization and interpretability for brain network analysis by addressing distribution shifts and identifying key regions, achieving up to 8.5% improvement in out-of-distribution performance and providing neuroscience-valid insights.

Authors:Srinitish Srinivasan, Omkumar CU
Title: Predict, Cluster, Refine: A Joint Embedding Predictive Self-Supervised Framework for Graph Representation Learning
Abstract:
Graph representation learning has emerged as a cornerstone for tasks like node classification and link prediction, yet prevailing self-supervised learning (SSL) methods face challenges such as computational inefficiency, reliance on contrastive objectives, and representation collapse. Existing approaches often depend on feature reconstruction, negative sampling, or complex decoders, which introduce training overhead and hinder generalization. Further, current techniques which address such limitations fail to account for the contribution of node embeddings to a certain prediction in the absence of labeled nodes. To address these limitations, we propose a novel joint embedding predictive framework for graph SSL that eliminates contrastive objectives and negative sampling while preserving semantic and structural information. Additionally, we introduce a semantic-aware objective term that incorporates pseudo-labels derived from Gaussian Mixture Models (GMMs), enhancing node discriminability by evaluating latent feature contributions. Extensive experiments demonstrate that our framework outperforms state-of-the-art graph SSL methods across benchmarks, achieving superior performance without contrastive loss or complex decoders. Key innovations include (1) a non-contrastive, view-invariant joint embedding predictive architecture, (2) Leveraging single context and multiple targets relationship between subgraphs, and (3) GMM-based pseudo-label scoring to capture semantic contributions. This work advances graph SSL by offering a computationally efficient, collapse-resistant paradigm that bridges spatial and semantic graph features for downstream tasks. The code for our paper can be found at https://github.com/Deceptrax123/JPEB-GSSL
中文摘要:本文提出了一种新颖的图自监督学习联合嵌入预测框架,该框架摒弃了对比目标和负采样,通过高斯混合模型生成的伪标签增强节点区分度,在多个基准测试中实现了最先进的性能。
English Summary: This paper introduces a novel joint embedding predictive framework for graph self-supervised learning that eliminates contrastive objectives and negative sampling while incorporating Gaussian Mixture Model-based pseudo-labels to enhance node discriminability, achieving state-of-the-art performance across benchmarks.

Authors:Ziyang Zheng, Shan Huang, Jianyuan Zhong, Zhengyuan Shi, Guohao Dai, Ningyi Xu, Qiang Xu
Title: DeepGate4: Efficient and Effective Representation Learning for Circuit Design at Scale
Abstract:
Circuit representation learning has become pivotal in electronic design automation, enabling critical tasks such as testability analysis, logic reasoning, power estimation, and SAT solving. However, existing models face significant challenges in scaling to large circuits due to limitations like over-squashing in graph neural networks and the quadratic complexity of transformer-based models. To address these issues, we introduce DeepGate4, a scalable and efficient graph transformer specifically designed for large-scale circuits. DeepGate4 incorporates several key innovations: (1) an update strategy tailored for circuit graphs, which reduce memory complexity to sub-linear and is adaptable to any graph transformer; (2) a GAT-based sparse transformer with global and local structural encodings for AIGs; and (3) an inference acceleration CUDA kernel that fully exploit the unique sparsity patterns of AIGs. Our extensive experiments on the ITC99 and EPFL benchmarks show that DeepGate4 significantly surpasses state-of-the-art methods, achieving 15.5% and 31.1% performance improvements over the next-best models. Furthermore, the Fused-DeepGate4 variant reduces runtime by 35.1% and memory usage by 46.8%, making it highly efficient for large-scale circuit analysis. These results demonstrate the potential of DeepGate4 to handle complex EDA tasks while offering superior scalability and efficiency. Code is available at https://github.com/zyzheng17/DeepGate4-ICLR-25.
中文摘要:DeepGate4是一种可扩展的图变换器,通过创新策略降低内存复杂度并提升效率,解决了现有模型在大规模电路分析中的局限性,在基准测试中实现了显著的性能提升。
English Summary: DeepGate4 is a scalable graph transformer that addresses the limitations of existing models in large-scale circuit analysis by introducing innovative strategies to reduce memory complexity and enhance efficiency, achieving significant performance improvements on benchmark tests.

Authors:Bo Pang, Tingrui Qiao, Caroline Walker, Chris Cunningham, Yun Sing Koh
Title: LIBRA: Measuring Bias of Large Language Model from a Local Context
Abstract:
Large Language Models (LLMs) have significantly advanced natural language processing applications, yet their widespread use raises concerns regarding inherent biases that may reduce utility or harm for particular social groups. Despite the advancement in addressing LLM bias, existing research has two major limitations. First, existing LLM bias evaluation focuses on the U.S. cultural context, making it challenging to reveal stereotypical biases of LLMs toward other cultures, leading to unfair development and use of LLMs. Second, current bias evaluation often assumes models are familiar with the target social groups. When LLMs encounter words beyond their knowledge boundaries that are unfamiliar in their training data, they produce irrelevant results in the local context due to hallucinations and overconfidence, which are not necessarily indicative of inherent bias. This research addresses these limitations with a Local Integrated Bias Recognition and Assessment Framework (LIBRA) for measuring bias using datasets sourced from local corpora without crowdsourcing. Implementing this framework, we develop a dataset comprising over 360,000 test cases in the New Zealand context. Furthermore, we propose the Enhanced Idealized CAT Score (EiCAT), integrating the iCAT score with a beyond knowledge boundary score (bbs) and a distribution divergence-based bias measurement to tackle the challenge of LLMs encountering words beyond knowledge boundaries. Our results show that the BERT family, GPT-2, and Llama-3 models seldom understand local words in different contexts. While Llama-3 exhibits larger bias, it responds better to different cultural contexts. The code and dataset are available at: https://github.com/ipangbo/LIBRA.
中文摘要:本研究提出LIBRA框架,利用本地数据集评估大语言模型的文化偏见,通过增强理想化CAT分数解决现有评估方法的局限,涵盖文化多样性并处理模型对陌生词汇的响应问题。
English Summary: This research introduces the LIBRA framework to evaluate cultural biases in Large Language Models using local datasets, addressing limitations in current bias assessments by incorporating cultural diversity and handling unfamiliar terms through the Enhanced Idealized CAT Score.

Authors:Yihe Wang, Nan Huang, Nadia Mammone, Marco Cecchi, Xiang Zhang
Title: LEAD: Large Foundation Model for EEG-Based Alzheimer's Disease Detection
Abstract:
Electroencephalogram (EEG) provides a non-invasive, highly accessible, and cost-effective solution for Alzheimer's Disease (AD) detection. However, existing methods, whether based on manual feature extraction or deep learning, face two major challenges: the lack of large-scale datasets for robust feature learning and evaluation, and poor detection performance due to inter-subject variations. To address these challenges, we curate an EEG-AD corpus containing 813 subjects, which forms the world's largest EEG-AD dataset to the best of our knowledge. Using this unique dataset, we propose LEAD, the first large foundation model for EEG-based AD detection. Our method encompasses an entire pipeline, from data selection and preprocessing to self-supervised contrastive pretraining, fine-tuning, and key setups such as subject-independent evaluation and majority voting for subject-level detection. We pre-train the model on 11 EEG datasets and unified fine-tune it on 5 AD datasets. Our self-supervised pre-training design includes sample-level and subject-level contrasting to extract useful general EEG features. Fine-tuning is performed on 5 channel-aligned datasets together. The backbone encoder incorporates temporal and channel embeddings to capture features across both temporal and spatial dimensions. Our method demonstrates outstanding AD detection performance, achieving up to a 9.86% increase in F1 score at the sample-level and up to a 9.31% at the subject-level compared to state-of-the-art methods. The results of our model strongly confirm the effectiveness of contrastive pre-training and channel-aligned unified fine-tuning for addressing inter-subject variation. The source code is at https://github.com/DL4mHealth/LEAD.
中文摘要:本研究提出了首个用于脑电图阿尔茨海默病检测的大规模基础模型LEAD,通过构建全球最大的脑电数据集并采用创新的个体水平检测框架,在独立样本验证中实现了90.91%的敏感度,显著提升了检测性能。
English Summary: The study introduces LEAD, the first large-scale foundation model for EEG-based Alzheimer's disease detection, which overcomes previous limitations by using the world's largest EEG-AD dataset and a novel subject-level detection framework to achieve superior performance with 90.91% sensitivity.

Authors:Yihe Wang, Nan Huang, Nadia Mammone, Marco Cecchi, Xiang Zhang
Title: LEAD: Large Foundation Model for EEG-Based Alzheimer's Disease Detection
Abstract:
Electroencephalography (EEG) provides a non-invasive, highly accessible, and cost-effective approach for detecting Alzheimer's disease (AD). However, existing methods, whether based on handcrafted feature engineering or standard deep learning, face two major challenges: 1) the lack of large-scale EEG-AD datasets for robust representation learning, and 2) the absence of a dedicated deep learning pipeline for subject-level detection, which is more clinically meaningful than the commonly used sample-level detection. To address these gaps, we have curated the world's largest EEG-AD corpus to date, comprising 2,255 subjects. Leveraging this unique data corpus, we propose LEAD, the first large-scale foundation model for EEG analysis in dementia. Our approach provides an innovative framework for subject-level AD detection, including: 1) a comprehensive preprocessing pipeline such as artifact removal, resampling, and filtering, and a newly proposed multi-scale segmentation strategy, 2) a subject-regularized spatio-temporal transformer trained with a novel subject-level cross-entropy loss and an indices group-shuffling algorithm, and 3) AD-guided contrastive pre-training. We pre-train on 12 datasets (3 AD-related and 9 non-AD) and fine-tune/test on 4 AD datasets. Compared with 10 baselines, LEAD consistently obtains superior subject-level detection performance under the challenging subject-independent cross-validation protocol. On the benchmark ADFTD dataset, our model achieves an impressive subject-level Sensitivity of 90.91% under the leave-one-subject-out (LOSO) setting. These results strongly validate the effectiveness of our method for real-world EEG-based AD detection. Source code: https://github.com/DL4mHealth/LEAD
中文摘要:本研究提出了首个用于脑电图阿尔茨海默病检测的大规模基础模型LEAD,通过构建全球最大的脑电数据集并采用创新的个体水平检测框架,在独立样本验证中实现了90.91%的敏感度,显著提升了检测性能。
English Summary: The study introduces LEAD, the first large-scale foundation model for EEG-based Alzheimer's disease detection, which overcomes previous limitations by using the world's largest EEG-AD dataset and a novel subject-level detection framework to achieve superior performance with 90.91% sensitivity.

Authors:Jiale Fu, Yuchu Jiang, Junkai Chen, Jiaming Fan, Xin Geng, Xu Yang
Title: Fast Large Language Model Collaborative Decoding via Speculation
Abstract:
Large Language Model (LLM) collaborative decoding techniques improve output quality by combining the outputs of multiple models at each generation step, but they incur high computational costs. In this paper, we introduce Collaborative decoding via Speculation (CoS), a novel framework that accelerates collaborative decoding without compromising performance. Inspired by Speculative Decoding--where a small proposal model generates tokens sequentially, and a larger target model verifies them in parallel, our approach builds on two key insights: (1) the verification distribution can be the combined distribution of both the proposal and target models, and (2) alternating each model as the proposer and verifier can further enhance efficiency. We generalize this method to collaboration among n models and theoretically prove that CoS is never slower than standard collaborative decoding, typically achieving faster speed. Extensive experiments demonstrate CoS is 1.11x-2.23x faster than standard collaborative decoding without compromising generation quality. Our code is available at https://github.com/Kamichanw/CoS/.
Chinese: CoS是一种新颖的协作解码框架,通过推测验证和交替模型角色,在不牺牲性能的情况下将解码速度提升最高达2.23倍。
English: CoS is a novel framework that accelerates collaborative decoding by up to 2.23 times without sacrificing performance, using speculative verification and alternating model roles to enhance efficiency.

Authors:Varun Dhanraj, Chris Eliasmith
Title: Improving Rule-based Reasoning in LLMs using Neurosymbolic Representations
Abstract:
Large language models (LLMs) continue to face challenges in reliably solving reasoning tasks, particularly those that require precise rule following, as often found in mathematical reasoning. This paper introduces a novel neurosymbolic method that improves LLM reasoning by encoding hidden states into neurosymbolic vectors, enabling problem-solving within a neurosymbolic vector space. The results are decoded and merged with the original hidden state, significantly boosting the model's performance on numerical reasoning tasks. By offloading computation through neurosymbolic representations, this method enhances efficiency, reliability, and interpretability. Experimental results demonstrate an average of 88.6% lower cross-entropy loss and 15.4 times more problems correctly solved on a suite of mathematical reasoning tasks compared to chain-of-thought prompting and supervised fine-tuning (LoRA), without degrading performance on other tasks. We make our code available at: https://github.com/vdhanraj/Neurosymbolic-LLM.
中文: 本文提出一种神经符号方法,将大语言模型的隐藏状态编码为神经符号向量,从而在数学推理任务中显著降低交叉熵损失并提高解题正确率,相比现有方法性能大幅提升。
English: This paper presents a neurosymbolic method that encodes LLM hidden states into neurosymbolic vectors to enhance reasoning efficiency and accuracy, achieving significantly lower cross-entropy loss and higher problem-solving rates in mathematical tasks compared to existing methods.

Authors:Muhammad Zain Raza, Jiawei Xu, Terence Lim, Lily Boddy, Carlos M. Mery, Andrew Well, Ying Ding
Title: LLM-TA: An LLM-Enhanced Thematic Analysis Pipeline for Transcripts from Parents of Children with Congenital Heart Disease
Abstract:
Thematic Analysis (TA) is a fundamental method in healthcare research for analyzing transcript data, but it is resource-intensive and difficult to scale for large, complex datasets. This study investigates the potential of large language models (LLMs) to augment the inductive TA process in high-stakes healthcare settings. Focusing on interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we propose an LLM-Enhanced Thematic Analysis (LLM-TA) pipeline. Our pipeline integrates an affordable state-of-the-art LLM (GPT-4o mini), LangChain, and prompt engineering with chunking techniques to analyze nine detailed transcripts following the inductive TA framework. We evaluate the LLM-generated themes against human-generated results using thematic similarity metrics, LLM-assisted assessments, and expert reviews. Results demonstrate that our pipeline outperforms existing LLM-assisted TA methods significantly. While the pipeline alone has not yet reached human-level quality in inductive TA, it shows great potential to improve scalability, efficiency, and accuracy while reducing analyst workload when working collaboratively with domain experts. We provide practical recommendations for incorporating LLMs into high-stakes TA workflows and emphasize the importance of close collaboration with domain experts to address challenges related to real-world applicability and dataset complexity. https://github.com/jiaweixu98/LLM-TA
中文: 本研究提出了一种LLM增强主题分析(LLM-TA)流程,通过整合先进语言模型与专家知识显著提升了医疗研究的可扩展性和效率,尽管在归纳性主题分析中尚未完全达到人类专家的水平。
English: This study introduces an LLM-Enhanced Thematic Analysis (LLM-TA) pipeline that significantly improves scalability and efficiency in healthcare research by integrating advanced language models with human expertise, though it has not yet achieved full human-level quality in inductive thematic analysis.

Authors:Archiki Prasad, Elias Stengel-Eskin, Justin Chih-Yao Chen, Zaid Khan, Mohit Bansal
Title: Learning to Generate Unit Tests for Automated Debugging
Abstract:
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs), motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given a faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs based on task descriptions. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), we propose UTDebug that (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and backtracks edits based on multiple generated UTs to avoid overfitting, and helps LLMs debug effectively. We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, we find that feedback from UTGen's unit tests improves pass@1 accuracy of Qwen2.5 32B on HumanEvalFix and our own harder debugging split of MBPP+ by over 3.17% and 12.35% (respectively) over other LLM-based UT generation baselines. Moreover, we observe that feedback from Qwen2.5 32B-based UTGen model can enhance debugging with frontier LLMs like GPT-4o by 13.8%. Lastly, we demonstrate that UTGen is a better judge for code correctness, outperforming a state-of-the-art trained 8B reward model by 4.43% on HumanEval+ with best-of-10 sampling using Qwen2.5 7B.
中文:UTGen教导大语言模型生成能揭示代码错误的单元测试输入及正确预期输出,而UTDebug通过测试时计算和回溯验证来增强该过程,从而显著提升调试效果与代码纠错准确率。
English: UTGen enables LLMs to generate unit test inputs that reveal code errors and predict correct outputs, while UTDebug enhances this process through test-time computation and validation to improve debugging effectiveness and accuracy.

Authors:Yanbo Wang, Zixiang Xu, Yue Huang, Chujie Gao, Siyuan Wu, Jiayi Ye, Pin-Yu Chen, Xiuying Chen, Xiangliang Zhang
Title: Adaptive Distraction: Probing LLM Contextual Robustness with Automated Tree Search
Abstract:
Large Language Models (LLMs) often struggle to maintain their original performance when faced with semantically coherent but task-irrelevant contextual information. Although prior studies have explored this issue using fixed-template or retrieval-based distractions, such static methods show limited effectiveness against contemporary models. To address this problem, we propose a dynamic distraction generation framework based on tree search, where the generation process is guided by model behavior. Without modifying the original question or answer, the method efficiently produces challenging adaptive distractions across multiple datasets, enabling systematic stress testing of LLMs' contextual robustness. Experiments on four benchmarks demonstrate that the generated distractions lead to an average performance drop of over 45\% for mainstream models. Further comparisons of mitigation strategies show that prompt-based optimization methods yield limited gains, whereas post-training approaches (e.g., DPO) significantly enhance the model's contextual robustness. The results indicate that these issues do not stem from knowledge deficits in LLMs, but from a fundamental inability to maintain consistent reasoning under contextual distraction, posing a major challenge to the reliability of LLMs in real-world applications. The code is publicly available at https://github.com/wyf23187/Adaptive_Distractions.
中文摘要: 本文提出了一种基于树搜索的动态干扰生成框架,通过产生自适应干扰项系统性测试大语言模型,发现其在上下文干扰下会出现严重性能下降,这源于模型推理一致性的根本缺陷而非知识储备问题。
English Summary: This paper introduces a dynamic tree search-based framework that generates adaptive distractions to test and reveal large language models' significant performance drops under contextual interference, highlighting their fundamental reasoning consistency issues rather than knowledge gaps.

Authors:Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Khan, Salman Khan
Title: Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models
Abstract:
Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain vulnerable to visual adversarial perturbations that can induce hallucinations, manipulate responses, or bypass safety mechanisms. Existing methods seek to mitigate these risks by applying constrained adversarial fine-tuning to CLIP vision encoders on ImageNet-scale data, ensuring their generalization ability is preserved. However, this limited adversarial training restricts robustness and broader generalization. In this work, we explore an alternative approach of leveraging existing vision classification models that have been adversarially pre-trained on large-scale data. Our analysis reveals two principal contributions: (1) the extensive scale and diversity of adversarial pre-training enables these models to demonstrate superior robustness against diverse adversarial threats, ranging from imperceptible perturbations to advanced jailbreaking attempts, without requiring additional adversarial training, and (2) end-to-end MLLM integration with these robust models facilitates enhanced adaptation of language components to robust visual features, outperforming existing plug-and-play methodologies on complex reasoning tasks. Through systematic evaluation across visual question-answering, image captioning, and jail-break attacks, we demonstrate that MLLMs trained with these robust models achieve superior adversarial robustness while maintaining favorable clean performance. Our framework achieves 2x and 1.5x average robustness gains in captioning and VQA tasks, respectively, and delivers over 10% improvement against jailbreak attacks. Code and pretrained models will be available at https://github.com/HashmatShadab/Robust-LLaVA.
Chinese: 本研究通过整合经过对抗性预训练的视觉模型,为多模态大语言模型引入了一个鲁棒框架,显著提升了对抗各种威胁的鲁棒性,同时保持了良好的干净性能,在图像描述和视觉问答等任务中取得了显著成效。
English: This work introduces a robust framework for Multi-modal Large Language Models (MLLMs) by integrating adversarially pre-trained vision models, which significantly enhances adversarial robustness against various threats while maintaining clean performance, achieving notable gains in tasks like captioning and visual question-answering.

Authors:Mingyu Jin, Kai Mei, Wujiang Xu, Mingjie Sun, Ruixiang Tang, Mengnan Du, Zirui Liu, Yongfeng Zhang
Title: Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding
Abstract:
Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show that these concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K) while not having such patterns in values (V) in various modern transformer-based LLMs (Q, K, and V mean the representations output by the query, key, and value layers respectively). Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model's parameters. Our further investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, aligning with our analysis. Finally, we trace the emergence of concentrated massive values and find that such concentration is caused by Rotary Positional Encoding (RoPE), which has appeared since the first layers. These findings shed new light on how Q and K operate in LLMs and offer practical insights for model design and optimization. The Code is Available at https://github.com/MingyuJ666/Rope_with_LLM.
中文: 大语言模型中注意力查询和键因旋转位置编码出现集中大值,这些值对上下文知识解释至关重要,而非参数知识检索。
English: Large language models exhibit concentrated massive values in attention queries and keys due to Rotary Positional Encoding, which are crucial for contextual knowledge interpretation rather than parametric knowledge retrieval.

Authors:Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang
Title: VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
Abstract:
Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in enhancing Large Language Models (LLMs) through external knowledge integration, yet its application has primarily focused on textual content, leaving the rich domain of multi-modal video knowledge predominantly unexplored. This paper introduces VideoRAG, the first retrieval-augmented generation framework specifically designed for processing and understanding extremely long-context videos. Our core innovation lies in its dual-channel architecture that seamlessly integrates (i) graph-based textual knowledge grounding for capturing cross-video semantic relationships, and (ii) multi-modal context encoding for efficiently preserving visual features. This novel design empowers VideoRAG to process unlimited-length videos by constructing precise knowledge graphs that span multiple videos while maintaining semantic dependencies through specialized multi-modal retrieval paradigms. Through comprehensive empirical evaluation on our proposed LongerVideos benchmark-comprising over 160 videos totaling 134+ hours across lecture, documentary, and entertainment categories-VideoRAG demonstrates substantial performance compared to existing RAG alternatives and long video understanding methods. The source code of VideoRAG implementation and the benchmark dataset are openly available at: https://github.com/HKUDS/VideoRAG.
中文: 本文提出首个专为超长视频设计的检索增强生成框架VideoRAG,其采用双通道架构整合基于图结构的文本知识基础与多模态上下文编码,在包含160余个视频的LongerVideos基准测试中展现出卓越性能。
English: This paper presents VideoRAG, the first retrieval-augmented generation framework designed for processing unlimited-length videos through a dual-channel architecture that integrates graph-based textual knowledge grounding and multi-modal context encoding, demonstrating superior performance on the comprehensive LongerVideos benchmark.

Authors:Andrew Rouditchenko, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
Title: mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition
Abstract:
Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT). To enable better multi-modal integration and improve the noisy multilingual performance, we introduce decoder modality dropout where the model is trained both on paired audio-visual inputs and separate audio/visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
中文:提出的mWhisper-Flamingo模型结合预训练音频和视频模型,通过解码器模态丢弃技术实现多语言视听语音识别的最优性能,在九种语言的嘈杂环境中持续超越纯音频系统。
English: The proposed mWhisper-Flamingo model combines pre-trained audio and video models with decoder modality dropout to achieve state-of-the-art multilingual audio-visual speech recognition, consistently outperforming audio-only systems in noisy conditions across nine languages.

Authors:Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu
Title: Preference Leakage: A Contamination Problem in LLM-as-a-judge
Abstract:
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common relatednesses between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive and real-world problem that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all codes and data at: https://github.com/David-Li0406/Preference-Leakage.
中文: 本研究揭示了在LLM作为评判者的场景中,偏好泄露这一污染问题,即评估者对来自相关模型的合成数据表现出偏向性,表明这是模型开发中普遍存在且难以检测的难题。
English: This study identifies preference leakage as a contamination issue in LLM-as-a-judge scenarios, where evaluators show bias toward synthetically generated data from related models, revealing it as a pervasive and hard-to-detect problem in model development.

Authors:Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jeng-Neng Hwang, Lei Li
Title: Explaining Context Length Scaling and Bounds for Language Models
Abstract:
Long Context Language Models have drawn great attention in the past few years. There has been work discussing the impact of long context on Language Model performance: some find that long irrelevant context could harm performance, while some experimentally summarize loss reduction by relevant long context as Scaling Laws. This calls for a more thorough understanding on how long context impacts Language Modeling. In this work, we (1) propose a clean and effective theoretical framework for explaining the impact of context length on Language Modeling, from an Intrinsic Space perspective; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our theoretical framework can provide practical insights such as establishing that training dataset size dictates an optimal context length and bounds context length scaling for certain cases. We hope our work may inspire new long context Language Models, as well as future work studying Physics for Language Models. Code for our experiments is available at: https://github.com/JingzheShi/NLPCtlScalingAndBounds.
中文: 本研究从内在空间视角提出了一个理论框架,解释上下文长度对语言建模的影响,并通过自然语言和合成数据实验验证,发现训练数据集大小决定了最优上下文长度并设定了扩展界限。
English: This study introduces a theoretical framework from an Intrinsic Space perspective to explain how context length affects Language Modeling, validated through experiments on natural and synthetic data, revealing that training dataset size determines optimal context length and sets scaling bounds.

Authors:Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
Title: Process Reinforcement through Implicit Rewards
Abstract:
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implict process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phrase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competitional math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
Chinese: PRIME通过隐式过程奖励,仅利用策略推演和结果标签实现在线过程奖励模型更新,在数学和编程任务中无需专门奖励模型训练即取得显著性能提升。
English: PRIME introduces implicit process rewards to enable online updates of process reward models using only policy rollouts and outcome labels, achieving significant improvements in math and coding tasks without dedicated reward model training.

Authors:Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
Title: Process Reinforcement through Implicit Rewards
Abstract:
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implict process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phrase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competitional math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
Chinese: PRIME通过隐式过程奖励,仅利用策略推演和结果标签实现在线过程奖励模型更新,在数学和编程任务中无需专门奖励模型训练即取得显著性能提升。
English: PRIME introduces implicit process rewards to enable online updates of process reward models using only policy rollouts and outcome labels, achieving significant improvements in math and coding tasks without dedicated reward model training.

Authors:Quan Dao, Khanh Doan, Di Liu, Trung Le, Dimitris Metaxas
Title: Improved Training Technique for Latent Consistency Models
Abstract:
Consistency models are a new family of generative models capable of producing high-quality samples in either a single step or multiple steps. Recently, consistency models have demonstrated impressive performance, achieving results on par with diffusion models in the pixel space. However, the success of scaling consistency training to large-scale datasets, particularly for text-to-image and video generation tasks, is determined by performance in the latent space. In this work, we analyze the statistical differences between pixel and latent spaces, discovering that latent data often contains highly impulsive outliers, which significantly degrade the performance of iCT in the latent space. To address this, we replace Pseudo-Huber losses with Cauchy losses, effectively mitigating the impact of outliers. Additionally, we introduce a diffusion loss at early timesteps and employ optimal transport (OT) coupling to further enhance performance. Lastly, we introduce the adaptive scaling-$c$ scheduler to manage the robust training process and adopt Non-scaling LayerNorm in the architecture to better capture the statistics of the features and reduce outlier impact. With these strategies, we successfully train latent consistency models capable of high-quality sampling with one or two steps, significantly narrowing the performance gap between latent consistency and diffusion models. The implementation is released here: https://github.com/quandao10/sLCT/
中文摘要:本研究通过采用柯西损失、引入扩散损失和最优传输耦合,并配合自适应调度器与无缩放层归一化,有效解决了潜在空间中异常值导致的性能下降问题,成功训练出能进行高质量少步采样的潜在一致性模型,显著缩小了与扩散模型的性能差距。
English Summary: This study improves latent consistency models by addressing performance degradation from impulsive outliers in latent spaces through Cauchy losses, diffusion loss integration, optimal transport coupling, and architectural enhancements, enabling high-quality few-step sampling that narrows the gap with diffusion models.

Authors:Grigoriy Ksenofontov, Alexander Korotin
Title: Categorical Schrödinger Bridge Matching
Abstract:
The Schrödinger Bridge (SB) is a powerful framework for solving generative modeling tasks such as unpaired domain translation. Most SB-related research focuses on continuous data space $\mathbb{R}^{D}$ and leaves open theoretical and algorithmic questions about applying SB methods to discrete data, e.g, on finite spaces $\mathbb{S}^{D}$. Notable examples of such sets $\mathbb{S}$ are codebooks of vector-quantized (VQ) representations of modern autoencoders, tokens in texts, categories of atoms in molecules, etc. In this paper, we provide a theoretical and algorithmic foundation for solving SB in discrete spaces using the recently introduced Iterative Markovian Fitting (IMF) procedure. Specifically, we theoretically justify the convergence of discrete-time IMF (D-IMF) to SB in discrete spaces. This enables us to develop a practical computational algorithm for SB, which we call Categorical Schrödinger Bridge Matching (CSBM). We show the performance of CSBM via a series of experiments with synthetic data and VQ representations of images. The code of CSBM is available at https://github.com/gregkseno/csbm.
中文:通过引入迭代马尔可夫拟合程序,薛定谔桥框架被扩展到离散空间,由此开发出的分类薛定谔桥匹配算法在合成数据和图像数据上得到了验证。
English: The Schrödinger Bridge framework is extended to discrete spaces through the Iterative Markovian Fitting procedure, leading to the development of the Categorical Schrödinger Bridge Matching algorithm, which is validated on synthetic and image data.

Authors:Jue Gong, Jingkai Wang, Zheng Chen, Xing Liu, Hong Gu, Yulun Zhang, Xiaokang Yang
Title: Human Body Restoration with One-Step Diffusion Model and A New Benchmark
Abstract:
Human body restoration, as a specific application of image restoration, is widely applied in practice and plays a vital role across diverse fields. However, thorough research remains difficult, particularly due to the lack of benchmark datasets. In this study, we propose a high-quality dataset automated cropping and filtering (HQ-ACF) pipeline. This pipeline leverages existing object detection datasets and other unlabeled images to automatically crop and filter high-quality human images. Using this pipeline, we constructed a person-based restoration with sophisticated objects and natural activities (\emph{PERSONA}) dataset, which includes training, validation, and test sets. The dataset significantly surpasses other human-related datasets in both quality and content richness. Finally, we propose \emph{OSDHuman}, a novel one-step diffusion model for human body restoration. Specifically, we propose a high-fidelity image embedder (HFIE) as the prompt generator to better guide the model with low-quality human image information, effectively avoiding misleading prompts. Experimental results show that OSDHuman outperforms existing methods in both visual quality and quantitative metrics. The dataset and code will at https://github.com/gobunu/OSDHuman.
中文: 本研究提出了一种高质量自动数据集构建流程(HQ-ACF)和新型一步扩散模型(OSDHuman),通过创建PERSONA数据集并采用高保真图像嵌入器避免误导提示,在人体修复任务中取得了优于现有方法的成果。
English: This study introduces a high-quality automated dataset creation pipeline (HQ-ACF) and a novel one-step diffusion model (OSDHuman) for human body restoration, achieving superior results through a new PERSONA dataset and avoiding misleading prompts with a high-fidelity image embedder.

Authors:Zhiteng Li, Mingyuan Xia, Jingyuan Zhang, Zheng Hui, Haotong Qin, Linghe Kong, Yulun Zhang, Xiaokang Yang
Title: AdaSVD: Adaptive Singular Value Decomposition for Large Language Models
Abstract:
Large language models (LLMs) have achieved remarkable success in natural language processing (NLP) tasks, yet their substantial memory requirements present significant challenges for deployment on resource-constrained devices. Singular Value Decomposition (SVD) has emerged as a promising compression technique for LLMs, offering considerable reductions in memory overhead. However, existing SVD-based methods often struggle to effectively mitigate the errors introduced by SVD truncation, leading to a noticeable performance gap when compared to the original models. Furthermore, applying a uniform compression ratio across all transformer layers fails to account for the varying importance of different layers. To address these challenges, we propose AdaSVD, an adaptive SVD-based LLM compression approach. Specifically, AdaSVD introduces adaComp, which adaptively compensates for SVD truncation errors by alternately updating the singular matrices $\mathcal{U}$ and $\mathcal{V}^\top$. Additionally, AdaSVD introduces adaCR, which adaptively assigns layer-specific compression ratios based on the relative importance of each layer. Extensive experiments across multiple LLM/VLM families and evaluation metrics demonstrate that AdaSVD consistently outperforms state-of-the-art (SOTA) SVD-based methods, achieving superior performance with significantly reduced memory requirements. Code and models of AdaSVD will be available at https://github.com/ZHITENGLI/AdaSVD.
Chinese: AdaSVD是一种自适应大语言模型压缩方法,通过误差补偿和分层压缩比分配,在显著降低内存占用的同时超越了现有SVD技术的性能表现。
English: AdaSVD is an adaptive LLM compression method that uses error compensation and layer-specific compression ratios to outperform existing SVD techniques while significantly reducing memory usage.

Authors:Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, James Bailey
Title: Detecting Backdoor Samples in Contrastive Language Image Pretraining
Abstract:
Contrastive language-image pretraining (CLIP) has been found to be vulnerable to poisoning backdoor attacks where the adversary can achieve an almost perfect attack success rate on CLIP models by poisoning only 0.01\% of the training dataset. This raises security concerns on the current practice of pretraining large-scale models on unscrutinized web data using CLIP. In this work, we analyze the representations of backdoor-poisoned samples learned by CLIP models and find that they exhibit unique characteristics in their local subspace, i.e., their local neighborhoods are far more sparse than that of clean samples. Based on this finding, we conduct a systematic study on detecting CLIP backdoor attacks and show that these attacks can be easily and efficiently detected by traditional density ratio-based local outlier detectors, whereas existing backdoor sample detection methods fail. Our experiments also reveal that an unintentional backdoor already exists in the original CC3M dataset and has been trained into a popular open-source model released by OpenCLIP. Based on our detector, one can clean up a million-scale web dataset (e.g., CC3M) efficiently within 15 minutes using 4 Nvidia A100 GPUs. The code is publicly available in our \href{https://github.com/HanxunH/Detect-CLIP-Backdoor-Samples}{GitHub repository}.
中文:CLIP模型极易因微量数据投毒遭受后门攻击,但基于其稀疏局部特征的传统离群点检测器能高效识别恶意样本,实现大规模数据集的快速净化。
English: CLIP models are highly susceptible to backdoor attacks through minimal data poisoning, but these malicious samples can be efficiently detected using local outlier detectors based on their sparse local neighborhoods, enabling rapid cleanup of large datasets.

Authors:Oussama Zekri, Nicolas Boullé
Title: Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods
Abstract:
Discrete diffusion models have recently gained significant attention due to their ability to process complex discrete structures for language modeling. However, fine-tuning these models with policy gradient methods, as is commonly done in Reinforcement Learning from Human Feedback (RLHF), remains a challenging task. We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm, called Score Entropy Policy Optimization (SEPO), for fine-tuning discrete diffusion models over non-differentiable rewards. Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method. Our code is available at https://github.com/ozekri/SEPO.
Chinese: 本文提出了评分熵策略优化(SEPO),一种高效且理论可靠的策略梯度算法,用于针对不可微分奖励微调离散扩散模型,并在多个生成任务中验证了其有效性。
English: This paper introduces Score Entropy Policy Optimization (SEPO), an efficient and theoretically grounded policy gradient algorithm for fine-tuning discrete diffusion models with non-differentiable rewards, demonstrating its effectiveness across various generative tasks.

Authors:Shaofeng Yin, Jialong Wu, Siqiao Huang, Xingjian Su, Xu He, Jianye Hao, Mingsheng Long
Title: Trajectory World Models for Heterogeneous Environments
Abstract:
Heterogeneity in sensors and actuators across environments poses a significant challenge to building large-scale pre-trained world models on top of this low-dimensional sensor information. In this work, we explore pre-training world models for heterogeneous environments by addressing key transfer barriers in both data diversity and model flexibility. We introduce UniTraj, a unified dataset comprising over one million trajectories from 80 environments, designed to scale data while preserving critical diversity. Additionally, we propose TrajWorld, a novel architecture capable of flexibly handling varying sensor and actuator information and capturing environment dynamics in-context. Pre-training TrajWorld on UniTraj yields substantial gains in transition prediction, achieves a new state-of-the-art for off-policy evaluation, and also delivers superior online performance of model predictive control. To the best of our knowledge, this work, for the first time, demonstrates the transfer benefits of world models across heterogeneous and complex control environments. Code and data are available at https://github.com/thuml/TrajWorld.
中文: 本研究提出包含80个环境中超百万轨迹的统一数据集UniTraj和灵活架构TrajWorld,通过克服数据多样性与模型灵活性障碍,首次实现了异构环境下世界模型的预训练迁移,在预测与控制任务中达到最优性能。
English: This study introduces UniTraj, a unified dataset with over one million trajectories, and TrajWorld, a flexible architecture that overcomes data and model barriers to enable world model pre-training across heterogeneous environments, achieving state-of-the-art performance in prediction and control tasks.

Authors:Nikita Gushchin, David Li, Daniil Selikhanovych, Evgeny Burnaev, Dmitry Baranchuk, Alexander Korotin
Title: Inverse Bridge Matching Distillation
Abstract:
Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from the problem of slow inference. To address it, we propose a novel distillation technique based on the inverse bridge matching formulation and derive the tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional types of DBMs, distill models in a one-step generator, and use only the corrupted images for training. We evaluate our approach for both conditional and unconditional types of bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique allows us to accelerate the inference of DBMs from 4x to 100x and even provide better generation quality than used teacher model depending on particular setup. We provide the code at https://github.com/ngushchin/IBMD
Chinese Summary: 作者提出了一种基于逆向桥匹配的新型蒸馏技术,可将扩散桥模型的推理速度提升4至100倍,并在多种图像处理任务中保持或超越原始模型的生成质量。
English Summary: The authors introduce a novel distillation technique using inverse bridge matching to significantly accelerate diffusion bridge models' inference by 4x to 100x while maintaining or improving generation quality across various image tasks.

Authors:Xiao Lin, Yun Peng, Liuyi Wang, Xianyou Zhong, Minghao Zhu, Jingwei Yang, Yi Feng, Chengju Liu, Qijun Chen
Title: CleanPose: Category-Level Object Pose Estimation via Causal Learning and Knowledge Distillation
Abstract:
Category-level object pose estimation aims to recover the rotation, translation and size of unseen instances within predefined categories. In this task, deep neural network-based methods have demonstrated remarkable performance. However, previous studies show they suffer from spurious correlations raised by "unclean" confounders in models, hindering their performance on novel instances with significant variations. To address this issue, we propose CleanPose, a novel approach integrating causal learning and knowledge distillation to enhance category-level pose estimation. To mitigate the negative effect of unobserved confounders, we develop a causal inference module based on front-door adjustment, which promotes unbiased estimation by reducing potential spurious correlations. Additionally, to further improve generalization ability, we devise a residual-based knowledge distillation method that has proven effective in providing comprehensive category information guidance. Extensive experiments across multiple benchmarks (REAL275, CAMERA25 and HouseCat6D) hightlight the superiority of proposed CleanPose over state-of-the-art methods. Code will be available at https://github.com/chrislin0621/CleanPose.
Chinese: CleanPose通过引入因果推理模块和基于残差的知识蒸馏方法,有效减少类别级物体姿态估计中的虚假相关性并提升泛化能力,在多个基准测试中展现出卓越性能。
English: CleanPose introduces a causal inference module and residual-based knowledge distillation to mitigate spurious correlations and enhance generalization in category-level object pose estimation, demonstrating superior performance across multiple benchmarks.

Authors:Nimisha Ghosh, Pratik Dutta, Daniele Santoni
Title: TFBS-Finder: Deep Learning-based Model with DNABERT and Convolutional Networks to Predict Transcription Factor Binding Sites
Abstract:
Transcription factors are proteins that regulate the expression of genes by binding to specific genomic regions known as Transcription Factor Binding Sites (TFBSs), typically located in the promoter regions of those genes. Accurate prediction of these binding sites is essential for understanding the complex gene regulatory networks underlying various cellular functions. In this regard, many deep learning models have been developed for such prediction, but there is still scope of improvement. In this work, we have developed a deep learning model which uses pre-trained DNABERT, a Convolutional Neural Network (CNN) module, a Modified Convolutional Block Attention Module (MCBAM), a Multi-Scale Convolutions with Attention (MSCA) module and an output module. The pre-trained DNABERT is used for sequence embedding, thereby capturing the long-term dependencies in the DNA sequences while the CNN, MCBAM and MSCA modules are useful in extracting higher-order local features. TFBS-Finder is trained and tested on 165 ENCODE ChIP-seq datasets. We have also performed ablation studies as well as cross-cell line validations and comparisons with other models. The experimental results show the superiority of the proposed method in predicting TFBSs compared to the existing methodologies. The codes and the relevant datasets are publicly available at https://github.com/NimishaGhosh/TFBS-Finder/.
中文: 本研究提出了TFBS-Finder深度学习模型,它结合DNABERT、CNN和注意力机制来改进转录因子结合位点的预测,并通过全面测试验证了其优于现有方法的性能。
English: This study introduces TFBS-Finder, a deep learning model that integrates DNABERT, CNN, and attention mechanisms to enhance transcription factor binding site prediction, demonstrating superior performance over existing methods through comprehensive testing and validation.

Authors:Haiduo Huang, Tian Xia, Wenzhe zhao, Pengju Ren
Title: Partial Channel Network: Compute Fewer, Perform Better
Abstract:
Designing a module or mechanism that enables a network to maintain low parameters and FLOPs without sacrificing accuracy and throughput remains a challenge. To address this challenge and exploit the redundancy within feature map channels, we propose a new solution: partial channel mechanism (PCM). Specifically, through the split operation, the feature map channels are divided into different parts, with each part corresponding to different operations, such as convolution, attention, pooling, and identity mapping. Based on this assumption, we introduce a novel partial attention convolution (PATConv) that can efficiently combine convolution with visual attention. Our exploration indicates that the PATConv can completely replace both the regular convolution and the regular visual attention while reducing model parameters and FLOPs. Moreover, PATConv can derive three new types of blocks: Partial Channel-Attention block (PAT_ch), Partial Spatial-Attention block (PAT_sp), and Partial Self-Attention block (PAT_sf). In addition, we propose a novel dynamic partial convolution (DPConv) that can adaptively learn the proportion of split channels in different layers to achieve better trade-offs. Building on PATConv and DPConv, we propose a new hybrid network family, named PartialNet, which achieves superior top-1 accuracy and inference speed compared to some SOTA models on ImageNet-1K classification and excels in both detection and segmentation on the COCO dataset. Our code is available at https://github.com/haiduo/PartialNet.
中文摘要:提出的PartialNet框架通过引入部分通道机制和动态部分卷积,在保持精度的同时有效减少模型参数和计算量,在ImageNet-1K和COCO基准测试中实现了最优性能。
English Summary: The proposed PartialNet framework introduces partial channel mechanisms and dynamic partial convolution to efficiently reduce model parameters and FLOPs while maintaining accuracy, achieving state-of-the-art performance on ImageNet-1K and COCO benchmarks.

Authors:Thanh-Tung Nguyen, Lucas Liebe, Nhat-Quang Tau, Yuheng Wu, Jinghan Cheng, Dongman Lee
Title: OCTOPINF: Workload-Aware Inference Serving for Edge Video Analytics
Abstract:
Edge Video Analytics (EVA) has gained significant attention as a major application of pervasive computing, enabling real-time visual processing. EVA pipelines, composed of deep neural networks (DNNs), typically demand efficient inference serving under stringent latency requirements, which is challenging due to the dynamic Edge environments (e.g., workload variability and network instability). Moreover, EVA pipelines also face significant resource contention caused by resource (e.g., GPU) constraints at the Edge. In this paper, we introduce OCTOPINF, a novel resource-efficient and workload-aware inference serving system designed for real-time EVA. OCTOPINF tackles the unique challenges of dynamic edge environments through fine-grained resource allocation, adaptive batching, and workload balancing between edge devices and servers. Furthermore, we propose a spatiotemporal scheduling algorithm that optimizes the co-location of inference tasks on GPUs, improving performance and ensuring service-level objectives (SLOs) compliance. Extensive evaluations on a real-world testbed demonstrate the effectiveness of our approach. It achieves an effective throughput increase of up to 10x compared to the baselines and shows better robustness in challenging scenarios. OCTOPINF can be used for any DNN-based EVA inference task with minimal adaptation and is available at https://github.com/tungngreen/PipelineScheduler.
中文:OCTOPINF是一种面向实时边缘视频分析的高效资源与负载感知推理服务系统,通过细粒度资源分配和自适应批处理应对动态边缘环境挑战,实现了高达10倍的吞吐量提升和更优的鲁棒性。
English: OCTOPINF is a resource-efficient and workload-aware inference serving system for real-time Edge Video Analytics, addressing dynamic edge challenges through fine-grained resource allocation and adaptive batching to achieve up to 10x throughput improvement and robust performance.

Authors:Eun-Sol Park, MiSo Park, Seung Park, Yong-Goo Shin
Title: FSPGD: Rethinking Black-box Attacks on Semantic Segmentation
Abstract:
Transferability, the ability of adversarial examples crafted for one model to deceive other models, is crucial for black-box attacks. Despite advancements in attack methods for semantic segmentation, transferability remains limited, reducing their effectiveness in real-world applications. To address this, we introduce the Feature Similarity Projected Gradient Descent (FSPGD) attack, a novel black-box approach that enhances both attack performance and transferability. Unlike conventional segmentation attacks that rely on output predictions for gradient calculation, FSPGD computes gradients from intermediate layer features. Specifically, our method introduces a loss function that targets local information by comparing features between clean images and adversarial examples, while also disrupting contextual information by accounting for spatial relationships between objects. Experiments on Pascal VOC 2012 and Cityscapes datasets demonstrate that FSPGD achieves superior transferability and attack performance, establishing a new state-of-the-art benchmark. Code is available at https://github.com/KU-AIVS/FSPGD.
Chinese: FSPGD攻击通过从中间层特征计算梯度并引入针对局部和上下文信息的损失函数,显著提升了语义分割黑盒攻击中对抗样本的可迁移性。
English: The FSPGD attack enhances adversarial example transferability in black-box attacks on semantic segmentation by computing gradients from intermediate features and introducing a loss function that targets both local and contextual information.

Authors:Yuheng Li, Panpan Wang, Haipeng Chen
Title: Can Reinforcement Learning Solve Asymmetric Combinatorial-Continuous Zero-Sum Games?
Abstract:
There have been extensive studies on learning in zero-sum games, focusing on the analysis of the existence and algorithmic convergence of Nash equilibrium (NE). Existing studies mainly focus on symmetric games where the strategy spaces of the players are of the same type and size. For the few studies that do consider asymmetric games, they are mostly restricted to matrix games. In this paper, we define and study a new practical class of asymmetric games called two-player Asymmetric Combinatorial-Continuous zEro-Sum (ACCES) games, featuring a combinatorial action space for one player and an infinite compact space for the other. Such ACCES games have broad implications in the real world, particularly in combinatorial optimization problems (COPs) where one player optimizes a solution in a combinatorial space, and the opponent plays against it in an infinite (continuous) compact space (e.g., a nature player deciding epistemic parameters of the environmental model). Our first key contribution is to prove the existence of NE for two-player ACCES games, using the idea of essentially finite game approximation. Building on the theoretical insights and double oracle (DO)-based solutions to complex zero-sum games, our second contribution is to design the novel algorithm, Combinatorial Continuous DO (CCDO), to solve ACCES games, and prove the convergence of the proposed algorithm. Considering the NP-hardness of most COPs and recent advancements in reinforcement learning (RL)-based solutions to COPs, our third contribution is to propose a practical algorithm to solve NE in the real world, CCDORL (based on CCDO), and provide the novel convergence analysis in the ACCES game. Experimental results across diverse instances of COPs demonstrate the empirical effectiveness of our algorithms. The code of this work is available at https://github.com/wmd3i/CCDO-RL.
中文: 本文提出了一种新型非对称零和博弈ACCES,证明了纳什均衡的存在性,并开发了具有收敛保证的高效算法,适用于现实世界问题。
English: This paper introduces a new class of asymmetric zero-sum games called ACCES, proves the existence of Nash equilibrium, and develops efficient algorithms with convergence guarantees for real-world applications.

Authors:Ismail Khalfaoui-Hassani, Stefan Kesselheim
Title: Polynomial, trigonometric, and tropical activations
Abstract:
Which functions can be used as activations in deep neural networks? This article explores families of functions based on orthonormal bases, including the Hermite polynomial basis and the Fourier trigonometric basis, as well as a basis resulting from the tropicalization of a polynomial basis. Our study shows that, through simple variance-preserving initialization and without additional clamping mechanisms, these activations can successfully be used to train deep models, such as GPT-2 for next-token prediction on OpenWebText and ConvNeXt for image classification on ImageNet. Our work addresses the issue of exploding and vanishing activations and gradients, particularly prevalent with polynomial activations, and opens the door for improving the efficiency of large-scale learning tasks. Furthermore, our approach provides insight into the structure of neural networks, revealing that networks with polynomial activations can be interpreted as multivariate polynomial mappings. Finally, using Hermite interpolation, we show that our activations can closely approximate classical ones in pre-trained models by matching both the function and its derivative, making them especially useful for fine-tuning tasks. These activations are available in the torchortho library, which can be accessed via: https://github.com/K-H-Ismail/torchortho.
中文摘要:本文研究表明,基于正交基(如埃尔米特多项式和傅里叶基)的激活函数能有效用于深度神经网络训练,不仅解决了梯度爆炸/消失问题,还能通过插值逼近经典激活函数,特别适用于微调任务。
English Summary: This article demonstrates that orthonormal basis functions, such as Hermite polynomials and Fourier bases, can effectively serve as activations in deep neural networks, enabling stable training without special mechanisms while providing insights into network structure and approximation capabilities.

Authors:Yuanhe Zhang, Fanghui Liu, Yudong Chen
Title: LoRA-One: One-Step Full Gradient Could Suffice for Fine-Tuning Large Language Models, Provably and Efficiently
Abstract:
This paper explores how theory can guide and enhance practical algorithms, using Low-Rank Adaptation (LoRA, Hu et al. 2022) in large language models as a case study. We rigorously prove that, under gradient descent, LoRA adapters align with specific singular subspaces of the one-step full fine-tuning gradient. This result suggests that, by properly initializing the adapters using the one-step full gradient, subspace alignment can be achieved immediately and applicable to both linear and nonlinear models. Building on our theory, we propose a theory-driven algorithm, LoRA-One, where the linear convergence (as well as generalization) is built and incorporating preconditioners theoretically helps mitigate the effects of ill-conditioning. Besides, our theory reveals connections between LoRA-One and other gradient-alignment-based methods, helping to clarify misconceptions in the design of such algorithms. LoRA-One achieves significant empirical improvements over LoRA and its variants across benchmarks in natural language understanding, mathematical reasoning, and code generation. Code is available at: https://github.com/YuanheZ/LoRA-One.
中文: 本文通过理论证明LoRA适配器在微调过程中会与梯度特定子空间对齐,并提出理论驱动的LoRA-One算法,该算法在多个基准测试中实现了更快的收敛速度和更优的性能表现。
English: This paper demonstrates that LoRA adapters align with specific gradient subspaces during fine-tuning and introduces LoRA-One, a theory-driven algorithm that achieves faster convergence and better performance across various benchmarks.

Authors:Tongkun Liu, Bing Li, Xiao Jin, Yupeng Shi, Qiuying Li, Xiang Wei
Title: Exploring Few-Shot Defect Segmentation in General Industrial Scenarios with Metric Learning and Vision Foundation Models
Abstract:
Industrial defect segmentation is critical for manufacturing quality control. Due to the scarcity of training defect samples, few-shot semantic segmentation (FSS) holds significant value in this field. However, existing studies mostly apply FSS to tackle defects on simple textures, without considering more diverse scenarios. This paper aims to address this gap by exploring FSS in broader industrial products with various defect types. To this end, we contribute a new real-world dataset and reorganize some existing datasets to build a more comprehensive few-shot defect segmentation (FDS) benchmark. On this benchmark, we thoroughly investigate metric learning-based FSS methods, including those based on meta-learning and those based on Vision Foundation Models (VFMs). We observe that existing meta-learning-based methods are generally not well-suited for this task, while VFMs hold great potential. We further systematically study the applicability of various VFMs in this task, involving two paradigms: feature matching and the use of Segment Anything (SAM) models. We propose a novel efficient FDS method based on feature matching. Meanwhile, we find that SAM2 is particularly effective for addressing FDS through its video track mode. The contributed dataset and code will be available at: https://github.com/liutongkun/GFDS.
中文: 本文构建了全面的少样本缺陷分割基准,针对工业场景中的多样化缺陷检测问题,发现视觉基础模型潜力显著,并提出新型特征匹配方法,同时验证了SAM2模型在视频追踪模式下的卓越性能。
English: This paper introduces a comprehensive few-shot defect segmentation benchmark to address industrial defect detection across diverse scenarios, finding that Vision Foundation Models show strong potential while proposing a novel feature-matching method and highlighting SAM2's effectiveness.

Authors:Haiduo Huang, Zhenhua Liu, Tian Xia, Wenzhe zhao, Pengju Ren
Title: Nearly Lossless Adaptive Bit Switching
Abstract:
Model quantization is widely applied for compressing and accelerating deep neural networks (DNNs). However, conventional Quantization-Aware Training (QAT) focuses on training DNNs with uniform bit-width. The bit-width settings vary across different hardware and transmission demands, which induces considerable training and storage costs. Hence, the scheme of one-shot joint training multiple precisions is proposed to address this issue. Previous works either store a larger FP32 model to switch between different precision models for higher accuracy or store a smaller INT8 model but compromise accuracy due to using shared quantization parameters. In this paper, we introduce the Double Rounding quantization method, which fully utilizes the quantized representation range to accomplish nearly lossless bit-switching while reducing storage by using the highest integer precision instead of full precision. Furthermore, we observe a competitive interference among different precisions during one-shot joint training, primarily due to inconsistent gradients of quantization scales during backward propagation. To tackle this problem, we propose an Adaptive Learning Rate Scaling (ALRS) technique that dynamically adapts learning rates for various precisions to optimize the training process. Additionally, we extend our Double Rounding to one-shot mixed precision training and develop a Hessian-Aware Stochastic Bit-switching (HASB) strategy. Experimental results on the ImageNet-1K classification demonstrate that our methods have enough advantages to state-of-the-art one-shot joint QAT in both multi-precision and mixed-precision. We also validate the feasibility of our method on detection and segmentation tasks, as well as on LLMs task. Our codes are available at https://github.com/haiduo/Double-Rounding.
中文摘要:本文提出双舍入量化方法和自适应学习率调整技术,通过一次性联合训练实现多精度模型的高效转换,在减少存储的同时保持高精度,并在多项视觉任务中验证了其优越性。
English Summary: The paper introduces Double Rounding quantization and Adaptive Learning Rate Scaling to enable efficient one-shot joint training of multiple precision models, achieving nearly lossless bit-switching while reducing storage and maintaining high accuracy across various tasks.

Authors:Anam Zahid, Abdur Rehman Ali, Shaina Raza, Rai Shahnawaz, Faisal Kamiran, Asim Karim
Title: FairUDT: Fairness-aware Uplift Decision Trees
Abstract:
Training data used for developing machine learning classifiers can exhibit biases against specific protected attributes. Such biases typically originate from historical discrimination or certain underlying patterns that disproportionately under-represent minority groups, such as those identified by their gender, religion, or race. In this paper, we propose a novel approach, FairUDT, a fairness-aware Uplift-based Decision Tree for discrimination identification. FairUDT demonstrates how the integration of uplift modeling with decision trees can be adapted to include fair splitting criteria. Additionally, we introduce a modified leaf relabeling approach for removing discrimination. We divide our dataset into favored and deprived groups based on a binary sensitive attribute, with the favored dataset serving as the treatment group and the deprived dataset as the control group. By applying FairUDT and our leaf relabeling approach to preprocess three benchmark datasets, we achieve an acceptable accuracy-discrimination tradeoff. We also show that FairUDT is inherently interpretable and can be utilized in discrimination detection tasks. The code for this project is available https://github.com/ara-25/FairUDT
中文摘要:本文提出FairUDT方法,通过将提升建模与公平分割标准相结合,并采用改进的叶节点重标记技术,在保持可接受准确率的同时有效识别和消除机器学习分类器中的歧视问题。
English Summary: This paper introduces FairUDT, a fairness-aware uplift-based decision tree that integrates uplift modeling with fair splitting criteria and a modified leaf relabeling approach to mitigate discrimination in machine learning classifiers while maintaining acceptable accuracy.

Authors:Qianyu Guo, Jingrong Wu, Tianxing Wu, Haofen Wang, Weifeng Ge, Wenqiang Zhang
Title: Enhancing Environmental Robustness in Few-shot Learning via Conditional Representation Learning
Abstract:
Few-shot learning (FSL) has recently been extensively utilized to overcome the scarcity of training data in domain-specific visual recognition. In real-world scenarios, environmental factors such as complex backgrounds, varying lighting conditions, long-distance shooting, and moving targets often cause test images to exhibit numerous incomplete targets or noise disruptions. However, current research on evaluation datasets and methodologies has largely ignored the concept of "environmental robustness", which refers to maintaining consistent performance in complex and diverse physical environments. This neglect has led to a notable decline in the performance of FSL models during practical testing compared to their training performance. To bridge this gap, we introduce a new real-world multi-domain few-shot learning (RD-FSL) benchmark, which includes four domains and six evaluation datasets. The test images in this benchmark feature various challenging elements, such as camouflaged objects, small targets, and blurriness. Our evaluation experiments reveal that existing methods struggle to utilize training images effectively to generate accurate feature representations for challenging test images. To address this problem, we propose a novel conditional representation learning network (CRLNet) that integrates the interactions between training and testing images as conditional information in their respective representation processes. The main goal is to reduce intra-class variance or enhance inter-class variance at the feature representation level. Finally, comparative experiments reveal that CRLNet surpasses the current state-of-the-art methods, achieving performance improvements ranging from 6.83% to 16.98% across diverse settings and backbones. The source code and dataset are available at https://github.com/guoqianyu-alberta/Conditional-Representation-Learning.
中文: 针对小样本学习在现实环境中因缺乏环境鲁棒性而表现不佳的问题,我们提出了新的多领域基准和条件表示学习网络(CRLNet),该网络在不同设置下显著提升了模型性能。
English: Few-shot learning models often underperform in real-world environments due to unaddressed robustness issues, prompting the introduction of a new benchmark and a conditional representation learning network (CRLNet) that significantly enhances performance across diverse settings.

Authors:Wen Lai, Alexander Fraser, Ivan Titov
Title: Joint Localization and Activation Editing for Low-Resource Fine-Tuning
Abstract:
Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, are commonly used to adapt LLMs. However, the effectiveness of standard PEFT methods is limited in low-resource scenarios with only a few hundred examples. Recent advances in interpretability research have inspired the emergence of activation editing (or steering) techniques, which modify the activations of specific model components. Due to their extremely small parameter counts, these methods show promise for small datasets. However, their performance is highly dependent on identifying the correct modules to edit and often lacks stability across different datasets. In this paper, we propose Joint Localization and Activation Editing (JoLA), a method that jointly learns (1) which heads in the Transformer to edit (2) whether the intervention should be additive, multiplicative, or both and (3) the intervention parameters themselves - the vectors applied as additive offsets or multiplicative scalings to the head output. Through evaluations on three benchmarks spanning commonsense reasoning, natural language understanding, and natural language generation, we demonstrate that JoLA consistently outperforms existing methods. The code for the method is released at https://github.com/wenlai-lavine/jola.
Chinese: 提出的JoLA方法联合学习需要编辑的Transformer头部、干预类型(加法/乘法)及干预参数,在有限训练数据下,于多项基准测试中持续超越现有方法。
English: The proposed JoLA method jointly learns which Transformer heads to edit, the type of intervention (additive/multiplicative), and the intervention parameters, consistently outperforming existing methods across multiple benchmarks despite limited training data.

Authors:Erpai Luo, Xinran Wei, Lin Huang, Yunyang Li, Han Yang, Zaishuo Xia, Zun Wang, Chang Liu, Bin Shao, Jia Zhang
Title: Efficient and Scalable Density Functional Theory Hamiltonian Prediction through Adaptive Sparsity
Abstract:
Hamiltonian matrix prediction is pivotal in computational chemistry, serving as the foundation for determining a wide range of molecular properties. While SE(3) equivariant graph neural networks have achieved remarkable success in this domain, their substantial computational cost--driven by high-order tensor product (TP) operations--restricts their scalability to large molecular systems with extensive basis sets. To address this challenge, we introduce SPHNet, an efficient and scalable equivariant network, that incorporates adaptive SParsity into Hamiltonian prediction. SPHNet employs two innovative sparse gates to selectively constrain non-critical interaction combinations, significantly reducing tensor product computations while maintaining accuracy. To optimize the sparse representation, we develop a Three-phase Sparsity Scheduler, ensuring stable convergence and achieving high performance at sparsity rates of up to 70%. Extensive evaluations on QH9 and PubchemQH datasets demonstrate that SPHNet achieves state-of-the-art accuracy while providing up to a 7x speedup over existing models. Beyond Hamiltonian prediction, the proposed sparsification techniques also hold significant potential for improving the efficiency and scalability of other SE(3) equivariant networks, further broadening their applicability and impact. Our code can be found at https://github.com/microsoft/SPHNet.
中文摘要:SPHNet通过引入自适应稀疏性和三阶段调度器,在保持哈密顿矩阵预测精度的同时大幅降低计算成本,在基准数据集上实现了高达7倍的加速和最优性能。
English Summary: SPHNet introduces adaptive sparsity and a three-phase scheduler to significantly reduce computational costs while maintaining accuracy in Hamiltonian matrix prediction, achieving up to 7x speedup and state-of-the-art performance on benchmark datasets.

Authors:Chenyue Li, Wen Deng, Mengqian Lu, Binhang Yuan
Title: AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science
Abstract:
The rapid advancements in large language models (LLMs), particularly in their reasoning capabilities, hold transformative potential for addressing complex challenges in atmospheric science. However, leveraging LLMs effectively in this domain requires a robust and comprehensive evaluation benchmark. Toward this end, we present AtmosSci-Bench, a novel benchmark designed to systematically assess LLM performance across five core categories of atmospheric science problems: hydrology, atmospheric dynamics, atmospheric physics, geophysics, and physical oceanography. AtmosSci-Bench features a dual-format design comprising both multiple-choice questions (MCQs) and open-ended questions (OEQs), enabling scalable automated evaluation alongside deeper analysis of conceptual understanding. We employ a template-based MCQ generation framework to create diverse, graduate-level problems with symbolic perturbation, while OEQs are used to probe open-ended reasoning. We conduct a comprehensive evaluation of representative LLMs, categorized into four groups: instruction-tuned models, advanced reasoning models, math-augmented models, and domain-specific climate models. Our analysis provides some interesting insights into the reasoning and problem-solving capabilities of LLMs in atmospheric science. We believe AtmosSci-Bench can serve as a critical step toward advancing LLM applications in climate service by offering a standard and rigorous evaluation framework. Our source codes are currently available at Our source codes are currently available at https://github.com/Relaxed-System-Lab/AtmosSci-Bench.
中文: AtmosSci-Bench是一个新颖的基准测试,通过双格式问题系统评估大语言模型在五大核心大气科学领域的表现,为推进LLM在气候服务中的应用提供了关键评估框架。
English: AtmosSci-Bench is a novel benchmark designed to systematically evaluate large language models' performance across five core atmospheric science categories through dual-format questions, providing crucial insights for advancing LLM applications in climate services.

Authors:Chenyue Li, Wen Deng, Mengqian Lu, Binhang Yuan
Title: AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science
Abstract:
The rapid advancements in large language models (LLMs), particularly in their reasoning capabilities, hold transformative potential for addressing complex challenges and boosting scientific discovery in atmospheric science. However, leveraging LLMs effectively in this domain requires a robust and comprehensive evaluation benchmark. Toward this end, we present AtmosSci-Bench, a novel benchmark designed to systematically assess LLM performance across five core categories of atmospheric science problems: hydrology, atmospheric dynamics, atmospheric physics, geophysics, and physical oceanography. AtmosSci-Bench features a dual-format design comprising both multiple-choice questions (MCQs) and open-ended questions (OEQs), enabling scalable automated evaluation alongside deeper analysis of conceptual understanding. We employ a template-based MCQ generation framework to create diverse, graduate-level problems with symbolic perturbation, while OEQs are used to probe open-ended reasoning. We conduct a comprehensive evaluation of representative LLMs, categorized into four groups: instruction-tuned models, advanced reasoning models, math-augmented models, and domain-specific climate models. Our analysis provides some interesting insights into the reasoning and problem-solving capabilities of LLMs in atmospheric science. We believe AtmosSci-Bench can serve as a critical step toward advancing LLM applications in climate services by offering a standard and rigorous evaluation framework. Our source code is available at https://github.com/Relaxed-System-Lab/AtmosSci-Bench.
中文: AtmosSci-Bench是一个新颖的基准测试,通过双格式问题系统评估大语言模型在五大核心大气科学领域的表现,为推进LLM在气候服务中的应用提供了关键评估框架。
English: AtmosSci-Bench is a novel benchmark designed to systematically evaluate large language models' performance across five core atmospheric science categories through dual-format questions, providing crucial insights for advancing LLM applications in climate services.

Authors:Charilaos I. Kanatsoulis, Evelyn Choi, Stephanie Jegelka, Jure Leskovec, Alejandro Ribeiro
Title: Learning Efficient Positional Encodings with Graph Neural Networks
Abstract:
Positional encodings (PEs) are essential for effective graph representation learning because they provide position awareness in inherently position-agnostic transformer architectures and increase the expressive capacity of Graph Neural Networks (GNNs). However, designing powerful and efficient PEs for graphs poses significant challenges due to the absence of canonical node ordering and the scale of the graph. {In this work, we identify four key properties that graph PEs should satisfy}: stability, expressive power, scalability, and genericness. We find that existing eigenvector-based PE methods often fall short of jointly satisfying these criteria. To address this gap, we introduce PEARL, a novel framework of learnable PEs for graphs. Our primary insight is that message-passing GNNs function as nonlinear mappings of eigenvectors, enabling the design of GNN architectures for generating powerful and efficient PEs. A crucial challenge lies in initializing node attributes in a manner that is both expressive and permutation equivariant. We tackle this by initializing GNNs with random node inputs or standard basis vectors, thereby unlocking the expressive power of message-passing operations, while employing statistical pooling functions to maintain permutation equivariance. Our analysis demonstrates that PEARL approximates equivariant functions of eigenvectors with linear complexity, while rigorously establishing its stability and high expressive power. Experimental evaluations show that PEARL outperforms lightweight versions of eigenvector-based PEs and achieves comparable performance to full eigenvector-based PEs, but with one or two orders of magnitude lower complexity. Our code is available at https://github.com/ehejin/Pearl-PE.
中文: PEARL是一种创新的可学习位置编码框架,利用消息传递图神经网络生成高效强大的图位置编码,在保持与特征向量方法相当性能的同时显著降低了计算复杂度。
English: PEARL is a novel learnable positional encoding framework that leverages message-passing GNNs to generate powerful and efficient graph encodings, achieving comparable performance to eigenvector-based methods with significantly lower complexity.

Authors:Guanlin Li, Kangjie Chen, Shangwei Guo, Jie Zhang, Han Qiu, Chao Zhang, Guoyin Wang, Tianwei Zhang, Jiwei Li
Title: Picky LLMs and Unreliable RMs: An Empirical Study on Safety Alignment after Instruction Tuning
Abstract:
Large language models (LLMs) have emerged as powerful tools for addressing a wide range of general inquiries and tasks. Despite this, fine-tuning aligned LLMs on smaller, domain-specific datasets, critical to adapting them to specialized tasks, can inadvertently degrade their safety alignment, even when the datasets are benign. This phenomenon makes models more susceptible to providing inappropriate responses. In this study, we systematically examine the factors contributing to safety alignment degradation in benign fine-tuning scenarios. Our analysis identifies three critical factors affecting aligned LLMs: answer structure, identity calibration, and role-play. Additionally, we evaluate the reliability of state-of-the-art reward models (RMs), which are often used to guide alignment processes. Our findings reveal that these RMs frequently fail to accurately reflect human preferences regarding safety, underscoring their limitations in practical applications. By uncovering these challenges, our work highlights the complexities of maintaining safety alignment during fine-tuning and offers guidance to help developers balance utility and safety in LLMs. Datasets and fine-tuning code used in our experiments can be found in https://github.com/GuanlinLee/llm_instruction_tuning.
Chinese: 在领域特定数据集上微调对齐的大型语言模型会削弱其安全对齐性,研究发现三个关键因素及当前奖励模型在保障安全方面的局限性。
English: Fine-tuning aligned large language models on domain-specific datasets can compromise their safety alignment, revealing three key factors and limitations in current reward models for maintaining safety.

Authors:Vernon Y. H. Toh, Yew Ken Chia, Deepanway Ghosal, Soujanya Poria
Title: The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
Abstract:
The releases of OpenAI's o-[n] series, such as o1, o3, and o4-mini, mark a significant paradigm shift in Large Language Models towards advanced reasoning capabilities. Notably, models like o3 have demonstrated strong performance on benchmarks like the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI). However, this benchmark is limited to symbolic patterns, whereas humans often perceive and reason about multimodal scenarios involving both vision and language data. Thus, there is an urgent need to investigate advanced reasoning capabilities in multimodal tasks. To this end, we track the evolution of the GPT-[n] and o-[n] series models (including o1, o3, and o4-mini) on challenging multimodal puzzles from PuzzleVQA and AlgoPuzzleVQA, which demand fine-grained visual perception. Our results reveal that o-[n] series, particularly later iterations like o3 and o4-mini, significantly outperform the GPT-[n] series and show strong scalability in multimodal reasoning. Nonetheless, despite these substantial advancements and the superior capabilities demonstrated by the o-[n] series, our findings highlight that even these leading models face persistent challenges. Difficulties are particularly evident in tasks requiring precise visual perception, robust compositional reasoning across multiple visual attributes, and solving complex algorithmic or highly combinatorial puzzles, indicating critical areas for future AGI development. We plan to continuously track new models in the series and update our results in this paper accordingly. All resources used in this evaluation are openly available at https://github.com/declare-lab/LLM-PuzzleTest.
中文: OpenAI的o系列模型在多模态推理能力上显著优于GPT系列,但在需要精确视觉感知和复杂组合推理的任务中仍面临挑战。
English: OpenAI's o-series models demonstrate superior multimodal reasoning capabilities over GPT-series models but still face challenges in tasks requiring precise visual perception and complex compositional reasoning.

Authors:Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim
Title: FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
Abstract:
While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches primarily focused on reducing memory demands but were limited in enhancing latency. To address this issue, we introduce FastKV, a KV cache compression method designed to reduce latency for long-context inference. FastKV improves processing speed while preserving accuracy by adopting Token-Selective Propagation (TSP). This approach preserves full-context information in early layers of LLMs and selectively propagates only a portion of this information in later layers. This design enables FastKV to minimize redundant computation without sacrificing contextual fidelity. Our experimental results show that FastKV achieves up to 1.97$\times$ and 4.82$\times$ improvements in time-to-first-token (TTFT) and throughput, respectively, compared to baseline without KV cache compression. Moreover, FastKV successfully maintains accuracy within 1\% of the baseline on long-context benchmarks. Our code is available at https://github.com/dongwonjo/FastKV.
中文: FastKV是一种KV缓存压缩方法,通过令牌选择性传播技术,在保持长上下文基准测试准确率接近基线1%以内的同时,显著提升了首个令牌生成时间和系统吞吐量。
English: FastKV is a KV cache compression method that uses Token-Selective Propagation to significantly reduce latency and improve throughput in long-context LLM inference while maintaining accuracy within 1% of baseline performance.

Authors:Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, Chunhong Pan
Title: Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
Abstract:
Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically use Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images of different timesteps and require complex transformations into pixel space. In this work, we show that pre-trained diffusion models are naturally suited for step-level reward modeling in the noisy latent space, as they are explicitly designed to process latent images at various noise levels. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of the diffusion model to predict preferences of latent images at arbitrary timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a step-level preference optimization method conducted directly in the noisy latent space. Experimental results indicate that LPO significantly improves the model's alignment with general, aesthetic, and text-image alignment preferences, while achieving a 2.5-28x training speedup over existing preference optimization methods. Our code and models are available at https://github.com/Kwai-Kolors/LPO.
Chinese Summary: 本文提出潜在偏好优化(LPO)方法,通过将预训练扩散模型重新用作潜在空间的步级奖励模型,在显著提升训练速度的同时,更好地使生成图像符合人类审美和文本对齐等偏好。
English Summary: This paper introduces Latent Preference Optimization (LPO), a novel method that leverages pre-trained diffusion models as step-level reward models in latent space to better align generated images with human preferences while achieving significant training acceleration.

Authors:Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, Chunhong Pan
Title: Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
Abstract:
Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically use Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images of different timesteps and require complex transformations into pixel space. In this work, we show that pre-trained diffusion models are naturally suited for step-level reward modeling in the noisy latent space, as they are explicitly designed to process latent images at various noise levels. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of the diffusion model to predict preferences of latent images at arbitrary timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a step-level preference optimization method conducted directly in the noisy latent space. Experimental results indicate that LPO significantly improves the model's alignment with general, aesthetic, and text-image alignment preferences, while achieving a 2.5-28x training speedup over existing preference optimization methods. Our code and models are available at https://github.com/Kwai-Kolors/LPO.
Chinese Summary: 本文提出潜在偏好优化(LPO)方法,通过将预训练扩散模型重新用作潜在空间的步级奖励模型,在显著提升训练速度的同时,更好地使生成图像符合人类审美和文本对齐等偏好。
English Summary: This paper introduces Latent Preference Optimization (LPO), a novel method that leverages pre-trained diffusion models as step-level reward models in latent space to better align generated images with human preferences while achieving significant training acceleration.

Authors:Peixuan Han, Cheng Qian, Xiusi Chen, Yuji Zhang, Heng Ji, Denghui Zhang
Title: SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals
Abstract:
Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content. Existing safety mechanisms, while improving model safety, often lead to overly cautious behavior and fail to fully leverage LLMs' internal cognitive processes. Inspired by humans' reflective thinking capability, we first show that LLMs can similarly perform internal assessments about safety in their internal states. Building on this insight, we propose SafeSwitch, a dynamic framework that regulates unsafe outputs by utilizing the prober-based internal state monitor that actively detects harmful intentions, and activates a safety head that leads to safer and more conservative responses only when necessary. SafeSwitch reduces harmful outputs by approximately 80% on harmful queries while maintaining strong utility, reaching a Pareto optimal among several methods. Our method is also advantageous over traditional methods in offering more informative, context-aware refusals, and achieves these benefits while only tuning less than 6% of the original parameters. SafeSwitch demonstrates large language models' capacity for self-awareness and reflection regarding safety, offering a promising approach to more nuanced and effective safety controls. Codes for this work are available at https://github.com/Hanpx20/SafeSwitch.
中文摘要:SafeSwitch框架通过动态监控大语言模型的内部状态来检测有害意图,仅在必要时启动安全机制,仅调整不到6%的参数即可将有害输出减少约80%,同时保持模型实用性。
English Summary: The SafeSwitch framework dynamically monitors large language models' internal states to detect harmful intentions and activates safety mechanisms only when necessary, reducing harmful outputs by 80% while maintaining utility through minimal parameter adjustments.

Authors:Siqi Zeng, Yifei He, Weiqiu You, Yifan Hao, Yao-Hung Hubert Tsai, Makoto Yamada, Han Zhao
Title: Efficient Model Editing with Task Vector Bases: A Theoretical Framework and Scalable Approach
Abstract:
Task vectors, which are derived from the difference between pre-trained and fine-tuned model weights, enable flexible task adaptation and model merging through arithmetic operations such as addition and negation. However, existing approaches often rely on heuristics with limited theoretical support, often leading to performance gaps comparing to direct task fine tuning. Meanwhile, although it is easy to manipulate saved task vectors with arithmetic for different purposes, such compositional flexibility demands high memory usage, especially when dealing with a huge number of tasks, limiting scalability. This work addresses these issues with a theoretically grounded framework that explains task vector arithmetic and introduces the task vector bases framework. Building upon existing task arithmetic literature, our method significantly reduces the memory cost for downstream arithmetic with little effort, while achieving competitive performance and maintaining compositional advantage, providing a practical solution for large-scale task arithmetic. The code is available at https://github.com/uiuctml/TaskVectorBasis.
中文摘要:任务向量基是一种压缩框架,通过将多个任务向量表示为少量基向量的线性组合,在保持任务算术功能和性能的同时显著降低了存储与计算成本。
English Summary: Task Vector Bases is a compression framework that reduces storage and computation costs by representing multiple task vectors as linear combinations of fewer basis vectors while maintaining task arithmetic functionality and performance.

Authors:Siqi Zeng, Yifei He, Meitong Liu, Weiqiu You, Yifan Hao, Yao-Hung Hubert Tsai, Makoto Yamada, Han Zhao
Title: Task Vector Bases: A Unified and Scalable Framework for Compressed Task Arithmetic
Abstract:
Task arithmetic, representing downstream tasks through linear operations on task vectors, has emerged as a simple yet powerful paradigm for transferring knowledge across diverse settings. However, maintaining a large collection of task vectors introduces scalability challenges in both storage and computation. We propose Task Vector Bases, a framework compressing $T$ task vectors into $M < T$ basis vectors while preserving the functionality of task arithmetic. By representing each task vector as a structured linear combination of basis atoms, our approach supports standard operations such as addition, negation, as well as more advanced arithmetic ones. The framework is orthogonal to other efficiency-oriented improvements in task arithmetic and can be used in combination with them. We provide theoretical analysis showing that basis compression retains addition generalization guarantees and enables principled unlearning, with error bounds depending on reconstruction quality. Empirically, our proposed basis construction methods consistently outperform heuristic basis construction baselines and, in some cases, even surpass the performance of full task vector collections across diverse downstream applications while reducing storage and computational requirements. The code is available at https://github.com/uiuctml/TaskVectorBasis.
中文摘要:任务向量基是一种压缩框架,通过将多个任务向量表示为少量基向量的线性组合,在保持任务算术功能和性能的同时显著降低了存储与计算成本。
English Summary: Task Vector Bases is a compression framework that reduces storage and computation costs by representing multiple task vectors as linear combinations of fewer basis vectors while maintaining task arithmetic functionality and performance.

Authors:Wenfei Zhang, Ruipeng Zhao, Yongxiang Yao, Yi Wan, Peihao Wu, Jiayuan Li, Yansheng Li, Yongjun Zhang
Title: Multi-Resolution SAR and Optical Remote Sensing Image Registration Methods: A Review, Datasets, and Future Perspectives
Abstract:
Synthetic Aperture Radar (SAR) and optical image registration is essential for remote sensing data fusion, with applications in military reconnaissance, environmental monitoring, and disaster management. However, challenges arise from differences in imaging mechanisms, geometric distortions, and radiometric properties between SAR and optical images. As image resolution increases, fine SAR textures become more significant, leading to alignment issues and 3D spatial discrepancies. Two major gaps exist: the lack of a publicly available multi-resolution, multi-scene registration dataset and the absence of systematic analysis of current methods. To address this, the MultiResSAR dataset was created, containing over 10k pairs of multi-source, multi-resolution, and multi-scene SAR and optical images. Sixteen state-of-the-art algorithms were tested. Results show no algorithm achieves 100% success, and performance decreases as resolution increases, with most failing on sub-meter data. XoFTR performs best among deep learning methods (40.58%), while RIFT performs best among traditional methods (66.51%). Future research should focus on noise suppression, 3D geometric fusion, cross-view transformation modeling, and deep learning optimization for robust registration of high-resolution SAR and optical images. The dataset is available at https://github.com/betterlll/Multi-Resolution-SAR-dataset-.
中文摘要:MultiResSAR数据集填补了多分辨率SAR与光学图像配准公开数据的空白,测试表明尚无算法能完全成功配准,且分辨率越高性能越差,其中XoFTR和RIFT分别成为深度学习和传统方法中的最佳算法。
English Summary: The MultiResSAR dataset addresses the lack of public multi-resolution SAR-optical registration data, revealing that no tested algorithm achieves perfect success, with performance declining at higher resolutions and XoFTR and RIFT being the top deep learning and traditional methods respectively.

Authors:Jingyun Yang, Guoqing Zhang, Jingge Wang, Yang Li
Title: Adapting Foundation Models for Few-Shot Medical Image Segmentation: Actively and Sequentially
Abstract:
Recent advances in foundation models have brought promising results in computer vision, including medical image segmentation. Fine-tuning foundation models on specific low-resource medical tasks has become a standard practice. However, ensuring reliable and robust model adaptation when the target task has a large domain gap and few annotated samples remains a challenge. Previous few-shot domain adaptation (FSDA) methods seek to bridge the distribution gap between source and target domains by utilizing auxiliary data. The selection and scheduling of auxiliaries are often based on heuristics, which can easily cause negative transfer. In this work, we propose an Active and Sequential domain AdaPtation (ASAP) framework for dynamic auxiliary dataset selection in FSDA. We formulate FSDA as a multi-armed bandit problem and derive an efficient reward function to prioritize training on auxiliary datasets that align closely with the target task, through a single-round fine-tuning. Empirical validation on diverse medical segmentation datasets demonstrates that our method achieves favorable segmentation performance, significantly outperforming the state-of-the-art FSDA methods, achieving an average gain of 27.75% on MRI and 7.52% on CT datasets in Dice score. Code is available at the git repository: https://github.com/techicoco/ASAP.
中文:ASAP框架通过多臂老虎机方法动态选择辅助数据集,有效提升医学图像分割中的少样本域适应性能,相比现有方法取得显著改进。
English: The ASAP framework dynamically selects auxiliary datasets through a multi-armed bandit approach to enhance few-shot domain adaptation in medical image segmentation, achieving significant performance gains over existing methods.

Authors:Minh Ngoc Nguyen, Khai Le-Duc, Tan-Hanh Pham, Trang Nguyen, Quang Minh Luu, Ba Kien Tran, Truong-Son Hy, Viktor Dremin, Sergei Sokolovsky, Edik Rafailov
Title: A Wearable Device Dataset for Mental Health Assessment Using Laser Doppler Flowmetry and Fluorescence Spectroscopy Sensors
Abstract:
In this study, we introduce a novel method to predict mental health by building machine learning models for a non-invasive wearable device equipped with Laser Doppler Flowmetry (LDF) and Fluorescence Spectroscopy (FS) sensors. Besides, we present the corresponding dataset to predict mental health, e.g. depression, anxiety, and stress levels via the DAS-21 questionnaire. To our best knowledge, this is the world's largest and the most generalized dataset ever collected for both LDF and FS studies. The device captures cutaneous blood microcirculation parameters, and wavelet analysis of the LDF signal extracts key rhythmic oscillations. The dataset, collected from 132 volunteers aged 18-94 from 19 countries, explores relationships between physiological features, demographics, lifestyle habits, and health conditions. We employed a variety of machine learning methods to classify stress detection, in which LightGBM is identified as the most effective model for stress detection, achieving a ROC AUC of 0.7168 and a PR AUC of 0.8852. In addition, we also incorporated Explainable Artificial Intelligence (XAI) techniques into our analysis to investigate deeper insights into the model's predictions. Our results suggest that females, younger individuals and those with a higher Body Mass Index (BMI) or heart rate have a greater likelihood of experiencing mental health conditions like stress and anxiety. All related code and data are published online: https://github.com/leduckhai/Wearable_LDF-FS.
中文: 本研究通过结合激光多普勒血流仪和荧光光谱传感器的可穿戴设备,开发了预测心理健康的新方法,建立了全球最大相关数据集并确定LightGBM为最佳预测模型,同时利用可解释人工智能揭示了女性、年轻群体及高BMI人群更易出现心理问题。
English: This study presents a novel wearable device using LDF and FS sensors to predict mental health conditions, creating the world's largest dataset and identifying LightGBM as the most effective model with XAI insights revealing demographic risk factors.

Authors:Anuj Singh, Sayak Mukherjee, Ahmad Beirami, Hadi Jamali-Rad
Title: CoDe: Blockwise Control for Denoising Diffusion Models
Abstract:
Aligning diffusion models to downstream tasks often requires finetuning new models or gradient-based guidance at inference time to enable sampling from the reward-tilted posterior. In this work, we explore a simple inference-time gradient-free guidance approach, called controlled denoising (CoDe), that circumvents the need for differentiable guidance functions and model finetuning. CoDe is a blockwise sampling method applied during intermediate denoising steps, allowing for alignment with downstream rewards. Our experiments demonstrate that, despite its simplicity, CoDe offers a favorable trade-off between reward alignment, prompt instruction following, and inference cost, achieving a competitive performance against the state-of-the-art baselines. Our code is available at: https://github.com/anujinho/code.
Chinese: 提出的受控去噪(CoDe)方法通过在推理过程中采用无需梯度的分块采样策略,使扩散模型能够与下游奖励对齐,无需模型微调或可微分指导,同时保持有竞争力的性能。
English: The proposed controlled denoising (CoDe) method enables alignment of diffusion models with downstream rewards through a gradient-free, blockwise sampling approach during inference, eliminating the need for model finetuning or differentiable guidance while maintaining competitive performance.

Authors:Harshith Padigela, Chintan Shah, Dinkar Juyal
Title: ML-Dev-Bench: Comparative Analysis of AI Agents on ML development workflows
Abstract:
In this report, we present ML-Dev-Bench, a benchmark aimed at testing agentic capabilities on applied Machine Learning development tasks. While existing benchmarks focus on isolated coding tasks or Kaggle-style competitions, ML-Dev-Bench tests agents' ability to handle the full complexity of ML development workflows. The benchmark assesses performance across critical aspects including dataset handling, model training, improving existing models, debugging, and API integration with popular ML tools. We evaluate three agents - ReAct, Openhands, and AIDE - on a diverse set of 30 tasks, providing insights into their strengths and limitations in handling practical ML development challenges. We open source the benchmark for the benefit of the community at \href{https://github.com/ml-dev-bench/ml-dev-bench}{https://github.com/ml-dev-bench/ml-dev-bench}.
中文: ML-Dev-Bench 是一个新颖的基准测试,旨在评估智能体处理完整机器学习开发工作流程的能力,包括数据集管理、模型训练和调试,并在30项任务中测试了三个智能体且公开了该基准。
English: ML-Dev-Bench is a novel benchmark designed to evaluate agents' capabilities in handling comprehensive machine learning development workflows, including dataset management, model training, and debugging, with three agents tested on 30 tasks and the benchmark made publicly available.

Authors:Moritz Wolter, Lokesh Veeramacheneni, Charles Tapley Hoyt
Title: More Rigorous Software Engineering Would Improve Reproducibility in Machine Learning Research
Abstract:
While experimental reproduction remains a pillar of the scientific method, we observe that the software best practices supporting the reproduction of machine learning ( ML ) research are often undervalued or overlooked, leading both to poor reproducibility and damage to trust in the ML community. We quantify these concerns by surveying the usage of software best practices in software repositories associated with publications at major ML conferences and journals such as NeurIPS, ICML, ICLR, TMLR, and MLOSS within the last decade. We report the results of this survey that identify areas where software best practices are lacking and areas with potential for growth in the ML community. Finally, we discuss the implications and present concrete recommendations on how we, as a community, can improve reproducibility in ML research.
中文: 摘要指出,机器学习研究中软件最佳实践的缺失损害了研究的可复现性和社区信任,通过对主要会议和期刊代码库的调查结果,提出了促进社区改进的具体建议。
English: The abstract highlights that inadequate adoption of software best practices in machine learning research undermines reproducibility and trust, as evidenced by a survey of repositories from major ML conferences and journals, and it proposes recommendations for community-wide improvements.

Authors:Can Jin, Ying Li, Mingyu Zhao, Shiyu Zhao, Zhenting Wang, Xiaoxiao He, Ligong Han, Tong Che, Dimitris N. Metaxas
Title: LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation
Abstract:
Visual prompting has gained popularity as a method for adapting pre-trained models to specific tasks, particularly in the realm of parameter-efficient tuning. However, existing visual prompting techniques often pad the prompt parameters around the image, limiting the interaction between the visual prompts and the original image to a small set of patches while neglecting the inductive bias present in shared information across different patches. In this study, we conduct a thorough preliminary investigation to identify and address these limitations. We propose a novel visual prompt design, introducing Low-Rank matrix multiplication for Visual Prompting (LoR-VP), which enables shared and patch-specific information across rows and columns of image pixels. Extensive experiments across seven network architectures and four datasets demonstrate significant improvements in both performance and efficiency compared to state-of-the-art visual prompting methods, achieving up to 6 times faster training times, utilizing 18 times fewer visual prompt parameters, and delivering a 3.1% improvement in performance. The code is available as https://github.com/jincan333/LoR-VP.
Chinese: 本研究提出LoR-VP,一种采用低秩矩阵乘法的新型视觉提示方法,增强了提示与图像间的交互,在效率和性能上均显著优于现有技术。
English: This study introduces LoR-VP, a novel visual prompting method using low-rank matrix multiplication to enhance interaction between prompts and images, achieving superior efficiency and performance improvements over existing techniques.

Authors:Ehsaneddin Asgari, Yassine El Kheir, Mohammad Ali Sadraei Javaheri
Title: MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
Abstract:
Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme boundaries, leading to suboptimal segmentation, particularly in morphologically rich languages. We introduce MorphBPE, a morphology-aware extension of BPE that integrates linguistic structure into subword tokenization while preserving statistical efficiency. Additionally, we propose two morphology-based evaluation metrics: (i) Morphological Consistency F1-Score, which quantifies the consistency between morpheme sharing and token sharing, contributing to LLM training convergence, and (ii) Morphological Edit Distance, which measures alignment between morphemes and tokens concerning interpretability. Experiments on English, Russian, Hungarian, and Arabic across 300M and 1B parameter LLMs demonstrate that MorphBPE consistently reduces cross-entropy loss, accelerates convergence, and improves morphological alignment scores. Fully compatible with existing LLM pipelines, MorphBPE requires minimal modifications for integration. The MorphBPE codebase and tokenizer playground will be available at: https://github.com/llm-lab-org/MorphBPE and https://tokenizer.llm-lab.org
中文:MorphBPE是一种融合形态学结构的BPE分词改进方法,通过引入形态一致性评估指标,在提升大语言模型分词效果和训练收敛速度的同时保持与现有流程的完全兼容。
English: MorphBPE enhances BPE tokenization by incorporating morphological awareness, improving linguistic fidelity and model efficiency in large language models while introducing novel evaluation metrics for better segmentation and interpretability.

Authors:Teng Xiao, Yige Yuan, Zhengyu Chen, Mingxiao Li, Shangsong Liang, Zhaochun Ren, Vasant G Honavar
Title: SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters
Abstract:
Existing preference optimization objectives for language model alignment require additional hyperparameters that must be extensively tuned to achieve optimal performance, increasing both the complexity and time required for fine-tuning large language models. In this paper, we propose a simple yet effective hyperparameter-free preference optimization algorithm for alignment. We observe that promising performance can be achieved simply by optimizing inverse perplexity, which is calculated as the inverse of the exponentiated average log-likelihood of the chosen and rejected responses in the preference dataset. The resulting simple learning objective, SimPER, is easy to implement and eliminates the need for expensive hyperparameter tuning and a reference model, making it both computationally and memory efficient. Extensive experiments on widely used real-world benchmarks, including MT-Bench, AlpacaEval 2, and 10 key benchmarks of the Open LLM Leaderboard with 5 base models, demonstrate that SimPER consistently and significantly outperforms existing approaches-even without any hyperparameters or a reference model . For example, despite its simplicity, SimPER outperforms state-of-the-art methods by up to 5.7 points on AlpacaEval 2 and achieves the highest average ranking across 10 benchmarks on the Open LLM Leaderboard. The source code for SimPER is publicly available at: https://github.com/tengxiao1/SimPER.
Chinese: 提出的SimPER算法采用无需超参数调优的偏好优化方法,通过逆困惑度对齐语言模型,在多个基准测试中无需参考模型即实现卓越性能。
English: The proposed SimPER algorithm introduces a hyperparameter-free preference optimization method that uses inverse perplexity to align language models, achieving superior performance across multiple benchmarks without requiring costly tuning or reference models.

Authors:Alireza Morsali, MohammadJavad Vaez, Mohammadhossein Soltani, Amirhossein Kazerouni, Babak Taati, Morteza Mohammad-Noori
Title: STAF: Sinusoidal Trainable Activation Functions for Implicit Neural Representation
Abstract:
Implicit Neural Representations (INRs) have emerged as a powerful framework for modeling continuous signals. The spectral bias of ReLU-based networks is a well-established limitation, restricting their ability to capture fine-grained details in target signals. While previous works have attempted to mitigate this issue through frequency-based encodings or architectural modifications, these approaches often introduce additional complexity and do not fully address the underlying challenge of learning high-frequency components efficiently. We introduce Sinusoidal Trainable Activation Functions (STAF), designed to directly tackle this limitation by enabling networks to adaptively learn and represent complex signals with higher precision and efficiency. STAF inherently modulates its frequency components, allowing for self-adaptive spectral learning. This capability significantly improves convergence speed and expressivity, making STAF highly effective for both signal representations and inverse problems. Through extensive evaluations across a range of tasks, including signal representation (shape, image, audio) and inverse problems (super-resolution, denoising), as well as neural radiance fields (NeRF), we demonstrate that STAF consistently outperforms state-of-the-art methods in accuracy and reconstruction fidelity. These results establish STAF as a robust solution to spectral bias and the capacity--convergence tradeoff, with broad applicability in computer vision and graphics. Our codebase is publicly accessible at https://github.com/AlirezaMorsali/STAF.
中文: STAF通过引入正弦可训练激活函数,有效克服了ReLU网络的频谱偏差,实现了高频分量的自适应学习,在多种信号表示和逆问题应用中均展现出卓越的性能。
English: STAF introduces sinusoidal trainable activation functions to overcome the spectral bias of ReLU networks, enabling adaptive learning of high-frequency components and achieving superior performance in signal representation and inverse problems across various applications.

Authors:Yongqiang Huang, Zerui Shao, Ziyuan Yang, Zexin Lu, Yi Zhang
Title: FedRIR: Rethinking Information Representation in Federated Learning
Abstract:
Mobile and Web-of-Things (WoT) devices at the network edge generate vast amounts of data for machine learning applications, yet privacy concerns hinder centralized model training. Federated Learning (FL) allows clients (devices) to collaboratively train a shared model coordinated by a central server without transfer private data, but inherent statistical heterogeneity among clients presents challenges, often leading to a dilemma between clients' needs for personalized local models and the server's goal of building a generalized global model. Existing FL methods typically prioritize either global generalization or local personalization, resulting in a trade-off between these two objectives and limiting the full potential of diverse client data. To address this challenge, we propose a novel framework that simultaneously enhances global generalization and local personalization by Rethinking Information Representation in the Federated learning process (FedRIR). Specifically, we introduce Masked Client-Specific Learning (MCSL), which isolates and extracts fine-grained client-specific features tailored to each client's unique data characteristics, thereby enhancing personalization. Concurrently, the Information Distillation Module (IDM) refines the global shared features by filtering out redundant client-specific information, resulting in a purer and more robust global representation that enhances generalization. By integrating the refined global features with the isolated client-specific features, we construct enriched representations that effectively capture both global patterns and local nuances, thereby improving the performance of downstream tasks on the client. The code is available at https://github.com/Deep-Imaging-Group/FedRIR.
中文:联邦学习面临全局泛化与本地个性化的权衡,而提出的FedRIR框架通过掩码客户端特定学习隔离个性化特征,并结合信息蒸馏模块优化全局表示,从而同时提升了两方面的性能。
English: Federated Learning faces a trade-off between global generalization and local personalization, but the proposed FedRIR framework simultaneously enhances both by isolating client-specific features through Masked Client-Specific Learning and refining global representations via an Information Distillation Module.

Authors:Yong Liu, Guo Qin, Zhiyuan Shi, Zhi Chen, Caiyin Yang, Xiangdong Huang, Jianmin Wang, Mingsheng Long
Title: Sundial: A Family of Highly Capable Time Series Foundation Models
Abstract:
We introduce Sundial, a family of native, flexible, and scalable time series foundation models. To predict the next-patch's distribution, we propose a TimeFlow Loss based on flow-matching, which facilitates native pre-training of Transformers on continuous-valued time series without discrete tokenization. Conditioned on arbitrary-length time series, our models are pre-trained without specifying any prior distribution and can generate multiple probable predictions, achieving more flexibility in representation learning than using parametric densities. Towards time series foundation models, we leverage minimal but crucial adaptations of Transformers and curate TimeBench with one trillion time points, comprising mostly real-world datasets and synthetic data. By mitigating mode collapse via TimeFlow Loss, we pre-train a family of Sundial models on TimeBench, which achieve unprecedented model capacity and generalization performance. In addition to excellent scalability, Sundial achieves state-of-the-art results on both point and probabilistic forecasting benchmarks with a just-in-time inference speed, i.e., making zero-shot predictions within a few milliseconds. We believe that Sundial's pioneering generative forecasting capability can improve model reliability in real-world decision-making. Code is available at: https://github.com/thuml/Sundial.
中文: Sundial是一系列原生时间序列基础模型,采用创新的TimeFlow损失函数在连续数据上进行灵活、可扩展的预训练,以快速推理实现了顶尖的预测性能。
English: Sundial is a family of native time series foundation models that utilize a novel TimeFlow Loss for flexible, scalable pre-training on continuous data, achieving state-of-the-art forecasting performance with rapid inference.

Authors:Leng Cai, Junxuan He, Yikai Li, Junjie Liang, Yuanping Lin, Ziming Quan, Yawen Zeng, Jin Xu
Title: RTBAgent: A LLM-based Agent System for Real-Time Bidding
Abstract:
Real-Time Bidding (RTB) enables advertisers to place competitive bids on impression opportunities instantaneously, striving for cost-effectiveness in a highly competitive landscape. Although RTB has widely benefited from the utilization of technologies such as deep learning and reinforcement learning, the reliability of related methods often encounters challenges due to the discrepancies between online and offline environments and the rapid fluctuations of online bidding. To handle these challenges, RTBAgent is proposed as the first RTB agent system based on large language models (LLMs), which synchronizes real competitive advertising bidding environments and obtains bidding prices through an integrated decision-making process. Specifically, obtaining reasoning ability through LLMs, RTBAgent is further tailored to be more professional for RTB via involved auxiliary modules, i.e., click-through rate estimation model, expert strategy knowledge, and daily reflection. In addition, we propose a two-step decision-making process and multi-memory retrieval mechanism, which enables RTBAgent to review historical decisions and transaction records and subsequently make decisions more adaptive to market changes in real-time bidding. Empirical testing with real advertising datasets demonstrates that RTBAgent significantly enhances profitability. The RTBAgent code will be publicly accessible at: https://github.com/CaiLeng/RTBAgent.
中文摘要:RTBAgent首次提出基于大语言模型的实时竞价代理系统,通过整合决策流程与辅助模块应对环境差异和市场波动,显著提升了广告投放的盈利效益。
English Summary: RTBAgent introduces the first large language model-based system for real-time bidding, addressing environmental discrepancies and market volatility through integrated decision-making and auxiliary modules to significantly boost profitability.

Authors:J Rosser, Jakob Nicolaus Foerster
Title: AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds via Self-Improvement
Abstract:
Scaffolding Large Language Models (LLMs) into multi-agent systems often improves performance on complex tasks, but the safety impact of such scaffolds has not been thoroughly explored. We introduce AgentBreeder, a framework for multi-objective self-improving evolutionary search over scaffolds. We evaluate discovered scaffolds on widely recognized reasoning, mathematics, and safety benchmarks and compare them with popular baselines. In 'blue' mode, we see a 79.4% average uplift in safety benchmark performance while maintaining or improving capability scores. In 'red' mode, we find adversarially weak scaffolds emerging concurrently with capability optimization. Our work demonstrates the risks of multi-agent scaffolding and provides a framework for mitigating them. Code is available at https://github.com/J-Rosser-UK/AgentBreeder.
中文: 该研究提出了AgentBreeder框架,在保持性能的同时将多智能体LLM系统的安全性提升高达79.4%,同时也揭示了对抗性支架带来的风险。
English: The study introduces AgentBreeder, a framework that enhances safety in multi-agent LLM systems by up to 79.4% while maintaining performance, while also revealing risks from adversarial scaffolds.

Authors:Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Title: BEEM: Boosting Performance of Early Exit DNNs using Multi-Exit Classifiers as Experts
Abstract:
Early Exit (EE) techniques have emerged as a means to reduce inference latency in Deep Neural Networks (DNNs). The latency improvement and accuracy in these techniques crucially depend on the criteria used to make exit decisions. We propose a new decision criterion where exit classifiers are treated as experts BEEM and aggregate their confidence scores. The confidence scores are aggregated only if neighbouring experts are consistent in prediction as the samples pass through them, thus capturing their ensemble effect. A sample exits when the aggregated confidence value exceeds a threshold. The threshold is set using the error rates of the intermediate exits aiming to surpass the performance of conventional DNN inference. Experimental results on the COCO dataset for Image captioning and GLUE datasets for various language tasks demonstrate that our method enhances the performance of state-of-the-art EE methods, achieving improvements in speed-up by a factor 1.5x to 2.1x. When compared to the final layer, its accuracy is comparable in harder Image Captioning and improves in the easier language tasks. The source code for this work is publicly available at https://github.com/Div290/BEEM1/tree/main
中文: BEEM提出通过聚合相邻分类器在预测一致时的置信度作为早退新标准,在图像描述和语言任务中实现1.5-2.1倍加速,同时保持或提升模型精度。
English: BEEM introduces a novel early exit criterion that aggregates confidence scores from neighboring classifiers when predictions are consistent, achieving 1.5x-2.1x speed-up while maintaining comparable or improved accuracy across vision and language tasks.

Authors:Kosuke Sakurai, Ryotaro Shimizu, Masayuki Goto
Title: Vision and Language Reference Prompt into SAM for Few-shot Segmentation
Abstract:
Segment Anything Model (SAM) represents a large-scale segmentation model that enables powerful zero-shot capabilities with flexible prompts. While SAM can segment any object in zero-shot, it requires user-provided prompts for each target image and does not attach any label information to masks. Few-shot segmentation models addressed these issues by inputting annotated reference images as prompts to SAM and can segment specific objects in target images without user-provided prompts. Previous SAM-based few-shot segmentation models only use annotated reference images as prompts, resulting in limited accuracy due to a lack of reference information. In this paper, we propose a novel few-shot segmentation model, Vision and Language reference Prompt into SAM (VLP-SAM), that utilizes the visual information of the reference images and the semantic information of the text labels by inputting not only images but also language as reference information. In particular, VLP-SAM is a simple and scalable structure with minimal learnable parameters, which inputs prompt embeddings with vision-language information into SAM using a multimodal vision-language model. To demonstrate the effectiveness of VLP-SAM, we conducted experiments on the PASCAL-5i and COCO-20i datasets, and achieved high performance in the few-shot segmentation task, outperforming the previous state-of-the-art model by a large margin (6.3% and 9.5% in mIoU, respectively). Furthermore, VLP-SAM demonstrates its generality in unseen objects that are not included in the training data. Our code is available at https://github.com/kosukesakurai1/VLP-SAM.
中文: 本文提出VLP-SAM这一新型小样本分割模型,通过将视觉和语言提示信息共同输入到Segment Anything模型中,显著提升了分割精度,在基准数据集上大幅超越了现有最优方法。
English: The paper introduces VLP-SAM, a novel few-shot segmentation model that enhances accuracy by integrating both visual and language prompts into the Segment Anything Model, achieving state-of-the-art performance on benchmark datasets.

Authors:Yunuo Chen, Qian Li, Bing He, Donghui Feng, Ronghua Wu, Qi Wang, Li Song, Guo Lu, Wenjun Zhang
Title: S2CFormer: Revisiting the RD-Latency Trade-off in Transformer-based Learned Image Compression
Abstract:
Transformer-based Learned Image Compression (LIC) suffers from a suboptimal trade-off between decoding latency and rate-distortion (R-D) performance. Moreover, the critical role of the FeedForward Network (FFN)-based channel aggregation module has been largely overlooked. Our research reveals that efficient channel aggregation-rather than complex and time-consuming spatial operations-is the key to achieving competitive LIC models. Based on this insight, we initiate the ``S2CFormer'' paradigm, a general architecture that simplifies spatial operations and enhances channel operations to overcome the previous trade-off. We present two instances of the S2CFormer: S2C-Conv, and S2C-Attention. Both models demonstrate state-of-the-art (SOTA) R-D performance and significantly faster decoding speed. Furthermore, we introduce S2C-Hybrid, an enhanced variant that maximizes the strengths of different S2CFormer instances to achieve a better performance-latency trade-off. This model outperforms all the existing methods on the Kodak, Tecnick, and CLIC Professional Validation datasets, setting a new benchmark for efficient and high-performance LIC. The code is at \href{https://github.com/YunuoChen/S2CFormer}{https://github.com/YunuoChen/S2CFormer}.
Chinese: S2CFormer架构通过强化通道操作和简化空间处理,突破了学习型图像压缩中解码延迟与率失真性能的权衡,实现了最先进的压缩效果。
English: The S2CFormer architecture enhances channel operations and simplifies spatial processes to overcome the trade-off between decoding latency and rate-distortion performance, achieving state-of-the-art results in learned image compression.

Authors:Linglong Wu, Xuhao Shan, Ruiquan Ge, Ruoyu Liang, Chi Zhang, Yonghong Li, Ahmed Elazab, Huoling Luo, Yunbi Liu, Changmiao Wang
Title: TMI-CLNet: Triple-Modal Interaction Network for Chronic Liver Disease Prognosis From Imaging, Clinical, and Radiomic Data Fusion
Abstract:
Chronic liver disease represents a significant health challenge worldwide and accurate prognostic evaluations are essential for personalized treatment plans. Recent evidence suggests that integrating multimodal data, such as computed tomography imaging, radiomic features, and clinical information, can provide more comprehensive prognostic information. However, modalities have an inherent heterogeneity, and incorporating additional modalities may exacerbate the challenges of heterogeneous data fusion. Moreover, existing multimodal fusion methods often struggle to adapt to richer medical modalities, making it difficult to capture inter-modal relationships. To overcome these limitations, We present the Triple-Modal Interaction Chronic Liver Network (TMI-CLNet). Specifically, we develop an Intra-Modality Aggregation module and a Triple-Modal Cross-Attention Fusion module, which are designed to eliminate intra-modality redundancy and extract cross-modal information, respectively. Furthermore, we design a Triple-Modal Feature Fusion loss function to align feature representations across modalities. Extensive experiments on the liver prognosis dataset demonstrate that our approach significantly outperforms existing state-of-the-art unimodal models and other multi-modal techniques. Our code is available at https://github.com/Mysterwll/liver.git.
中文:三重模态交互慢性肝网络(TMI-CLNet)通过专门设计的模块有效整合多模态数据,显著提升了慢性肝病预后评估的准确性,优于现有技术。
English: The Triple-Modal Interaction Chronic Liver Network (TMI-CLNet) effectively integrates multimodal data through specialized modules to enhance prognostic accuracy for chronic liver disease, outperforming existing methods.

Authors:Hyeong Kyu Choi, Maxim Khanov, Hongxin Wei, Yixuan Li
Title: How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence
Abstract:
Dataset contamination, where evaluation datasets overlap with pre-training corpora, inflates performance metrics and undermines the reliability of model evaluations. Measuring dataset contamination thus becomes essential to ensure that performance evaluations genuinely reflect a model's ability to generalize to unseen data, rather than relying on memorized examples. To address this problem, we propose Kernel Divergence Score (KDS), a novel method that evaluates dataset contamination by computing the divergence between the kernel similarity matrix of sample embeddings, before and after fine-tuning on the benchmark dataset. Leveraging the insight that fine-tuning affects unseen samples more significantly than seen ones, KDS provides a reliable measure of contamination. Through extensive experiments on controlled contamination scenarios, KDS demonstrates a near-perfect correlation with contamination levels and outperforms existing baselines. Additionally, we perform comprehensive ablation studies to analyze the impact of key design choices, providing deeper insights into the components and effectiveness of KDS. These ablations highlight the importance of leveraging fine-grained kernel-based information and confirm the reliability of the proposed framework across diverse datasets and settings. Code is released in https://github.com/deeplearning-wisc/kernel-divergence-score.
中文: 提出的核散度评分(KDS)通过比较微调前后样本嵌入的核相似性矩阵,有效评估数据集污染问题,在受控实验中展现出与污染程度近乎完美的相关性,并优于现有基线方法。
English: The proposed Kernel Divergence Score (KDS) effectively measures dataset contamination by comparing kernel similarity matrices of sample embeddings before and after fine-tuning, demonstrating superior correlation with contamination levels and outperforming existing methods in controlled experiments.

Authors:Mingyu Chen, Yiding Chen, Wen Sun, Xuezhou Zhang
Title: Avoiding $\mathbf{exp(R_{max})}$ scaling in RLHF through Preference-based Exploration
Abstract:
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for large language model (LLM) alignment. This paper studies the setting of online RLHF and focus on improving sample efficiency. All existing algorithms in online RLHF, whether doing passive exploration or active exploration, suffer from a sample complexity that scales exponentially with the scale of the reward function. This fundamental limitation hinders their effectiveness in scenarios with heavily skewed preferences, e.g. questions with a unique correct solution. To address this, we introduce Self-Exploring Preference-Incentive Online Preference Optimization (SE-POPO), an online RLHF algorithm that for the first time achieves a sample complexity that scales polynomially with the reward scale, answering an open problem raised by Xie et al. (2024).. Theoretically, we demonstrate that the sample complexity of SE-POPO dominates that of existing exploration algorithms. Empirically, our systematic evaluation confirms that SE-POPO is more sample-efficient than both exploratory and non-exploratory baselines, in two primary application scenarios of RLHF as well as on public benchmarks, marking a significant step forward in RLHF algorithm design. The code is available at https://github.com/MYC000801/SE-POPO.
Chinese: 本文提出SE-POPO算法,首次实现了样本复杂度与奖励规模的多项式关系,突破了现有在线RLHF方法的指数级增长瓶颈,在理论和实验验证中均展现出更优的样本效率。
English: This paper introduces SE-POPO, a novel online RLHF algorithm that achieves polynomial sample complexity scaling with reward size, solving a key limitation of existing methods and demonstrating superior efficiency in both theoretical analysis and empirical evaluations.

Authors:Mingyu Chen, Yiding Chen, Wen Sun, Xuezhou Zhang
Title: Avoiding $\mathbf{exp(R_{max})}$ scaling in RLHF through Preference-based Exploration
Abstract:
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique for large language model (LLM) alignment. This paper studies the setting of online RLHF and focus on improving sample efficiency. All existing algorithms in online RLHF, whether doing passive exploration or active exploration, suffer from a sample complexity that scales exponentially with the scale of the reward function. This fundamental limitation hinders their effectiveness in scenarios with heavily skewed preferences, e.g. questions with a unique correct solution. To address this, we introduce Self-Exploring Preference-Incentive Online Preference Optimization (SE-POPO), an online RLHF algorithm that for the first time achieves a sample complexity that scales polynomially with the reward scale, answering an open problem raised by Xie et al. (2024).. Theoretically, we demonstrate that the sample complexity of SE-POPO dominates that of existing exploration algorithms. Empirically, our systematic evaluation confirms that SE-POPO is more sample-efficient than both exploratory and non-exploratory baselines, in two primary application scenarios of RLHF as well as on public benchmarks, marking a significant step forward in RLHF algorithm design. The code is available at https://github.com/MYC000801/SE-POPO.
Chinese: 本文提出SE-POPO算法,首次实现了样本复杂度与奖励规模的多项式关系,突破了现有在线RLHF方法的指数级增长瓶颈,在理论和实验验证中均展现出更优的样本效率。
English: This paper introduces SE-POPO, a novel online RLHF algorithm that achieves polynomial sample complexity scaling with reward size, solving a key limitation of existing methods and demonstrating superior efficiency in both theoretical analysis and empirical evaluations.

Authors:Changseung Kim, Geunsik Bae, Woojae Shin, Sen Wang, Hyondong Oh
Title: EKF-Based Radar-Inertial Odometry with Online Temporal Calibration
Abstract:
Accurate time synchronization between heterogeneous sensors is crucial for ensuring robust state estimation in multi-sensor fusion systems. Sensor delays often cause discrepancies between the actual time when the event was captured and the time of sensor measurement, leading to temporal misalignment (time offset) between sensor measurement streams. In this paper, we propose an extended Kalman filter (EKF)-based radar-inertial odometry (RIO) framework that estimates the time offset online. The radar ego-velocity measurement model, derived from a single radar scan, is formulated to incorporate the time offset into the update. By leveraging temporal calibration, the proposed RIO enables accurate propagation and measurement updates based on a common time stream. Experiments on both simulated and real-world datasets demonstrate the accurate time offset estimation of the proposed method and its impact on RIO performance, validating the importance of sensor time synchronization. Our implementation of the EKF-RIO with online temporal calibration is available at https://github.com/spearwin/EKF-RIO-TC.
中文摘要:本文提出一种基于扩展卡尔曼滤波的雷达-惯性里程计框架,通过时间校准实现在线时间偏移估计,实验证明该方法能有效提升多传感器融合系统的状态估计精度。
English Summary: This paper introduces an EKF-based radar-inertial odometry framework that performs online time offset estimation through temporal calibration, significantly improving state estimation accuracy in multi-sensor systems as validated by experimental results.

Authors:Donglei Yu, Yang Zhao, Jie Zhu, Yangyifan Xu, Yu Zhou, Chengqing Zong
Title: SimulPL: Aligning Human Preferences in Simultaneous Machine Translation
Abstract:
Simultaneous Machine Translation (SiMT) generates translations while receiving streaming source inputs. This requires the SiMT model to learn a read/write policy, deciding when to translate and when to wait for more source input. Numerous linguistic studies indicate that audiences in SiMT scenarios have distinct preferences, such as accurate translations, simpler syntax, and no unnecessary latency. Aligning SiMT models with these human preferences is crucial to improve their performances. However, this issue still remains unexplored. Additionally, preference optimization for SiMT task is also challenging. Existing methods focus solely on optimizing the generated responses, ignoring human preferences related to latency and the optimization of read/write policy during the preference optimization phase. To address these challenges, we propose Simultaneous Preference Learning (SimulPL), a preference learning framework tailored for the SiMT task. In the SimulPL framework, we categorize SiMT human preferences into five aspects: \textbf{translation quality preference}, \textbf{monotonicity preference}, \textbf{key point preference}, \textbf{simplicity preference}, and \textbf{latency preference}. By leveraging the first four preferences, we construct human preference prompts to efficiently guide GPT-4/4o in generating preference data for the SiMT task. In the preference optimization phase, SimulPL integrates \textbf{latency preference} into the optimization objective and enables SiMT models to improve the read/write policy, thereby aligning with human preferences more effectively. Experimental results indicate that SimulPL exhibits better alignment with human preferences across all latency levels in Zh$\rightarrow$En, De$\rightarrow$En and En$\rightarrow$Zh SiMT tasks. Our data and code will be available at https://github.com/EurekaForNLP/SimulPL.
中文: 同步机器翻译需要模型在接收源输入时平衡翻译决策,而提出的SimulPL框架通过整合翻译质量、单调性、关键点、简洁性和延迟这五类人类偏好,有效优化了多语言任务中的表现和一致性。
English: Simultaneous Machine Translation requires models to balance translation decisions with input reception, and the proposed SimulPL framework addresses this by incorporating five human preferences—translation quality, monotonicity, key points, simplicity, and latency—to optimize performance and alignment in various language tasks.

Authors:Yujin Oh, Pengfei Jin, Sangjoon Park, Sekeun Kim, Siyeop Yoon, Kyungsang Kim, Jin Sung Kim, Xiang Li, Quanzheng Li
Title: Distribution-aware Fairness Learning in Medical Image Segmentation From A Control-Theoretic Perspective
Abstract:
Ensuring fairness in medical image segmentation is critical due to biases in imbalanced clinical data acquisition caused by demographic attributes (e.g., age, sex, race) and clinical factors (e.g., disease severity). To address these challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired by optimal control theory. We provide a comprehensive analysis of its underlying mechanisms and clarify dMoE's role in adapting to heterogeneous distributions in medical image segmentation. Furthermore, we integrate dMoE into multiple network architectures, demonstrating its broad applicability across diverse medical image analysis tasks. By incorporating demographic and clinical factors, dMoE achieves state-of-the-art performance on two 2D benchmark datasets and a 3D in-house dataset. Our results highlight the effectiveness of dMoE in mitigating biases from imbalanced distributions, offering a promising approach to bridging control theory and medical image segmentation within fairness learning paradigms. The source code will be made available. The source code is available at https://github.com/tvseg/dMoE.
Chinese: 本研究提出了分布感知专家混合模型(dMoE),通过整合人口统计学和临床因素来减轻医学图像分割中的偏差,在多个数据集上实现最优性能,并将控制理论与公平性学习相结合。
English: The study introduces Distribution-aware Mixture of Experts (dMoE), a method that integrates demographic and clinical factors to mitigate biases in medical image segmentation, achieving top performance across multiple datasets and bridging control theory with fairness learning.

Authors:Saarthak Kapse, Robin Betz, Srinivasan Sivanandan
Title: Fast Vision Mamba: Pooling Spatial Dimensions for Accelerated Processing
Abstract:
State Space Models (SSMs) with selective scan (Mamba) have been adapted into efficient vision models. Mamba, unlike Vision Transformers, achieves linear complexity for token interactions through a recurrent hidden state process. This sequential processing is enhanced by a parallel scan algorithm, which reduces the computational time of recurrent steps from $L$ sequential steps to $log(L)$ parallel steps with respect to the number of input tokens ($L$). In this work, we propose Fast Vision Mamba (FastVim), that further reduces the computational time of the SSM block by reducing the number of recurrent steps in Vision Mamba models while still retaining model performance. By alternately pooling tokens along image dimensions across Mamba blocks, we obtain a 2$\times$ reduction in the number of parallel steps in SSM block. Our model offers up to $72.5\%$ speedup in inference speed compared to baseline Vision Mamba models on high resolution (2048$\times$2048) images. Our experiments demonstrate state-of-the-art performance with dramatically improved throughput in a range of tasks such as image classification, cell perturbation prediction, segmentation, and object detection. Code is made available at https://github.com/insitro/FastVim
中文:FastVim通过跨Mamba块交替池化图像维度的标记,将SSM模块的并行步骤减半,在保持各类视觉任务顶尖性能的同时,实现高分辨率图像推理速度最高提升72.5%。
English: FastVim enhances Vision Mamba models by reducing recurrent steps through token pooling, achieving up to 72.5% faster inference on high-resolution images while maintaining top performance across vision tasks.

Authors:Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, Zexue He
Title: M+: Extending MemoryLLM with Scalable Long-Term Memory
Abstract:
Equipping large language models (LLMs) with latent-space memory has attracted increasing attention as they can extend the context window of existing language models. However, retaining information from the distant past remains a challenge. For example, MemoryLLM (Wang et al., 2024a), as a representative work with latent-space memory, compresses past information into hidden states across all layers, forming a memory pool of 1B parameters. While effective for sequence lengths up to 16k tokens, it struggles to retain knowledge beyond 20k tokens. In this work, we address this limitation by introducing M+, a memory-augmented model based on MemoryLLM that significantly enhances long-term information retention. M+ integrates a long-term memory mechanism with a co-trained retriever, dynamically retrieving relevant information during text generation. We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks. Experimental results show that M+ significantly outperforms MemoryLLM and recent strong baselines, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory overhead. We open-source our code at https://github.com/wangyu-ustc/MemoryLLM
Chinese: M+在MemoryLLM基础上引入长期记忆机制和联合训练的检索器,将知识保留能力从不足2万标记扩展到超过16万标记,同时保持相近的GPU内存开销。
English: M+ enhances MemoryLLM by integrating a long-term memory mechanism and a co-trained retriever, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory usage.

Authors:Samiran Dey, Christopher R. S. Banerji, Partha Basuchowdhuri, Sanjoy K. Saha, Deepak Parashar, Tapabrata Chakraborti
Title: Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions
Abstract:
Emerging research has highlighted that artificial intelligence based multimodal fusion of digital pathology and transcriptomic features can improve cancer diagnosis (grading/subtyping) and prognosis (survival risk) prediction. However, such direct fusion for joint decision is impractical in real clinical settings, where histopathology is still the gold standard for diagnosis and transcriptomic tests are rarely requested, at least in the public healthcare system. With our novel diffusion based crossmodal generative AI model PathGen, we show that genomic expressions synthesized from digital histopathology jointly predicts cancer grading and patient survival risk with high accuracy (state-of-the-art performance), certainty (through conformal coverage guarantee) and interpretability (through distributed attention maps). PathGen code is available for open use by the research community through GitHub at https://github.com/Samiran-Dey/PathGen.
中文摘要:PathGen模型通过基于扩散的生成式人工智能,从数字病理学中合成基因组表达,无需实际转录组测试即可实现对癌症分级和患者生存风险的高精度、可解释预测。
English Summary: The PathGen model uses diffusion-based generative AI to synthesize genomic expressions from digital pathology, enabling accurate and interpretable predictions of cancer grading and patient survival risk without requiring actual transcriptomic tests.

Authors:Renhao Lu
Title: Complex Wavelet Mutual Information Loss: A Multi-Scale Loss Function for Semantic Segmentation
Abstract:
Recent advancements in deep neural networks have significantly enhanced the performance of semantic segmentation. However, class imbalance and instance imbalance remain persistent challenges, where smaller instances and thin boundaries are often overshadowed by larger structures. To address the multiscale nature of segmented objects, various models have incorporated mechanisms such as spatial attention and feature pyramid networks. Despite these advancements, most loss functions are still primarily pixel-wise, while regional and boundary-focused loss functions often incur high computational costs or are restricted to small-scale regions. To address this limitation, we propose the complex wavelet mutual information (CWMI) loss, a novel loss function that leverages mutual information from subband images decomposed by a complex steerable pyramid. The complex steerable pyramid captures features across multiple orientations and preserves structural similarity across scales. Meanwhile, mutual information is well-suited to capturing high-dimensional directional features and offers greater noise robustness. Extensive experiments on diverse segmentation datasets demonstrate that CWMI loss achieves significant improvements in both pixel-wise accuracy and topological metrics compared to state-of-the-art methods, while introducing minimal computational overhead. Our code is available at https://github.com/lurenhaothu/CWMI
中文: 提出的复杂小波互信息损失函数通过复杂可控金字塔分解和互信息利用多尺度结构特征,相比现有方法以最小计算开销实现了更优的分割精度和拓扑度量。
English: The proposed complex wavelet mutual information (CWMI) loss function leverages multiscale structural features through complex steerable pyramid decomposition and mutual information, achieving superior segmentation accuracy and topological metrics with minimal computational overhead compared to existing methods.

Authors:Mukesh Ghimire, Lei Zhang, Zhe Xu, Yi Ren
Title: A Scalable Solver for 2p0s Differential Games with One-Sided Payoff Information and Continuous Actions, States, and Time
Abstract:
Existing solvers for imperfect-information extensive-form games (IIEFGs) often struggle with scalability in terms of action and state space sizes and the number of time steps. However, many real-world games involve continuous action and state spaces and occur in continuous time, making them differential in nature. This paper addresses the scalability challenges for a representative class of two-player zero-sum (2p0s) differential games where the informed player knows the game type (payoff) while the uninformed one only has a prior belief over the set of possible types. Such games encompass a wide range of attack-defense scenarios, where the defender adapts based on their belief about the attacker's target. We make the following contributions: (1) We show that under the Isaacs' condition, the complexity of computing the Nash equilibrium for these games is not related to the action space size; and (2) we propose a multigrid approach to effectively reduce the cost of these games when many time steps are involved. Code for this work is available at https://github.com/ghimiremukesh/cams/tree/conf_sub.
中文: 本研究通过揭示均衡策略会集中在有限行动原型上,解决了双人不完美信息博弈的可扩展性难题,显著降低了博弈复杂度,并为足球策略等复杂场景提供了高效解决方案。
English: This research addresses the scalability challenge in two-player imperfect-information games by revealing that equilibrium strategies concentrate on limited action prototypes, significantly reducing game complexity and enabling efficient solutions for complex scenarios like football strategy.

Authors:Mukesh Ghimire, Lei Zhang, Zhe Xu, Yi Ren
Title: Solving Football by Exploiting Equilibrium Structure of 2p0s Differential Games with One-Sided Information
Abstract:
For a two-player imperfect-information extensive-form game (IIEFG) with $K$ time steps and a player action space of size $U$, the game tree complexity is $U^{2K}$, causing existing IIEFG solvers to struggle with large or infinite $(U,K)$, e.g., differential games with continuous action spaces. To partially address this scalability challenge, we focus on an important class of 2p0s games where the informed player (P1) knows the payoff while the uninformed player (P2) only has a belief over the set of $I$ possible payoffs. Such games encompass a wide range of scenarios in sports, defense, cybersecurity, and finance. We prove that under mild conditions, P1's (resp. P2's) equilibrium strategy at any infostate concentrates on at most $I$ (resp. $I+1$) action prototypes. When $I\ll U$, this equilibrium structure causes the game tree complexity to collapse to $I^K$ for P1 when P2 plays pure best responses, and $(I+1)^K$ for P2 in a dual game where P1 plays pure best responses. We then show that exploiting this structure in standard learning modes, i.e., model-free multiagent reinforcement learning and model predictive control, is straightforward, leading to significant improvements in learning accuracy and efficiency from SOTA IIEFG solvers. Our demonstration solves a 22-player football game ($K=10$, $U=\infty$) where the attacking team has to strategically conceal their intention until a critical moment in order to exploit information advantage. Code is available at https://github.com/ghimiremukesh/cams/tree/iclr
中文: 本研究通过揭示均衡策略会集中在有限行动原型上,解决了双人不完美信息博弈的可扩展性难题,显著降低了博弈复杂度,并为足球策略等复杂场景提供了高效解决方案。
English: This research addresses the scalability challenge in two-player imperfect-information games by revealing that equilibrium strategies concentrate on limited action prototypes, significantly reducing game complexity and enabling efficient solutions for complex scenarios like football strategy.

Authors:Zaitian Wang, Jian He, Yu Liang, Xiyuan Hu, Tianhao Peng, Kaixin Wang, Jiakai Wang, Chenlong Zhang, Weili Zhang, Shuang Niu, Xiaoyang Xie
Title: Milmer: a Framework for Multiple Instance Learning based Multimodal Emotion Recognition
Abstract:
Emotions play a crucial role in human behavior and decision-making, making emotion recognition a key area of interest in human-computer interaction (HCI). This study addresses the challenges of emotion recognition by integrating facial expression analysis with electroencephalogram (EEG) signals, introducing a novel multimodal framework-Milmer. The proposed framework employs a transformer-based fusion approach to effectively integrate visual and physiological modalities. It consists of an EEG preprocessing module, a facial feature extraction and balancing module, and a cross-modal fusion module. To enhance visual feature extraction, we fine-tune a pre-trained Swin Transformer on emotion-related datasets. Additionally, a cross-attention mechanism is introduced to balance token representation across modalities, ensuring effective feature integration. A key innovation of this work is the adoption of a multiple instance learning (MIL) approach, which extracts meaningful information from multiple facial expression images over time, capturing critical temporal dynamics often overlooked in previous studies. Extensive experiments conducted on the DEAP dataset demonstrate the superiority of the proposed framework, achieving a classification accuracy of 96.72% in the four-class emotion recognition task. Ablation studies further validate the contributions of each module, highlighting the significance of advanced feature extraction and fusion strategies in enhancing emotion recognition performance. Our code are available at https://github.com/liangyubuaa/Milmer.
中文摘要:本研究提出名为Milmer的新型多模态框架,通过融合面部表情与脑电信号,采用基于Transformer的融合方法和多示例学习,在DEAP数据集上实现了96.72%的情感识别准确率。
English Summary: This study introduces Milmer, a novel multimodal framework that integrates facial expressions and EEG signals using transformer-based fusion and multiple instance learning to achieve 96.72% emotion recognition accuracy on the DEAP dataset.

Authors:Mohammad Nazeri, Anuj Pokhrel, Alexandyr Card, Aniket Datar, Garrett Warnell, Xuesu Xiao
Title: VertiFormer: A Data-Efficient Multi-Task Transformer for Off-Road Robot Mobility
Abstract:
Sophisticated learning architectures, e.g., Transformers, present a unique opportunity for robots to understand complex vehicle-terrain kinodynamic interactions for off-road mobility. While internet-scale data are available for Natural Language Processing (NLP) and Computer Vision (CV) tasks to train Transformers, real-world mobility data are difficult to acquire with physical robots navigating off-road terrain. Furthermore, training techniques specifically designed to process text and image data in NLP and CV may not apply to robot mobility. In this paper, we propose VertiFormer, a novel data-efficient multi-task Transformer model trained with only one hour of data to address such challenges of applying Transformer architectures for robot mobility on extremely rugged, vertically challenging, off-road terrain. Specifically, VertiFormer employs a new learnable masked modeling and next token prediction paradigm to predict the next pose, action, and terrain patch to enable a variety of off-road mobility tasks simultaneously, e.g., forward and inverse kinodynamics modeling. The non-autoregressive design mitigates computational bottlenecks and error propagation associated with autoregressive models. VertiFormer's unified modality representation also enhances learning of diverse temporal mappings and state representations, which, combined with multiple objective functions, further improves model generalization. Our experiments offer insights into effectively utilizing Transformers for off-road robot mobility with limited data and demonstrate our efficiently trained Transformer can facilitate multiple off-road mobility tasks onboard a physical mobile robot.
Chinese: VertiFormer是一种数据高效的多任务Transformer模型,仅用一小时数据即可预测机器人姿态、动作和地形区块,以解决越野移动中的数据稀缺和计算瓶颈问题。
English: VertiFormer is a data-efficient multi-task Transformer model that uses only one hour of data to predict robot poses, actions, and terrain patches for off-road mobility tasks, overcoming the limitations of data scarcity and computational bottlenecks.

Authors:David Oro, Carles Fernández, Xavier Martorell, Javier Hernando
Title: Work-Efficient Parallel Non-Maximum Suppression Kernels
Abstract:
In the context of object detection, sliding-window classifiers and single-shot Convolutional Neural Network (CNN) meta-architectures typically yield multiple overlapping candidate windows with similar high scores around the true location of a particular object. Non-Maximum Suppression (NMS) is the process of selecting a single representative candidate within this cluster of detections, so as to obtain a unique detection per object appearing on a given picture. In this paper, we present a highly scalable NMS algorithm for embedded GPU architectures that is designed from scratch to handle workloads featuring thousands of simultaneous detections on a given picture. Our kernels are directly applicable to other sequential NMS algorithms such as FeatureNMS, Soft-NMS or AdaptiveNMS that share the inner workings of the classic greedy NMS method. The obtained performance results show that our parallel NMS algorithm is capable of clustering 1024 simultaneous detected objects per frame in roughly 1 ms on both NVIDIA Tegra X1 and NVIDIA Tegra X2 on-die GPUs, while taking 2 ms on NVIDIA Tegra K1. Furthermore, our proposed parallel greedy NMS algorithm yields a 14x-40x speed up when compared to state-of-the-art NMS methods that require learning a CNN from annotated data.
中文: 本文提出了一种专为嵌入式GPU架构设计的高度可扩展并行非极大值抑制算法,在毫秒级时间内高效处理每帧多达1024个检测目标,相比现有方法实现了14至40倍的显著加速。
English: This paper introduces a highly scalable parallel Non-Maximum Suppression (NMS) algorithm designed for embedded GPU architectures, achieving significant speed improvements of 14x to 40x over existing methods while efficiently clustering up to 1024 detections per frame in milliseconds.

Authors:Yuxuan Chen, Xu Zhu, Hua Zhou, Zhuyin Ren
Title: MetaOpenFOAM 2.0: Large Language Model Driven Chain of Thought for Automating CFD Simulation and Post-Processing
Abstract:
Computational Fluid Dynamics (CFD) is widely used in aerospace, energy, and biology to model fluid flow, heat transfer, and chemical reactions. While Large Language Models (LLMs) have transformed various domains, their application in CFD remains limited, particularly for complex tasks like post-processing. To bridge this gap, we introduce MetaOpenFOAM 2.0, which leverages Chain of Thought (COT) decomposition and iterative verification to enhance accessibility for non-expert users through natural language inputs. Tested on a new benchmark covering simulation (fluid flow, heat transfer, combustion) and post-processing (extraction, visualization), MetaOpenFOAM 2.0 achieved an Executability score of 6.3/7 and a pass rate of 86.9%, significantly outperforming MetaOpenFOAM 1.0 (2.1/7, 0%). Additionally, it proved cost-efficient, averaging $0.15 per case. An ablation study confirmed that COT-driven decomposition and iterative refinement substantially improved task performance. Furthermore, scaling laws showed that increasing COT steps enhanced accuracy while raising token usage, aligning with LLM post-training scaling trends. These results highlight the transformative potential of LLMs in automating CFD workflows for industrial and research applications. Code is available at https://github.com/Terry-cyx/MetaOpenFOAM
中文:MetaOpenFOAM 2.0通过思维链分解与迭代验证实现了基于自然语言的CFD流程自动化,以86.9%的可执行率和显著成本效益展现了工业应用潜力。
English: MetaOpenFOAM 2.0 integrates Chain of Thought decomposition and iterative verification to enable natural language-based CFD automation, achieving 86.9% executability with significant cost efficiency.

Authors:David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos
Title: Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions
Abstract:
Visual speech recognition remains an open research problem where different challenges must be considered by dispensing with the auditory sense, such as visual ambiguities, the inter-personal variability among speakers, and the complex modeling of silence. Nonetheless, recent remarkable results have been achieved in the field thanks to the availability of large-scale databases and the use of powerful attention mechanisms. Besides, multiple languages apart from English are nowadays a focus of interest. This paper presents noticeable advances in automatic continuous lipreading for Spanish. First, an end-to-end system based on the hybrid CTC/Attention architecture is presented. Experiments are conducted on two corpora of disparate nature, reaching state-of-the-art results that significantly improve the best performance obtained to date for both databases. In addition, a thorough ablation study is carried out, where it is studied how the different components that form the architecture influence the quality of speech recognition. Then, a rigorous error analysis is carried out to investigate the different factors that could affect the learning of the automatic system. Finally, a new Spanish lipreading benchmark is consolidated. Code and trained models are available at https://github.com/david-gimeno/evaluating-end2end-spanish-lipreading.
中文摘要:本文提出了一种基于混合CTC/注意力架构的西班牙语端到端唇读系统,通过全面实验在两个语料库上取得最优性能,并建立了新的西班牙语唇读基准。
English Summary: This paper presents an end-to-end lipreading system for Spanish using a hybrid CTC/Attention architecture, achieving state-of-the-art results on two corpora through comprehensive experiments and establishing a new Spanish lipreading benchmark.

Authors:Kihwan Ryoo, Hyungtae Lim, Hyun Myung
Title: MambaGlue: Fast and Robust Local Feature Matching With Mamba
Abstract:
In recent years, robust matching methods using deep learning-based approaches have been actively studied and improved in computer vision tasks. However, there remains a persistent demand for both robust and fast matching techniques. To address this, we propose a novel Mamba-based local feature matching approach, called MambaGlue, where Mamba is an emerging state-of-the-art architecture rapidly gaining recognition for its superior speed in both training and inference, and promising performance compared with Transformer architectures. In particular, we propose two modules: a) MambaAttention mixer to simultaneously and selectively understand the local and global context through the Mamba-based self-attention structure and b) deep confidence score regressor, which is a multi-layer perceptron (MLP)-based architecture that evaluates a score indicating how confidently matching predictions correspond to the ground-truth correspondences. Consequently, our MambaGlue achieves a balance between robustness and efficiency in real-world applications. As verified on various public datasets, we demonstrate that our MambaGlue yields a substantial performance improvement over baseline approaches while maintaining fast inference speed. Our code will be available on https://github.com/url-kaist/MambaGlue
中文摘要:本文提出MambaGlue,一种基于Mamba架构的新型局部特征匹配方法,通过MambaAttention混合器实现局部与全局上下文理解,结合深度置信度回归器评估匹配精度,在保持快速推理的同时显著提升了匹配鲁棒性。
English Summary: This paper introduces MambaGlue, a novel local feature matching method leveraging the Mamba architecture to achieve robust performance and fast inference by incorporating a MambaAttention mixer for contextual understanding and a deep confidence score regressor for matching accuracy.

Authors:Yizhe Xiong, Wei Huang, Xin Ye, Hui Chen, Zijia Lin, Haoran Lian, Zhenpeng Su, Jungong Han, Guiguang Ding
Title: UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs
Abstract:
Post-training is essential for adapting Large Language Models (LLMs) to real-world applications. Deploying post-trained models faces significant challenges due to substantial memory overhead and noticeable inference latency. Existing work has identified significant redundancies in LLMs and proposed efficient architectures, namely intra-layer KV sharing and cross-layer KV sharing. However, intra-layer KV sharing still results in high inference costs, while cross-layer KV sharing leads to significant performance degradation. As a result, both methods remain suboptimal for post-training pre-trained LLMs. In this paper, we identify that the \texttt{Softmax} operation is a primary bottleneck for LLM inference and discover that it is actually highly redundant during post-training. We propose Softmax \textbf{Uni}fication in \textbf{Att}e\textbf{n}tion (\textbf{UniAttn}), a novel post-training method that unifies Softmax activations across transformer blocks to reduce LLM inference costs. Additionally, UniAttn adopts a linear projection to compensate for the errors induced by Softmax unification. Experiments show that UniAttn matches the performance of standard post-training while significantly reducing inference costs, outperforming existing efficient architectures during post-training. Our code will be available at \url{https://github.com/Bostoncake/UniAttn}.
中文摘要:针对大型语言模型(LLM)后训练在现实应用中面临的高内存开销和推理延迟问题,UniAttn方法通过统一Transformer块中的Softmax操作,在保持性能的同时显著降低了推理成本。
English Summary: Post-training large language models (LLMs) for real-world use faces challenges like high memory usage and slow inference, which the proposed UniAttn method addresses by unifying Softmax operations across transformer blocks to cut costs without sacrificing performance.

Authors:Chuc Man Duc, Hiromichi Fukui
Title: SatMamba: Development of Foundation Models for Remote Sensing Imagery Using State Space Models
Abstract:
Foundation models refer to deep learning models pretrained on large unlabeled datasets through self-supervised algorithms. In the Earth science and remote sensing communities, there is growing interest in transforming the use of Earth observation data, including satellite and aerial imagery, through foundation models. Various foundation models have been developed for remote sensing, such as those for multispectral, high-resolution, and hyperspectral images, and have demonstrated superior performance on various downstream tasks compared to traditional supervised models. These models are evolving rapidly, with capabilities to handle multispectral, multitemporal, and multisensor data. Most studies use masked autoencoders in combination with Vision Transformers (ViTs) as the backbone for pretraining. While the models showed promising performance, ViTs face challenges, such as quadratic computational scaling with input length, which may limit performance on multiband and multitemporal data with long sequences. This research aims to address these challenges by proposing SatMamba, a new pretraining framework that combines masked autoencoders with State Space Model, offering linear computational scaling. Experiments on high-resolution imagery across various downstream tasks show promising results, paving the way for more efficient foundation models and unlocking the full potential of Earth observation data. The source code is available in https://github.com/mdchuc/HRSFM.
地球科学中的基础模型正快速发展以更高效处理复杂遥感数据,SatMamba通过结合状态空间模型提出线性计算扩展的新方法,在下游任务中展现出优越性能。
Foundation models in Earth science are evolving to process complex remote sensing data more efficiently, with SatMamba introducing a novel approach using State Space Models for linear computational scaling, enhancing performance on downstream tasks.

Authors:Xinle Cheng, Zhuoming Chen, Zhihao Jia
Title: CAT Pruning: Cluster-Aware Token Pruning For Text-to-Image Diffusion Models
Abstract:
Diffusion models have revolutionized generative tasks, especially in the domain of text-to-image synthesis; however, their iterative denoising process demands substantial computational resources. In this paper, we present a novel acceleration strategy that integrates token-level pruning with caching techniques to tackle this computational challenge. By employing noise relative magnitude, we identify significant token changes across denoising iterations. Additionally, we enhance token selection by incorporating spatial clustering and ensuring distributional balance. Our experiments demonstrate reveal a 50%-60% reduction in computational costs while preserving the performance of the model, thereby markedly increasing the efficiency of diffusion models. The code is available at https://github.com/ada-cheng/CAT-Pruning
中文摘要:本文提出了一种结合令牌级剪枝与缓存的新颖加速方法,将扩散模型的计算成本降低50%-60%,同时保持其性能表现。
English Summary: This paper introduces a novel acceleration method combining token-level pruning and caching to reduce diffusion models' computational costs by 50%-60% while maintaining performance.

Authors:JiangYong Yu, Sifan Zhou, Dawei Yang, Shuo Wang, Shuoyu Li, Xing Hu, Chen Xu, Zukang Xu, Changyong Shu, Zhihang Yuan
Title: MQuant: Unleashing the Inference Potential of Multimodal Large Language Models via Full Static Quantization
Abstract:
Multimodal large language models (MLLMs) have garnered widespread attention due to their ability to understand multimodal input. However, their large parameter sizes and substantial computational demands severely hinder their practical deployment and application.While quantization is an effective way to reduce model size and inference latency, its application to MLLMs remains underexplored. In this paper, we propose MQuant, a post-training quantization (PTQ) framework designed to tackle the unique challenges of multimodal large language models (MLLMs). Conventional quantization often struggles with MLLMs because of (a) high inference latency from large visual token counts, (b) distributional disparities between visual and textual tokens, and (c) extreme outliers introduced by Hadamard-based transformations. To address these issues, MQuant introduces: Modality-Specific Static Quantization (MSQ), assigning distinct static scales for visual vs. textual tokens; Attention-Invariant Flexible Switching (AIFS), reordering tokens to preserve casual attention while eliminating expensive token-wise scale computations; Rotation Magnitude Suppression (RMS), mitigating weight outliers arising from online Hadamard rotations. On five mainstream MLLMs (including Qwen-VL, MiniCPM-V, CogVLM2), MQuant under W4A8 achieves near-floating-point accuracy (<1% degradation) while reducing inference latency by up to 30%, significantly outperforming existing PTQ baselines. Our MQuant effectively bridges the gap for efficient and accurate MLLMs inference in resource-constrained devices. Code has been released in https://github.com/StiphyJay/MQuant.
中文:MQuant是一种后训练量化框架,专门解决多模态大语言模型的独特挑战,在精度损失低于1%的同时将推理延迟降低高达30%,实现在资源受限设备上的高效部署。
English: MQuant is a post-training quantization framework that addresses multimodal large language models' unique challenges, achieving near-floating-point accuracy with less than 1% degradation while cutting inference latency by up to 30% for efficient deployment on resource-limited devices.

Authors:Turi Abu, Ying Shi, Thomas Fang Zheng, Dong Wang
Title: Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language
Abstract:
We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. The dataset was collected through a crowd-sourcing initiative, encompassing a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and noisy environments. This dataset addresses the critical need for ASR resources for the Oromo language which is underrepresented. To show its applicability for the ASR task, we conducted experiments using the Conformer model, achieving a Word Error Rate (WER) of 15.32% with hybrid CTC and AED loss and WER of 18.74% with pure CTC loss. Additionally, fine-tuning the Whisper model resulted in a significantly improved WER of 10.82%. These results establish baselines for Oromo ASR, highlighting both the challenges and the potential for improving ASR performance in Oromo. The dataset is publicly available at https://github.com/turinaf/sagalee and we encourage its use for further research and development in Oromo speech processing.
中文: 本研究通过众包收集了100小时的奥罗莫语语音数据集,以解决该语言在自动语音识别中的资源匮乏问题,实验显示通过Conformer和Whisper模型微调可将词错误率优化至10.82%,为奥罗莫语语音处理建立了首批基准性能指标。
English: This study introduces a novel 100-hour Oromo speech dataset collected via crowd-sourcing to address the language's underrepresentation in ASR, demonstrating through Conformer and Whisper model experiments that fine-tuning achieves a competitive 10.82% WER and establishing initial benchmarks for Oromo speech processing.

Authors:Carolin Teuber, Anwai Archit, Constantin Pape
Title: Parameter Efficient Fine-Tuning of Segment Anything Model for Biomedical Imaging
Abstract:
Segmentation is an important analysis task for biomedical images, enabling the study of individual organelles, cells or organs. Deep learning has massively improved segmentation methods, but challenges remain in generalization to new conditions, requiring costly data annotation. Vision foundation models, such as Segment Anything Model (SAM), address this issue through improved generalization. However, these models still require finetuning on annotated data, although with less annotations, to achieve optimal results for new conditions. As a downside, they require more computational resources. This makes parameter-efficient finetuning (PEFT) relevant. We contribute the first comprehensive study of PEFT for SAM applied to biomedical images. We find that the placement of PEFT layers is more important for efficiency than the type of layer for vision transformers and we provide a recipe for resource-efficient finetuning. Our code is publicly available at https://github.com/computational-cell-analytics/peft-sam.
中文摘要:本研究首次系统分析了参数高效微调在生物医学图像分割中Segment Anything模型的应用,发现对于视觉变换器而言,层的位置策略比层类型更能提升效率,并提供了一种资源优化的微调方案。
English Summary: This study presents the first comprehensive analysis of parameter-efficient fine-tuning (PEFT) for the Segment Anything Model in biomedical image segmentation, revealing that strategic layer placement outweighs layer type for efficiency and offering a resource-optimized fine-tuning method.

Authors:Titus Griebel, Anwai Archit, Constantin Pape
Title: Segment Anything for Histopathology
Abstract:
Nucleus segmentation is an important analysis task in digital pathology. However, methods for automatic segmentation often struggle with new data from a different distribution, requiring users to manually annotate nuclei and retrain data-specific models. Vision foundation models (VFMs), such as the Segment Anything Model (SAM), offer a more robust alternative for automatic and interactive segmentation. Despite their success in natural images, a foundation model for nucleus segmentation in histopathology is still missing. Initial efforts to adapt SAM have shown some success, but did not yet introduce a comprehensive model for diverse segmentation tasks. To close this gap, we introduce PathoSAM, a VFM for nucleus segmentation, based on training SAM on a diverse dataset. Our extensive experiments show that it is the new state-of-the-art model for automatic and interactive nucleus instance segmentation in histopathology. We also demonstrate how it can be adapted for other segmentation tasks, including semantic nucleus segmentation. For this task, we show that it yields results better than popular methods, while not yet beating the state-of-the-art, CellViT. Our models are open-source and compatible with popular tools for data annotation. We also provide scripts for whole-slide image segmentation. Our code and models are publicly available at https://github.com/computational-cell-analytics/patho-sam.
中文:PathoSAM是一种基于多样化数据集训练的新型视觉基础模型,在组织病理学中实现了自动和交互式细胞核实例分割的最新性能,同时也能适应其他分割任务。
English: PathoSAM is a new vision foundation model trained on diverse datasets that achieves state-of-the-art performance in automatic and interactive nucleus instance segmentation in histopathology while also being adaptable to other segmentation tasks.

Authors:Karish Grover, Haiyang Yu, Xiang Song, Qi Zhu, Han Xie, Vassilis N. Ioannidis, Christos Faloutsos
Title: Spectro-Riemannian Graph Neural Networks
Abstract:
Can integrating spectral and curvature signals unlock new potential in graph representation learning? Non-Euclidean geometries, particularly Riemannian manifolds such as hyperbolic (negative curvature) and spherical (positive curvature), offer powerful inductive biases for embedding complex graph structures like scale-free, hierarchical, and cyclic patterns. Meanwhile, spectral filtering excels at processing signal variations across graphs, making it effective in homophilic and heterophilic settings. Leveraging both can significantly enhance the learned representations. To this end, we propose Spectro-Riemannian Graph Neural Networks (CUSP) - the first graph representation learning paradigm that unifies both CUrvature (geometric) and SPectral insights. CUSP is a mixed-curvature spectral GNN that learns spectral filters to optimize node embeddings in products of constant-curvature manifolds (hyperbolic, spherical, and Euclidean). Specifically, CUSP introduces three novel components: (a) Cusp Laplacian, an extension of the traditional graph Laplacian based on Ollivier-Ricci curvature, designed to capture the curvature signals better; (b) Cusp Filtering, which employs multiple Riemannian graph filters to obtain cues from various bands in the eigenspectrum; and (c) Cusp Pooling, a hierarchical attention mechanism combined with a curvature-based positional encoding to assess the relative importance of differently curved substructures in our graph. Empirical evaluation across eight homophilic and heterophilic datasets demonstrates the superiority of CUSP in node classification and link prediction tasks, with a gain of up to 5.3% over state-of-the-art models. The code is available at: https://github.com/amazon-science/cusp.
中文: 提出的谱黎曼图神经网络(CUSP)融合了几何曲率和谱滤波方法,显著提升了图表示学习能力,在节点分类和链接预测任务中表现出卓越性能。
English: The proposed Spectro-Riemannian Graph Neural Networks (CUSP) unify geometric curvature and spectral filtering to enhance graph representation learning, demonstrating superior performance in node classification and link prediction tasks.

Authors:Maximilian Leitenstern, Marko Alten, Christian Bolea-Schaser, Dominik Kulmer, Marcel Weinmann, Markus Lienkamp
Title: FlexCloud: Direct, Modular Georeferencing and Drift-Correction of Point Cloud Maps
Abstract:
Current software stacks for real-world applications of autonomous driving leverage map information to ensure reliable localization, path planning, and motion prediction. An important field of research is the generation of point cloud maps, referring to the topic of simultaneous localization and mapping (SLAM). As most recent developments do not include global position data, the resulting point cloud maps suffer from internal distortion and missing georeferencing, preventing their use for map-based localization approaches. Therefore, we propose FlexCloud for an automatic georeferencing of point cloud maps created from SLAM. Our approach is designed to work modularly with different SLAM methods, utilizing only the generated local point cloud map and its odometry. Using the corresponding GNSS positions enables direct georeferencing without additional control points. By leveraging a 3D rubber-sheet transformation, we can correct distortions within the map caused by long-term drift while maintaining its structure. Our approach enables the creation of consistent, globally referenced point cloud maps from data collected by a mobile mapping system (MMS). The source code of our work is available at https://github.com/TUMFTM/FlexCloud.
Chinese: FlexCloud 是一种模块化解决方案,利用GNSS数据自动对SLAM生成的点云地图进行地理配准,并通过三维橡皮筋变换校正畸变,从而为自动驾驶应用提供全局一致的地图。
English: FlexCloud is a modular solution that automatically georeferences SLAM-generated point cloud maps using GNSS data and corrects distortions via 3D rubber-sheet transformation, enabling globally consistent maps for autonomous driving applications.

Authors:Zhichao Sun, Yepeng Liu, Huachao Zhu, Yuliang Gu, Yuda Zou, Zelong Liu, Gui-Song Xia, Bo Du, Yongchao Xu
Title: RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes
Abstract:
Drones have become prevalent robotic platforms with diverse applications, showing significant potential in Embodied Artificial Intelligence (Embodied AI). Referring Expression Comprehension (REC) enables drones to locate objects based on natural language expressions, a crucial capability for Embodied AI. Despite advances in REC for ground-level scenes, aerial views introduce unique challenges including varying viewpoints, occlusions and scale variations. To address this gap, we introduce RefDrone, a REC benchmark for drone scenes. RefDrone reveals three key challenges in REC: 1) multi-scale and small-scale target detection; 2) multi-target and no-target samples; 3) complex environment with rich contextual expressions. To efficiently construct this dataset, we develop RDAgent (referring drone annotation framework with multi-agent system), a semi-automated annotation tool for REC tasks. RDAgent ensures high-quality contextual expressions and reduces annotation cost. Furthermore, we propose Number GroundingDINO (NGDINO), a novel method designed to handle multi-target and no-target cases. NGDINO explicitly learns and utilizes the number of objects referred to in the expression. Comprehensive experiments with state-of-the-art REC methods demonstrate that NGDINO achieves superior performance on both the proposed RefDrone and the existing gRefCOCO datasets. The dataset and code are be publicly at https://github.com/sunzc-sunny/refdrone.
Chinese: RefDrone基准通过引入专门数据集和创新NGDINO方法,解决了无人机视角下指代表达理解的特殊挑战,该方法通过显式数量定位在多个数据集上实现了最优性能。
English: The RefDrone benchmark addresses unique challenges in drone-based Referring Expression Comprehension (REC) by introducing a specialized dataset and the novel NGDINO method, which outperforms existing approaches through explicit number grounding.

Authors:Alexander Nikulin, Ilya Zisman, Denis Tarasov, Nikita Lyubaykin, Andrei Polubarov, Igor Kiselev, Vladislav Kurenkov
Title: Latent Action Learning Requires Supervision in the Presence of Distractors
Abstract:
Recently, latent action learning, pioneered by Latent Action Policies (LAPO), have shown remarkable pre-training efficiency on observation-only data, offering potential for leveraging vast amounts of video available on the web for embodied AI. However, prior work has focused on distractor-free data, where changes between observations are primarily explained by ground-truth actions. Unfortunately, real-world videos contain action-correlated distractors that may hinder latent action learning. Using Distracting Control Suite (DCS) we empirically investigate the effect of distractors on latent action learning and demonstrate that LAPO struggle in such scenario. We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x, as measured by linear probing. Importantly, we show that providing supervision with ground-truth actions, as few as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average. Our findings suggest that integrating supervision during Latent Action Models (LAM) training is critical in the presence of distractors, challenging the conventional pipeline of first learning LAM and only then decoding from latent to ground-truth actions.
中文摘要:潜在动作策略(LAPO)在现实干扰物下表现不佳,但改进的LAOM方法将潜在动作质量提升8倍,且仅需2.5%的真实动作监督即可使下游任务性能提高4.2倍,这挑战了传统训练流程。
English Summary: Latent Action Policies (LAPO) struggle with real-world distractors, but the proposed LAOM modification improves latent action quality by 8x and incorporating minimal ground-truth supervision boosts downstream performance by 4.2x, challenging conventional training pipelines.

Authors:Anh-Kiet Duong, Petra Gomez-Krämer
Title: Scalable Framework for Classifying AI-Generated Content Across Modalities
Abstract:
The rapid growth of generative AI technologies has heightened the importance of effectively distinguishing between human and AI-generated content, as well as classifying outputs from diverse generative models. This paper presents a scalable framework that integrates perceptual hashing, similarity measurement, and pseudo-labeling to address these challenges. Our method enables the incorporation of new generative models without retraining, ensuring adaptability and robustness in dynamic scenarios. Comprehensive evaluations on the Defactify4 dataset demonstrate competitive performance in text and image classification tasks, achieving high accuracy across both distinguishing human and AI-generated content and classifying among generative methods. These results highlight the framework's potential for real-world applications as generative AI continues to evolve. Source codes are publicly available at https://github.com/ffyyytt/defactify4.
中文: 本文提出了一种可扩展框架,结合感知哈希、相似度测量和伪标记技术,无需重新训练即可有效区分人类与AI生成内容并分类不同生成模型的输出,在Defactify4数据集上的评估展现了高准确率。
English: This paper introduces a scalable framework that uses perceptual hashing, similarity measurement, and pseudo-labeling to effectively distinguish human from AI-generated content and classify outputs from various generative models without retraining, demonstrating high accuracy in evaluations on the Defactify4 dataset.

Authors:Zhixi Cai, Fucai Ke, Simindokht Jahangard, Maria Garcia de la Banda, Reza Haffari, Peter J. Stuckey, Hamid Rezatofighi
Title: NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning
Abstract:
Visual Grounding (VG) tasks, such as referring expression detection and segmentation tasks are important for linking visual entities to context, especially in complex reasoning tasks that require detailed query interpretation. This paper explores VG beyond basic perception, highlighting challenges for methods that require reasoning like human cognition. Recent advances in large language methods (LLMs) and Vision-Language methods (VLMs) have improved abilities for visual comprehension, contextual understanding, and reasoning. These methods are mainly split into end-to-end and compositional methods, with the latter offering more flexibility. Compositional approaches that integrate LLMs and foundation models show promising performance but still struggle with complex reasoning with language-based logical representations. To address these limitations, we propose NAVER, a compositional visual grounding method that integrates explicit probabilistic logic reasoning within a finite-state automaton, equipped with a self-correcting mechanism. This design improves robustness and interpretability in inference through explicit logic reasoning. Our results show that NAVER achieves SoTA performance comparing to recent end-to-end and compositional baselines. The code is available at https://github.com/ControlNet/NAVER .
Chinese: 本文提出NAVER组合式视觉定位方法,通过集成概率逻辑推理和自校正机制来增强鲁棒性和可解释性,在复杂推理任务中实现了最先进的性能。
English: This paper introduces NAVER, a compositional visual grounding method that integrates probabilistic logic reasoning and a self-correcting mechanism to enhance robustness and interpretability, achieving state-of-the-art performance in complex reasoning tasks.

Authors:Yu Feng, Yangli-ao Geng, Yifan Zhu, Zongfu Han, Xie Yu, Kaiwen Xue, Haoran Luo, Mengyang Sun, Guangwei Zhang, Meina Song
Title: PM-MOE: Mixture of Experts on Private Model Parameters for Personalized Federated Learning
Abstract:
Federated learning (FL) has gained widespread attention for its privacy-preserving and collaborative learning capabilities. Due to significant statistical heterogeneity, traditional FL struggles to generalize a shared model across diverse data domains. Personalized federated learning addresses this issue by dividing the model into a globally shared part and a locally private part, with the local model correcting representation biases introduced by the global model. Nevertheless, locally converged parameters more accurately capture domain-specific knowledge, and current methods overlook the potential benefits of these parameters. To address these limitations, we propose PM-MoE architecture. This architecture integrates a mixture of personalized modules and an energy-based personalized modules denoising, enabling each client to select beneficial personalized parameters from other clients. We applied the PM-MoE architecture to nine recent model-split-based personalized federated learning algorithms, achieving performance improvements with minimal additional training. Extensive experiments on six widely adopted datasets and two heterogeneity settings validate the effectiveness of our approach. The source code is available at \url{https://github.com/dannis97500/PM-MOE}.
中文摘要:提出的PM-MoE架构通过让客户端选择性整合其他客户端的有利参数,在多个数据集上以最少额外训练实现了个性化联邦学习性能的提升。
English Summary: The proposed PM-MoE architecture enhances personalized federated learning by enabling clients to selectively integrate beneficial parameters from others, achieving improved performance across multiple datasets with minimal extra training.

Authors:Yurui Li, Yuxuan Chen, Li Zhang, Shijian Li, Gang Pan
Title: The Composite Task Challenge for Cooperative Multi-Agent Reinforcement Learning
Abstract:
The significant role of division of labor (DOL) in promoting cooperation is widely recognized in real-world applications.Many cooperative multi-agent reinforcement learning (MARL) methods have incorporated the concept of DOL to improve cooperation among agents.However, the tasks used in existing testbeds typically correspond to tasks where DOL is often not a necessary feature for achieving optimal policies.Additionally, the full utilize of DOL concept in MARL methods remains unrealized due to the absence of appropriate tasks.To enhance the generality and applicability of MARL methods in real-world scenarios, there is a necessary to develop tasks that demand multi-agent DOL and cooperation.In this paper, we propose a series of tasks designed to meet these requirements, drawing on real-world rules as the guidance for their design.We guarantee that DOL and cooperation are necessary condition for completing tasks and introduce three factors to expand the diversity of proposed tasks to cover more realistic situations.We evaluate 10 cooperative MARL methods on the proposed tasks.The results indicate that all baselines perform poorly on these tasks.To further validate the solvability of these tasks, we also propose simplified variants of proposed tasks.Experimental results show that baselines are able to handle these simplified variants, providing evidence of the solvability of the proposed tasks.The source files is available at https://github.com/Yurui-Li/CTC.
Chinese: 本文提出了需要分工与合作的新型多智能体任务,揭示了现有方法的不足,并通过简化版本验证了任务的可解性。
English: This paper introduces new multi-agent tasks requiring division of labor and cooperation, revealing current methods' limitations while demonstrating solvability through simplified variants.

Authors:Yuan Gao, Hao Wu, Ruiqi Shu, Huanshuo Dong, Fan Xu, Rui Ray Chen, Yibo Yan, Qingsong Wen, Xuming Hu, Kun Wang, Jiahao Wu, Qing Li, Hui Xiong, Xiaomeng Huang
Title: OneForecast: A Universal Framework for Global and Regional Weather Forecasting
Abstract:
Accurate weather forecasts are important for disaster prevention, agricultural planning, etc. Traditional numerical weather prediction (NWP) methods offer physically interpretable high-accuracy predictions but are computationally expensive and fail to fully leverage rapidly growing historical data. In recent years, deep learning models have made significant progress in weather forecasting, but challenges remain, such as balancing global and regional high-resolution forecasts, excessive smoothing in extreme event predictions, and insufficient dynamic system modeling. To address these issues, this paper proposes a global-regional nested weather forecasting framework (OneForecast) based on graph neural networks. By combining a dynamic system perspective with multi-grid theory, we construct a multi-scale graph structure and densify the target region to capture local high-frequency features. We introduce an adaptive messaging mechanism, using dynamic gating units to deeply integrate node and edge features for more accurate extreme event forecasting. For high-resolution regional forecasts, we propose a neural nested grid method to mitigate boundary information loss. Experimental results show that OneForecast performs excellently across global to regional scales and short-term to long-term forecasts, especially in extreme event predictions. Codes link https://github.com/YuanGao-YG/OneForecast.
Chinese: 本文提出OneForecast框架,通过图神经网络构建全球与区域嵌套的天气预测模型,结合动态系统和多尺度结构,利用自适应消息机制显著提升了极端天气事件的预测精度。
English: This paper introduces OneForecast, a global-regional nested weather forecasting framework using graph neural networks to enhance prediction accuracy, particularly for extreme events, by integrating dynamic systems with multi-scale structures and adaptive messaging.

Authors:Yuan Gao, Hao Wu, Ruiqi Shu, Huanshuo Dong, Fan Xu, Rui Ray Chen, Yibo Yan, Qingsong Wen, Xuming Hu, Kun Wang, Jiahao Wu, Qing Li, Hui Xiong, Xiaomeng Huang
Title: OneForecast: A Universal Framework for Global and Regional Weather Forecasting
Abstract:
Accurate weather forecasts are important for disaster prevention, agricultural planning, etc. Traditional numerical weather prediction (NWP) methods offer physically interpretable high-accuracy predictions but are computationally expensive and fail to fully leverage rapidly growing historical data. In recent years, deep learning models have made significant progress in weather forecasting, but challenges remain, such as balancing global and regional high-resolution forecasts, excessive smoothing in extreme event predictions, and insufficient dynamic system modeling. To address these issues, this paper proposes a global-regional nested weather forecasting framework (OneForecast) based on graph neural networks. By combining a dynamic system perspective with multi-grid theory, we construct a multi-scale graph structure and densify the target region to capture local high-frequency features. We introduce an adaptive messaging mechanism, using dynamic gating units to deeply integrate node and edge features for more accurate extreme event forecasting. For high-resolution regional forecasts, we propose a neural nested grid method to mitigate boundary information loss. Experimental results show that OneForecast performs excellently across global to regional scales and short-term to long-term forecasts, especially in extreme event predictions. Codes link https://github.com/YuanGao-YG/OneForecast.
Chinese: 本文提出OneForecast框架,通过图神经网络构建全球与区域嵌套的天气预测模型,结合动态系统和多尺度结构,利用自适应消息机制显著提升了极端天气事件的预测精度。
English: This paper introduces OneForecast, a global-regional nested weather forecasting framework using graph neural networks to enhance prediction accuracy, particularly for extreme events, by integrating dynamic systems with multi-scale structures and adaptive messaging.

Authors:Xin Xu, Qiyun Xu, Tong Xiao, Tianhao Chen, Yuchen Yan, Jiaxin Zhang, Shizhe Diao, Can Yang, Yang Wang
Title: UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks, particularly in mathematics. However, the domain of physics reasoning presents unique challenges that have received significantly less attention. Existing benchmarks often fall short in evaluating LLMs' abilities on the breadth and depth of undergraduate-level physics, underscoring the need for a comprehensive evaluation. To fill this gap, we introduce UGPhysics, a large-scale and comprehensive benchmark specifically designed to evaluate UnderGraduate-level Physics (UGPhysics) reasoning with LLMs. UGPhysics includes 5,520 undergraduate-level physics problems in both English and Chinese, covering 13 subjects with seven different answer types and four distinct physics reasoning skills, all rigorously screened for data leakage. Additionally, we develop a Model-Assistant Rule-based Judgment (MARJ) pipeline specifically tailored for assessing answer correctness of physics problems, ensuring accurate evaluation. Our evaluation of 31 leading LLMs shows that the highest overall accuracy, 49.8% (achieved by OpenAI-o1-mini), emphasizes the necessity for models with stronger physics reasoning skills, beyond math abilities. We hope UGPhysics, along with MARJ, will drive future advancements in AI for physics reasoning. Codes and data are available at https://github.com/YangLabHKUST/UGPhysics .
中文:UGPhysics是一个专为评估大语言模型在本科物理推理能力而设计的大规模基准测试,通过揭示现有模型的显著性能差距并引入定制化评估方法,旨在推动人工智能在物理推理领域的未来发展。
English: UGPhysics is a comprehensive benchmark designed to evaluate undergraduate-level physics reasoning in large language models, revealing significant performance gaps and introducing a specialized assessment pipeline to advance AI capabilities in this domain.

Authors:Kai Liu, Kaicheng Yang, Zheng Chen, Zhiteng Li, Yong Guo, Wenbo Li, Linghe Kong, Yulun Zhang
Title: BiMaCoSR: Binary One-Step Diffusion Model Leveraging Flexible Matrix Compression for Real Super-Resolution
Abstract:
While super-resolution (SR) methods based on diffusion models (DM) have demonstrated inspiring performance, their deployment is impeded due to the heavy request of memory and computation. Recent researchers apply two kinds of methods to compress or fasten the DM. One is to compress the DM into 1-bit, aka binarization, alleviating the storage and computation pressure. The other distills the multi-step DM into only one step, significantly speeding up inference process. Nonetheless, it remains impossible to deploy DM to resource-limited edge devices. To address this problem, we propose BiMaCoSR, which combines binarization and one-step distillation to obtain extreme compression and acceleration. To prevent the catastrophic collapse of the model caused by binarization, we proposed sparse matrix branch (SMB) and low rank matrix branch (LRMB). Both auxiliary branches pass the full-precision (FP) information but in different ways. SMB absorbs the extreme values and its output is high rank, carrying abundant FP information. Whereas, the design of LRMB is inspired by LoRA and is initialized with the top r SVD components, outputting low rank representation. The computation and storage overhead of our proposed branches can be safely ignored. Comprehensive comparison experiments are conducted to exhibit BiMaCoSR outperforms current state-of-the-art binarization methods and gains competitive performance compared with FP one-step model. BiMaCoSR achieves a 23.8x compression ratio and a 27.4x speedup ratio compared to FP counterpart. Our code and model are available at https://github.com/Kai-Liu001/BiMaCoSR.
Chinese: BiMaCoSR 结合二值化和一步蒸馏技术,通过引入稀疏矩阵分支和低秩矩阵分支保留全精度信息,实现了扩散模型超分辨率的极致压缩与加速,使其能在资源受限的边缘设备上高效部署。
English: BiMaCoSR combines binarization and one-step distillation to achieve extreme compression and acceleration for diffusion model-based super-resolution, overcoming deployment barriers on resource-limited devices with auxiliary branches that preserve full-precision information.

Authors:Chenhui Xu, Dancheng Liu, Yuting Hu, Jiajie Li, Ruiyang Qin, Qingxiao Zheng, Jinjun Xiong
Title: Sub-Sequential Physics-Informed Learning with State Space Model
Abstract:
Physics-Informed Neural Networks (PINNs) are a kind of deep-learning-based numerical solvers for partial differential equations (PDEs). Existing PINNs often suffer from failure modes of being unable to propagate patterns of initial conditions. We discover that these failure modes are caused by the simplicity bias of neural networks and the mismatch between PDE's continuity and PINN's discrete sampling. We reveal that the State Space Model (SSM) can be a continuous-discrete articulation allowing initial condition propagation, and that simplicity bias can be eliminated by aligning a sequence of moderate granularity. Accordingly, we propose PINNMamba, a novel framework that introduces sub-sequence modeling with SSM. Experimental results show that PINNMamba can reduce errors by up to 86.3\% compared with state-of-the-art architecture. Our code is available at https://github.com/miniHuiHui/PINNMamba.
中文: PINNMamba 这一创新框架通过状态空间模型进行子序列建模,解决了物理信息神经网络无法传播初始条件模式的问题,相比现有最优架构可将误差降低高达 86.3%。
English: PINNMamba, a novel framework using State Space Models for sub-sequence modeling, overcomes the limitations of Physics-Informed Neural Networks by enabling initial condition propagation and reducing errors by up to 86.3% compared to existing methods.

Authors:Jihyeok Kim, Seongwoo Moon, Sungwon Nah, David Hyunchul Shim
Title: MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model
Abstract:
This paper proposes novel methods to enhance the performance of monocular 3D object detection models by leveraging the generalized feature extraction capabilities of a vision foundation model. Unlike traditional CNN-based approaches, which often suffer from inaccurate depth estimation and rely on multi-stage object detection pipelines, this study employs a Vision Transformer (ViT)-based foundation model as the backbone, which excels at capturing global features for depth estimation. It integrates a detection transformer (DETR) architecture to improve both depth estimation and object detection performance in a one-stage manner. Specifically, a hierarchical feature fusion block is introduced to extract richer visual features from the foundation model, further enhancing feature extraction capabilities. Depth estimation accuracy is further improved by incorporating a relative depth estimation model trained on large-scale data and fine-tuning it through transfer learning. Additionally, the use of queries in the transformer's decoder, which consider reference points and the dimensions of 2D bounding boxes, enhances recognition performance. The proposed model outperforms recent state-of-the-art methods, as demonstrated through quantitative and qualitative evaluations on the KITTI 3D benchmark and a custom dataset collected from high-elevation racing environments. Code is available at https://github.com/JihyeokKim/MonoDINO-DETR.
中文: 本文提出了一种新颖的单目3D物体检测方法,采用视觉Transformer基础模型和DETR架构,通过分层特征融合提升了深度估计与检测性能,在基准数据集上取得了最优结果。
English: This paper introduces a novel monocular 3D object detection method using a Vision Transformer foundation model and DETR architecture, which improves depth estimation and detection performance through hierarchical feature fusion and achieves state-of-the-art results on benchmark datasets.

Authors:Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara
Title: SigWavNet: Learning Multiresolution Signal Wavelet Network for Speech Emotion Recognition
Abstract:
In the field of human-computer interaction and psychological assessment, speech emotion recognition (SER) plays an important role in deciphering emotional states from speech signals. Despite advancements, challenges persist due to system complexity, feature distinctiveness issues, and noise interference. This paper introduces a new end-to-end (E2E) deep learning multi-resolution framework for SER, addressing these limitations by extracting meaningful representations directly from raw waveform speech signals. By leveraging the properties of the fast discrete wavelet transform (FDWT), including the cascade algorithm, conjugate quadrature filter, and coefficient denoising, our approach introduces a learnable model for both wavelet bases and denoising through deep learning techniques. The framework incorporates an activation function for learnable asymmetric hard thresholding of wavelet coefficients. Our approach exploits the capabilities of wavelets for effective localization in both time and frequency domains. We then combine one-dimensional dilated convolutional neural networks (1D dilated CNN) with a spatial attention layer and bidirectional gated recurrent units (Bi-GRU) with a temporal attention layer to efficiently capture the nuanced spatial and temporal characteristics of emotional features. By handling variable-length speech without segmentation and eliminating the need for pre or post-processing, the proposed model outperformed state-of-the-art methods on IEMOCAP and EMO-DB datasets. The source code of this paper is shared on the Github repository: https://github.com/alaaNfissi/SigWavNet-Learning-Multiresolution-Signal-Wavelet-Network-for-Speech-Emotion-Recognition.
Chinese: 本文提出了一种新颖的端到端深度学习框架,通过结合小波变换和注意力神经网络直接处理原始语音信号,在无需分段或预处理的情况下,在基准数据集上实现了优于现有方法的语音情感识别性能。
English: This paper presents a novel end-to-end deep learning framework for speech emotion recognition that leverages wavelet transforms and attention-based neural networks to directly process raw speech signals, achieving superior performance on benchmark datasets without requiring segmentation or preprocessing.

Authors:Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Yue Liu, Bo Li, Xuming Hu, Xiaowen Chu
Title: ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
Abstract:
Large Language Models (LLMs) require significant GPU memory when processing long texts, with the key value (KV) cache consuming up to 70\% of total memory during inference. Although existing compression methods reduce memory by evaluating the importance of individual tokens, they overlook critical semantic relationships between tokens, resulting in fragmented context and degraded performance. We introduce ChunkKV, which fundamentally reimagines KV cache compression by treating semantic chunks - rather than isolated tokens - as basic compression units. This approach preserves complete linguistic structures and contextual integrity, ensuring that essential meaning is retained even under aggressive compression. Our innovation includes a novel layer-wise index reuse technique that exploits the higher cross-layer similarity of preserved indices in ChunkKV, reducing computational overhead and improving throughput by 26.5\%. Comprehensive evaluations on challenging benchmarks: LongBench, Needle-In-A-HayStack, GSM8K, and JailbreakV demonstrate that ChunkKV outperforms state-of-the-art methods by up to 8.7\% in precision while maintaining the same compression ratio. These results confirm that semantic-aware compression significantly enhances both efficiency and performance for long-context LLM inference, providing a simple yet effective solution to the memory bottleneck problem. The code is available at \href{https://github.com/NVIDIA/kvpress}{link}.
中文摘要:ChunkKV提出了一种语义感知的KV缓存压缩方法,将语义块而非单个词元作为基本压缩单元,在保持上下文完整性的同时,将长文本推理性能提升高达8.7%,吞吐量提高26.5%。
English Summary: ChunkKV introduces a semantic-aware KV cache compression method that treats chunks of tokens as basic units, preserving contextual integrity and improving performance by up to 8.7% while boosting throughput by 26.5% in long-context LLM inference.

Authors:Shengyu Feng, Yiming Yang
Title: Regularized Langevin Dynamics for Combinatorial Optimization
Abstract:
This work proposes a simple yet effective sampling framework for combinatorial optimization (CO). Our method builds on discrete Langevin dynamics (LD), an efficient gradient-guided generative paradigm. However, we observe that directly applying LD often leads to limited exploration. To overcome this limitation, we propose the Regularized Langevin Dynamics (RLD), which enforces an expected distance between the sampled and current solutions, effectively avoiding local minima. We develop two CO solvers on top of RLD, one based on simulated annealing (SA), and the other one based on neural network (NN). Empirical results on three classic CO problems demonstrate that both of our methods can achieve comparable or better performance against the previous state-of-the-art (SOTA) SA- and NN-based solvers. In particular, our SA algorithm reduces the runtime of the previous SOTA SA method by up to 80\%, while achieving equal or superior performance. In summary, RLD offers a promising framework for enhancing both traditional heuristics and NN models to solve CO problems. Our code is available at https://github.com/Shengyu-Feng/RLD4CO.
中文: 本研究提出了正则化朗之万动力学(RLD)采样框架,通过避免局部最优来改进组合优化,并开发了两种求解器,在性能和效率上均达到或超越了现有最优方法。
English: This study introduces Regularized Langevin Dynamics (RLD), a sampling framework that enhances combinatorial optimization by preventing local minima, and develops two solvers that match or surpass state-of-the-art methods in performance and efficiency.

Authors:Binchi Zhang, Zaiyi Zheng, Zhengzhang Chen, Jundong Li
Title: Beyond the Permutation Symmetry of Transformers: The Role of Rotation for Model Fusion
Abstract:
Symmetry in the parameter space of deep neural networks (DNNs) has proven beneficial for various deep learning applications. A well-known example is the permutation symmetry in Multi-Layer Perceptrons (MLPs), where permuting the rows of weight matrices in one layer and applying the inverse permutation to adjacent layers yields a functionally equivalent model. While permutation symmetry fully characterizes the equivalence set for MLPs, its discrete nature limits its utility for transformers. In this paper, we introduce rotation symmetry, a novel form of parameter space symmetry for transformers that generalizes permutation symmetry by rotating parameter matrices in self-attention layers. Unlike permutation symmetry, rotation symmetry operates in a continuous domain, thereby significantly expanding the equivalence set for transformers. Based on this property, we propose a theoretically optimal parameter matching algorithm as a plug-and-play module to enhance model fusion. We evaluate our approach using pre-trained transformers across diverse natural language and vision tasks. Experimental results demonstrate that our rotation symmetry-based matching algorithm substantially improves model fusion, highlighting the potential of parameter space symmetry to facilitate model fusion. Our code is available on https://github.com/zhengzaiyi/RotationSymmetry.
中文摘要:本文提出旋转对称性,一种适用于Transformer的连续参数空间对称性,推广了离散的置换对称性,并通过最优匹配算法显著提升了跨语言与视觉任务的模型融合效果。
English Summary: The paper introduces rotation symmetry, a continuous parameter space symmetry for transformers that generalizes discrete permutation symmetry, and proposes an optimal matching algorithm to significantly enhance model fusion across language and vision tasks.

Authors:Takumu Fujioka, Gouhei Tanaka
Title: Transformer-Based Vector Font Classification Using Different Font Formats: TrueType versus PostScript
Abstract:
Modern fonts adopt vector-based formats, which ensure scalability without loss of quality. While many deep learning studies on fonts focus on bitmap formats, deep learning for vector fonts remains underexplored. In studies involving deep learning for vector fonts, the choice of font representation has often been made conventionally. However, the font representation format is one of the factors that can influence the computational performance of machine learning models in font-related tasks. Here we show that font representations based on PostScript outlines outperform those based on TrueType outlines in Transformer-based vector font classification. TrueType outlines represent character shapes as sequences of points and their associated flags, whereas PostScript outlines represent them as sequences of commands. In previous research, PostScript outlines have been predominantly used when fonts are treated as part of vector graphics, while TrueType outlines are mainly employed when focusing on fonts alone. Whether to use PostScript or TrueType outlines has been mainly determined by file format specifications and precedent settings in previous studies, rather than performance considerations. To date, few studies have compared which outline format provides better embedding representations. Our findings suggest that information aggregation is crucial in Transformer-based deep learning for vector graphics, as in tokenization in language models and patch division in bitmap-based image recognition models. This insight provides valuable guidance for selecting outline formats in future research on vector graphics.
现代矢量字体中,基于PostScript轮廓的表示在基于Transformer的分类任务中优于TrueType格式,因其具有更出色的信息聚合能力。
Modern vector fonts using PostScript outlines outperform TrueType formats in Transformer-based classification due to superior information aggregation capabilities.

Authors:Yasi Zhang, Oscar Leong
Title: Learning Difference-of-Convex Regularizers for Inverse Problems: A Flexible Framework with Theoretical Guarantees
Abstract:
Learning effective regularization is crucial for solving ill-posed inverse problems, which arise in a wide range of scientific and engineering applications. While data-driven methods that parameterize regularizers using deep neural networks have demonstrated strong empirical performance, they often result in highly nonconvex formulations that lack theoretical guarantees. Recent work has shown that incorporating structured nonconvexity into neural network-based regularizers, such as weak convexity, can strike a balance between empirical performance and theoretical tractability. In this paper, we demonstrate that a broader class of nonconvex functions, difference-of-convex (DC) functions, can yield improved empirical performance while retaining strong convergence guarantees. The DC structure enables the use of well-established optimization algorithms, such as the Difference-of-Convex Algorithm (DCA) and a Proximal Subgradient Method (PSM), which extend beyond standard gradient descent. Furthermore, we provide theoretical insights into the conditions under which optimal regularizers can be expressed as DC functions. Extensive experiments on computed tomography (CT) reconstruction tasks show that our approach achieves strong performance across sparse and limited-view settings, consistently outperforming other weakly supervised learned regularizers. Our code is available at \url{https://github.com/YasminZhang/ADCR}.
Chinese: 本文提出了一种用于逆问题中学习正则化器的凸差函数框架,该框架在CT重建任务中显著提升了实证性能并具备坚实的理论保证,优于现有方法。
English: This paper introduces a difference-of-convex (DC) function framework for learning regularizers in inverse problems, which enhances empirical performance with strong theoretical guarantees and outperforms existing methods in CT reconstruction tasks.

Authors:Akiyoshi Tomihari, Issei Sato
Title: Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers
Abstract:
Transformers are challenging to optimize with SGD and typically require adaptive optimizers such as Adam. However, the reasons behind the superior performance of Adam over SGD remain unclear. In this study, we investigate the optimization of transformers by focusing on gradient heterogeneity, defined as the disparity in gradient norms among parameters. Our analysis shows that gradient heterogeneity hinders gradient-based optimization, including SGD, while sign-based optimization, a simplified variant of Adam, is less affected. We further examine gradient heterogeneity in transformers and show that it is influenced by the placement of layer normalization. Experimental results from fine-tuning transformers in both NLP and vision domains validate our theoretical analyses. This study provides insights into the optimization challenges of transformers and offers guidance for designing future optimization algorithms. Code is available at https://github.com/tom4649/gradient-heterogeneity.
中文: 本研究揭示了梯度异质性(受层归一化位置影响)阻碍了Transformer中SGD的优化,而像Adam这样的基于符号的优化方法更具鲁棒性,并在NLP和视觉任务中得到了实验验证。
English: This study reveals that gradient heterogeneity, influenced by layer normalization placement, hinders SGD optimization in transformers, while sign-based methods like Adam are more robust, with experimental validation across NLP and vision tasks.

Authors:Kefan Dong, Tengyu Ma
Title: STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving
Abstract:
A fundamental challenge in formal theorem proving by LLMs is the lack of high-quality training data. Although reinforcement learning or expert iteration partially mitigates this issue by alternating between LLM generating proofs and finetuning them on correctly generated ones, performance quickly plateaus due to the scarcity of correct proofs (sparse rewards). To keep improving the models with limited data, we draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises (which are often variants of known results) and attempting to solve them. We design the Self-play Theorem Prover (STP) that simultaneously takes on two roles, conjecturer and prover, each providing training signals to the other. The conjecturer is trained iteratively on previously generated conjectures that are barely provable by the current prover, which incentivizes it to generate increasingly challenging conjectures over time. The prover attempts to prove the conjectures with standard expert iteration. We evaluate STP with both Lean and Isabelle formal versifiers. With 51.3 billion tokens generated during the training in Lean, STP proves 28.5% of the statements in the LeanWorkbook dataset, doubling the previous best result of 13.2% achieved through expert iteration. The final model achieves state-of-the-art performance among whole-proof generation methods on miniF2F-test (65.0%, pass@3200), Proofnet-test (23.9%, pass@3200) and PutnamBench (8/644, pass@3200). We release our code, model, and dataset in this URL: https://github.com/kfdong/STP.
Chinese: 自博弈定理证明器(STP)通过让猜想器生成日益复杂的定理、证明器尝试解决它们,有效应对了形式定理证明中高质量训练数据不足的问题,并在多个基准测试中取得了领先性能。
English: The Self-play Theorem Prover (STP) addresses the scarcity of high-quality training data in formal theorem proving by having a conjecturer generate increasingly challenging theorems and a prover solve them, achieving state-of-the-art results on multiple benchmarks.

Authors:Abdurrahim Yilmaz, Furkan Yuceyalcin, Ece Gokyayla, Donghee Choi, Ozan Erdem, Ali Anil Demircali, Rahmetullah Varol, Ufuk Gorkem Kirabali, Gulsum Gencoglan, Joram M. Posma, Burak Temelkuran
Title: DermaSynth: Rich Synthetic Image-Text Pairs Using Open Access Dermatology Datasets
Abstract:
A major barrier to developing vision large language models (LLMs) in dermatology is the lack of large image--text pairs dataset. We introduce DermaSynth, a dataset comprising of 92,020 synthetic image--text pairs curated from 45,205 images (13,568 clinical and 35,561 dermatoscopic) for dermatology-related clinical tasks. Leveraging state-of-the-art LLMs, using Gemini 2.0, we used clinically related prompts and self-instruct method to generate diverse and rich synthetic texts. Metadata of the datasets were incorporated into the input prompts by targeting to reduce potential hallucinations. The resulting dataset builds upon open access dermatological image repositories (DERM12345, BCN20000, PAD-UFES-20, SCIN, and HIBA) that have permissive CC-BY-4.0 licenses. We also fine-tuned a preliminary Llama-3.2-11B-Vision-Instruct model, DermatoLlama 1.0, on 5,000 samples. We anticipate this dataset to support and accelerate AI research in dermatology. Data and code underlying this work are accessible at https://github.com/abdurrahimyilmaz/DermaSynth.
中文: 开发皮肤病学视觉大模型的主要障碍是缺乏大型图像-文本配对数据集,为此我们推出了DermaSynth,这是一个包含92,020对合成图像-文本的数据集,利用先进大语言模型和临床提示生成,旨在支持和加速皮肤病学人工智能研究。
English: The main challenge in developing vision large language models for dermatology is the scarcity of large image-text datasets, which is addressed by the introduction of DermaSynth, a comprehensive synthetic dataset of 92,020 image-text pairs generated using advanced LLMs and clinical prompts to support AI research in the field.

Authors:Mateus de Souza Miranda, Ronny Hänsch, Valdivino Alexandre de Santiago Júnior, Thales Sehn Körting, Erison Carlos dos Santos Monteiro
Title: CerraData-4MM: A multimodal benchmark dataset on Cerrado for land use and land cover classification
Abstract:
The Cerrado faces increasing environmental pressures, necessitating accurate land use and land cover (LULC) mapping despite challenges such as class imbalance and visually similar categories. To address this, we present CerraData-4MM, a multimodal dataset combining Sentinel-1 Synthetic Aperture Radar (SAR) and Sentinel-2 MultiSpectral Imagery (MSI) with 10m spatial resolution. The dataset includes two hierarchical classification levels with 7 and 14 classes, respectively, focusing on the diverse Bico do Papagaio ecoregion. We highlight CerraData-4MM's capacity to benchmark advanced semantic segmentation techniques by evaluating a standard U-Net and a more sophisticated Vision Transformer (ViT) model. The ViT achieves superior performance in multimodal scenarios, with the highest macro F1-score of 57.60% and a mean Intersection over Union (mIoU) of 49.05% at the first hierarchical level. Both models struggle with minority classes, particularly at the second hierarchical level, where U-Net's performance drops to an F1-score of 18.16%. Class balancing improves representation for underrepresented classes but reduces overall accuracy, underscoring the trade-off in weighted training. CerraData-4MM offers a challenging benchmark for advancing deep learning models to handle class imbalance and multimodal data fusion. Code, trained models, and data are publicly available at https://github.com/ai4luc/CerraData-4MM.
中文: CerraData-4MM多模态数据集通过整合卫星影像应对塞拉多地区环境测绘挑战,为深度学习模型提供基准测试平台,其中视觉变换器模型表现优于U-Net,但在少数类别识别上仍存在困难。
English: The CerraData-4MM multimodal dataset addresses environmental mapping challenges in the Cerrado by combining satellite imagery to benchmark deep learning models, with Vision Transformers outperforming U-Net despite persistent difficulties with minority classes.

Authors:Bidossessi Emmanuel Agossou, Marius Pedersen, Kiran Raja, Anuja Vats, PÃ¥l Anders Floor
Title: Influence of color correction on pathology detection in Capsule Endoscopy
Abstract:
Pathology detection in Wireless Capsule Endoscopy (WCE) using deep learning has been explored in the recent past. However, deep learning models can be influenced by the color quality of the dataset used to train them, impacting detection, segmentation and classification tasks. In this work, we evaluate the impact of color correction on pathology detection using two prominent object detection models: Retinanet and YOLOv5. We first generate two color corrected versions of a popular WCE dataset (i.e., SEE-AI dataset) using two different color correction functions. We then evaluate the performance of the Retinanet and YOLOv5 on the original and color corrected versions of the dataset. The results reveal that color correction makes the models generate larger bounding boxes and larger intersection areas with the ground truth annotations. Furthermore, color correction leads to an increased number of false positives for certain pathologies. However, these effects do not translate into a consistent improvement in performance metrics such as F1-scores, IoU, and AP50. The code is available at https://github.com/agossouema2011/WCE2024. Keywords: Wireless Capsule Endoscopy, Color correction, Retinanet, YOLOv5, Detection
中文摘要:本研究评估了颜色校正对无线胶囊内窥镜病理检测的影响,发现尽管颜色校正会扩大检测框并增加假阳性,但并未持续提升关键性能指标。
English Summary: This study evaluates how color correction affects pathology detection in Wireless Capsule Endoscopy using Retinanet and YOLOv5, finding that while it enlarges bounding boxes and increases false positives, it does not consistently improve key performance metrics.

Authors:Dong-Hee Paek, Seung-Hyun Kong
Title: SpikingRTNH: Spiking Neural Network for 4D Radar Object Detection
Abstract:
Recently, 4D Radar has emerged as a crucial sensor for 3D object detection in autonomous vehicles, offering both stable perception in adverse weather and high-density point clouds for object shape recognition. However, processing such high-density data demands substantial computational resources and energy consumption. We propose SpikingRTNH, the first spiking neural network (SNN) for 3D object detection using 4D Radar data. By replacing conventional ReLU activation functions with leaky integrate-and-fire (LIF) spiking neurons, SpikingRTNH achieves significant energy efficiency gains. Furthermore, inspired by human cognitive processes, we introduce biological top-down inference (BTI), which processes point clouds sequentially from higher to lower densities. This approach effectively utilizes points with lower noise and higher importance for detection. Experiments on K-Radar dataset demonstrate that SpikingRTNH with BTI significantly reduces energy consumption by 78% while achieving comparable detection performance to its ANN counterpart (51.1% AP 3D, 57.0% AP BEV). These results establish the viability of SNNs for energy-efficient 4D Radar-based object detection in autonomous driving systems. All codes are available at https://github.com/kaist-avelab/k-radar.
Chinese: SpikingRTNH是首个利用4D雷达数据进行3D物体检测的脉冲神经网络,通过仿生自上而下推理机制,在保持同等检测性能的同时实现了78%的能耗降低。
English: SpikingRTNH is the first spiking neural network for 3D object detection using 4D Radar data, achieving 78% energy reduction while maintaining comparable detection performance through biological top-down inference.

Authors:Soon Jynn Chu, Nalaka Amarasiri, Sandesh Giri, Priyata Kafle
Title: Blood Glucose Level Prediction in Type 1 Diabetes Using Machine Learning
Abstract:
Type 1 Diabetes is a chronic autoimmune condition in which the immune system attacks and destroys insulin-producing beta cells in the pancreas, resulting in little to no insulin production. Insulin helps glucose in your blood enter your muscle, fat, and liver cells so they can use it for energy or store it for later use. If insulin is insufficient, it causes sugar to build up in the blood and leads to serious health problems. People with Type 1 Diabetes need synthetic insulin every day. In diabetes management, continuous glucose monitoring is an important feature that provides near real-time blood glucose data. It is useful in deciding the synthetic insulin dose. In this research work, we used machine learning tools, deep neural networks, deep reinforcement learning, and voting and stacking regressors to predict blood glucose levels at 30-min time intervals using the latest DiaTrend dataset. Predicting blood glucose levels is useful in better diabetes management systems. The trained models were compared using several evaluation metrics. Our evaluation results demonstrate the performance of various models across different glycemic conditions for blood glucose prediction. The source codes of this work can be found in: https://github.com/soon-jynn-chu/t1d_bg_prediction
中文: 本研究应用机器学习技术,基于DiaTrend数据集以30分钟为间隔预测血糖水平,旨在通过改进血糖监测和胰岛素剂量决策来优化糖尿病管理。
English: This research applies machine learning techniques to predict blood glucose levels at 30-minute intervals using the DiaTrend dataset, aiming to improve diabetes management through enhanced glucose monitoring and insulin dosing decisions.

Authors:Yaxi Lu, Haolun Li, Xin Cong, Zhong Zhang, Yesai Wu, Yankai Lin, Zhiyuan Liu, Fangming Liu, Maosong Sun
Title: Learning to Generate Structured Output with Schema Reinforcement Learning
Abstract:
This study investigates the structured generation capabilities of large language models (LLMs), focusing on producing valid JSON outputs against a given schema. Despite the widespread use of JSON in integrating language models with programs, there is a lack of comprehensive analysis and benchmarking of these capabilities. We explore various aspects of JSON generation, such as structure understanding, escaping, and natural language description, to determine how to assess and enable LLMs to generate valid responses. Building upon this, we propose SchemaBench features around 40K different JSON schemas to obtain and assess models' abilities in generating valid JSON. We find that the latest LLMs are still struggling to generate a valid JSON string. Moreover, we demonstrate that incorporating reinforcement learning with a Fine-grained Schema Validator can further enhance models' understanding of JSON schema, leading to improved performance. Our models demonstrate significant improvement in both generating JSON outputs and downstream tasks.
中文: 本研究探讨大型语言模型根据模式生成有效JSON输出的能力,揭示了其当前局限性,并提出结合细粒度验证器的强化学习方法,显著提升了模型性能。
English: This study examines large language models' ability to generate valid JSON outputs against schemas, revealing their current limitations and proposing a reinforcement learning approach with a fine-grained validator that significantly improves performance.

Authors:Yu Xia, Jingru Fan, Weize Chen, Siyu Yan, Xin Cong, Zhong Zhang, Yaxi Lu, Yankai Lin, Zhiyuan Liu, Maosong Sun
Title: AgentRM: Enhancing Agent Generalization with Reward Modeling
Abstract:
Existing LLM-based agents have achieved strong performance on held-in tasks, but their generalizability to unseen tasks remains poor. Hence, some recent work focus on fine-tuning the policy model with more diverse tasks to improve the generalizability. In this work, we find that finetuning a reward model to guide the policy model is more robust than directly finetuning the policy model. Based on this finding, we propose AgentRM, a generalizable reward model, to guide the policy model for effective test-time search. We comprehensively investigate three approaches to construct the reward model, including explicit reward modeling, implicit reward modeling and LLM-as-a-judge. We then use AgentRM to guide the answer generation with Best-of-N sampling and step-level beam search. On four types of nine agent tasks, AgentRM enhances the base policy model by $8.8$ points on average, surpassing the top general agent by $4.0$. Moreover, it demonstrates weak-to-strong generalization, yielding greater improvement of $12.6$ on LLaMA-3-70B policy model. As for the specializability, AgentRM can also boost a finetuned policy model and outperform the top specialized agent by $11.4$ on three held-in tasks. Further analysis verifies its effectiveness in test-time scaling. Codes will be released to facilitate the research in this area.
Chinese: 本研究提出AgentRM,一种可泛化的奖励模型,通过测试时搜索指导策略模型,在多种智能体任务上显著提升性能,展现出强大的泛化与专业化能力,优于直接微调策略模型的方法。
English: This study introduces AgentRM, a generalizable reward model that outperforms direct policy fine-tuning by guiding the policy model through test-time search, achieving significant performance gains across diverse agent tasks and demonstrating robust generalization and specialization capabilities.

Authors:Kang Fu, Huiyu Duan, Zicheng Zhang, Xiaohong Liu, Xiongkuo Min, Jia Wang, Guangtao Zhai
Title: Multi-Dimensional Quality Assessment for Text-to-3D Assets: Dataset and Model
Abstract:
Recent advancements in text-to-image (T2I) generation have spurred the development of text-to-3D asset (T23DA) generation, leveraging pretrained 2D text-to-image diffusion models for text-to-3D asset synthesis. Despite the growing popularity of text-to-3D asset generation, its evaluation has not been well considered and studied. However, given the significant quality discrepancies among various text-to-3D assets, there is a pressing need for quality assessment models aligned with human subjective judgments. To tackle this challenge, we conduct a comprehensive study to explore the T23DA quality assessment (T23DAQA) problem in this work from both subjective and objective perspectives. Given the absence of corresponding databases, we first establish the largest text-to-3D asset quality assessment database to date, termed the AIGC-T23DAQA database. This database encompasses 969 validated 3D assets generated from 170 prompts via 6 popular text-to-3D asset generation models, and corresponding subjective quality ratings for these assets from the perspectives of quality, authenticity, and text-asset correspondence, respectively. Subsequently, we establish a comprehensive benchmark based on the AIGC-T23DAQA database, and devise an effective T23DAQA model to evaluate the generated 3D assets from the aforementioned three perspectives, respectively.
中文: 文本到3D资产生成的进展凸显了与人类主观判断一致的质量评估需求,为此建立了AIGC-T23DAQA数据库和基准模型,分别从质量、真实性和文本-资产对应性三个方面进行评估。
English: Recent advances in text-to-3D generation highlight the need for quality assessment aligned with human judgment, leading to the creation of the AIGC-T23DAQA database and a benchmark model evaluating quality, authenticity, and text-asset correspondence.

Authors:Wenwen Xie, Geng Sun, Jiacheng Wang, Hongyang Du, Jiawen Kang, Dusit Niyato, Kaibin Huang, Victor C. M. Leung
Title: Multi-objective Low-altitude IRS-assisted ISAC Optimization via Generative AI-enhanced Deep Reinforcement Learning
Abstract:
Integrated sensing and communication (ISAC) has garnered substantial research interest owing to its pivotal role in advancing the development of next-generation (6G) wireless networks. However, achieving a performance balance between communication and sensing in the dual-function radar communication (DFRC)-based ISAC system remains a significant challenge. In this paper, a low-altitude intelligent reflecting surface (IRS)-assisted ISAC system is explored, where a base station (BS) supports dual-functional operations, enabling both data transmission for multiple users and sensing for a blocked target, with the channel quality enhanced by an IRS mounted on the unmanned aerial vehicle (UAV). Moreover, we formulate an integrated communication, sensing, and energy efficiency multi-objective optimization problem (CSEMOP), which aims to maximize the communication rate of the users and the sensing rate of the target, while minimizing UAV propulsion energy consumption by jointly optimizing the BS beamforming matrix, IRS phase shifts, the flight velocity and angle of the UAV. Considering the non-convexity, trade-off, and dynamic nature of the formulated CSEMOP, we propose a generative diffusion model-based deep deterministic policy gradient (GDMDDPG) algorithm to solve the problem. Specifically, the diffusion model is incorporated into the actor network of DDPG to improve the action quality, with noise perturbation mechanism for better exploration and recent prioritized experience replay (RPER) sampling mechanism for enhanced training efficiency. Simulation results indicate that the GDMDDPG algorithm delivers superior performance compared to the existing methods.
中文: 本文针对无人机载智能反射面辅助的通信感知一体化系统,提出一种基于生成扩散模型的深度强化学习算法,通过联合优化基站波束成形和无人机飞行参数,在提升通信速率与感知性能的同时降低能耗,实验表明其性能优于现有方法。
English: This paper proposes a generative diffusion model-enhanced deep reinforcement learning algorithm to optimize communication rates, sensing performance, and energy efficiency in a UAV-mounted IRS-assisted integrated sensing and communication system, demonstrating superior performance over existing methods.

Authors:Jiawen Kang, Jiana Liao, Runquan Gao, Jinbo Wen, Huawei Huang, Maomao Zhang, Changyan Yi, Tao Zhang, Dusit Niyato, Zibin Zheng
Title: Efficient and Trustworthy Block Propagation for Blockchain-enabled Mobile Embodied AI Networks: A Graph Resfusion Approach
Abstract:
By synergistically integrating mobile networks and embodied artificial intelligence (AI), Mobile Embodied AI Networks (MEANETs) represent an advanced paradigm that facilitates autonomous, context-aware, and interactive behaviors within dynamic environments. Nevertheless, the rapid development of MEANETs is accompanied by challenges in trustworthiness and operational efficiency. Fortunately, blockchain technology, with its decentralized and immutable characteristics, offers promising solutions for MEANETs. However, existing block propagation mechanisms suffer from challenges such as low propagation efficiency and weak security for block propagation, which results in delayed transmission of vehicular messages or vulnerability to malicious tampering, potentially causing severe traffic accidents in blockchain-enabled MEANETs. Moreover, current block propagation strategies cannot effectively adapt to real-time changes of dynamic topology in MEANETs. Therefore, in this paper, we propose a graph Resfusion model-based trustworthy block propagation optimization framework for consortium blockchain-enabled MEANETs. Specifically, we propose an innovative trust calculation mechanism based on the trust cloud model, which comprehensively accounts for randomness and fuzziness in the miner trust evaluation. Furthermore, by leveraging the strengths of graph neural networks and diffusion models, we develop a graph Resfusion model to effectively and adaptively generate the optimal block propagation trajectory. Simulation results demonstrate that the proposed model outperforms other routing mechanisms in terms of block propagation efficiency and trustworthiness. Additionally, the results highlight its strong adaptability to dynamic environments, making it particularly suitable for rapidly changing MEANETs.
中文: 本文针对移动体智能网络在信任和效率方面的挑战,提出基于图Resfusion模型的优化框架,通过创新信任计算机制和自适应传播轨迹生成,显著提升了区块链在动态环境中的传播可靠性与适应性。
English: MEANETs integrate mobile networks and embodied AI for autonomous behaviors but face trust and efficiency challenges, which this paper addresses by proposing a graph Resfusion model-based framework that enhances block propagation trustworthiness and adaptability in dynamic environments.

Authors:Xiaohuan Li, Shaowen Qin, Xin Tang, Jiawen Kang, Jin Ye, Zhonghua Zhao, Yusi Zheng, Dusit Niyato
Title: Meta-Computing Enhanced Federated Learning in IIoT: Satisfaction-Aware Incentive Scheme via DRL-Based Stackelberg Game
Abstract:
The Industrial Internet of Things (IIoT) leverages Federated Learning (FL) for distributed model training while preserving data privacy, and meta-computing enhances FL by optimizing and integrating distributed computing resources, improving efficiency and scalability. Efficient IIoT operations require a trade-off between model quality and training latency. Consequently, a primary challenge of FL in IIoT is to optimize overall system performance by balancing model quality and training latency. This paper designs a satisfaction function that accounts for data size, Age of Information (AoI), and training latency for meta-computing. Additionally, the satisfaction function is incorporated into the utility functions to incentivize nodes in IIoT participation in model training. We model the utility functions of servers and nodes as a two-stage Stackelberg game and employ a deep reinforcement learning approach to learn the Stackelberg equilibrium. This approach ensures balanced rewards and enhances the applicability of the incentive scheme for IIoT. Simulation results demonstrate that, under the same budget constraints, the proposed incentive scheme improves utility by at least 23.7% compared to existing FL schemes without compromising model accuracy.
中文摘要:本文设计了一种基于两阶段Stackelberg博弈的激励机制,通过深度强化学习平衡工业物联网联邦学习中的模型质量与训练延迟,在相同预算下相比现有方案至少提升23.7%的效用且不降低模型精度。
English Summary: This paper proposes a two-stage Stackelberg game-based incentive scheme using deep reinforcement learning to optimize the trade-off between model quality and training latency in IIoT federated learning, achieving at least 23.7% utility improvement without sacrificing accuracy under budget constraints.

Authors:Geng Sun, Jian Xiao, Jiahui Li, Jiacheng Wang, Jiawen Kang, Dusit Niyato, Shiwen Mao
Title: Aerial Reliable Collaborative Communications for Terrestrial Mobile Users via Evolutionary Multi-Objective Deep Reinforcement Learning
Abstract:
Unmanned aerial vehicles (UAVs) have emerged as the potential aerial base stations (BSs) to improve terrestrial communications. However, the limited onboard energy and antenna power of a UAV restrict its communication range and transmission capability. To address these limitations, this work employs collaborative beamforming through a UAV-enabled virtual antenna array to improve transmission performance from the UAV to terrestrial mobile users, under interference from non-associated BSs and dynamic channel conditions. Specifically, we introduce a memory-based random walk model to more accurately depict the mobility patterns of terrestrial mobile users. Following this, we formulate a multi-objective optimization problem (MOP) focused on maximizing the transmission rate while minimizing the flight energy consumption of the UAV swarm. Given the NP-hard nature of the formulated MOP and the highly dynamic environment, we transform this problem into a multi-objective Markov decision process and propose an improved evolutionary multi-objective reinforcement learning algorithm. Specifically, this algorithm introduces an evolutionary learning approach to obtain the approximate Pareto set for the formulated MOP. Moreover, the algorithm incorporates a long short-term memory network and hyper-sphere-based task selection method to discern the movement patterns of terrestrial mobile users and improve the diversity of the obtained Pareto set. Simulation results demonstrate that the proposed method effectively generates a diverse range of non-dominated policies and outperforms existing methods. Additional simulations demonstrate the scalability and robustness of the proposed CB-based method under different system parameters and various unexpected circumstances.
中文摘要:本研究通过无人机群协同波束成形提升空地传输性能,并采用改进的进化多目标强化学习算法,在动态环境中实现传输速率最大化与能耗最小化的多目标优化。
English Summary: This study enhances UAV-to-user transmission by employing collaborative beamforming in a UAV swarm and introduces an evolutionary multi-objective reinforcement learning algorithm to optimize transmission rates while minimizing energy consumption under dynamic conditions.

Authors:Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao
Title: Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs
Abstract:
Knowledge editing allows for efficient adaptation of large language models (LLMs) to new information or corrections without requiring full retraining. However, prior methods typically focus on either single-language editing or basic multilingual editing, failing to achieve true cross-linguistic knowledge synchronization. To address this, we present a simple and practical state-of-the-art (SOTA) recipe Cross-Lingual Knowledge Democracy Edit (X-KDE), designed to propagate knowledge from a dominant language to other languages effectively. Our X-KDE comprises two stages: (i) Cross-lingual Edition Instruction Tuning (XE-IT), which fine-tunes the model on a curated parallel dataset to modify in-scope knowledge while preserving unrelated information, and (ii) Target-language Preference Optimization (TL-PO), which applies advanced optimization techniques to ensure consistency across languages, fostering the transfer of updates. Additionally, we contribute a high-quality, cross-lingual dataset, specifically designed to enhance knowledge transfer across languages. Extensive experiments on the Bi-ZsRE and MzsRE benchmarks show that X-KDE significantly enhances cross-lingual performance, achieving an average improvement of +8.19%, while maintaining high accuracy in monolingual settings.
中文: 本研究提出了跨语言知识民主编辑方法(X-KDE),通过跨语言编辑指令调优和目标语言偏好优化两阶段设计,有效实现主要语言知识向其他语言的传播,在跨语言基准测试中平均性能提升8.19%,同时保持单语言场景的高准确度。
English: This study introduces Cross-Lingual Knowledge Democracy Edit (X-KDE), a two-stage method that enhances cross-lingual knowledge synchronization in large language models by propagating updates from a dominant language to others, achieving an average performance improvement of +8.19% on benchmarks while maintaining monolingual accuracy.

Authors:Peng Wang, Shengchao Hu, Zerui Tao, Guoxia Wang, Dianhai Yu, Li Shen, Quan Zheng, Dacheng Tao
Title: SeWA: Selective Weight Average via Probabilistic Masking
Abstract:
Weight averaging has become a standard technique for enhancing model performance. However, methods such as Stochastic Weight Averaging (SWA) and Latest Weight Averaging (LAWA) often require manually designed procedures to sample from the training trajectory, and the results depend heavily on hyperparameter tuning. To minimize human effort, this paper proposes a simple yet efficient algorithm called Selective Weight Averaging (SeWA), which adaptively selects checkpoints during the final stages of training for averaging. Based on SeWA, we show that only a few points are needed to achieve better generalization and faster convergence. Theoretically, solving the discrete subset selection problem is inherently challenging. To address this, we transform it into a continuous probabilistic optimization framework and employ the Gumbel-Softmax estimator to learn the non-differentiable mask for each checkpoint. Further, we theoretically derive the SeWA's stability-based generalization bounds, which are sharper than that of SGD under both convex and non-convex assumptions. Finally, solid extended experiments in various domains, including behavior cloning, image classification, and text classification, further validate the effectiveness of our approach.
中文: 本文提出选择性权重平均(SeWA)算法,能在训练后期自适应选择检查点进行平均,仅需少量点即可实现更好的泛化性能和更快收敛,无需大量人工调参。
English: This paper introduces Selective Weight Averaging (SeWA), an adaptive algorithm that automatically selects checkpoints in the final training phase for averaging, achieving superior generalization and faster convergence with minimal manual tuning.

Authors:Yuzhu Chen, Yingjie Wang, Shi Fu, Li Shen, Yongcheng Jing, Xinmei Tian, Dacheng Tao
Title: HRP: High-Rank Preheating for Superior LoRA Initialization
Abstract:
This paper studies the crucial impact of initialization in Low-Rank Adaptation (LoRA). Through theoretical analysis, we demonstrate that the fine-tuned result of LoRA is highly sensitive to initialization, which is likely to lead suboptimal low-rank results. While this issue can be mitigated by adjusting the initial direction towards the main singular vectors of the target $ΔW$, which is, however, typically unknown in real-world scenarios. To approximate this initial direction, we propose High-Rank Preheating (HRP), which first trains LoRA with a higher preheating rank for a few steps, then uses the main singular vectors of the derived $BA^\top$ as initialization for the main fine-tuning process. With only a modification in the initial direction, we prove that HRP makes LoRA achieve better fine-tuned results than random initialization in expectation, and the enhancement grows with the preheating rank. We validate our theoretical findings through extensive experiments in various models and tasks, where HRP significantly enhances LoRA's effectiveness and outperforms other initialization strategies and other LoRA variants.
中文: 本文揭示了低秩适应(LoRA)对初始化的敏感性,并提出高秩预热(HRP)方法,通过高秩预训练优化初始方向,显著提升了LoRA在不同模型和任务中的微调效果。
English: This paper highlights the sensitivity of Low-Rank Adaptation (LoRA) to initialization and introduces High-Rank Preheating (HRP), a method that improves fine-tuning results by optimizing initial directions through a higher-rank pre-training phase, validated across diverse models and tasks.

Authors:Qixin Zhang, Zongqi Wan, Yu Yang, Li Shen, Dacheng Tao
Title: Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency
Abstract:
Coordinating multiple agents to collaboratively maximize submodular functions in unpredictable environments is a critical task with numerous applications in machine learning, robot planning and control. The existing approaches, such as the OSG algorithm, are often hindered by their poor approximation guarantees and the rigid requirement for a fully connected communication graph. To address these challenges, we firstly present a $\textbf{MA-OSMA}$ algorithm, which employs the multi-linear extension to transfer the discrete submodular maximization problem into a continuous optimization, thereby allowing us to reduce the strict dependence on a complete graph through consensus techniques. Moreover, $\textbf{MA-OSMA}$ leverages a novel surrogate gradient to avoid sub-optimal stationary points. To eliminate the computationally intensive projection operations in $\textbf{MA-OSMA}$, we also introduce a projection-free $\textbf{MA-OSEA}$ algorithm, which effectively utilizes the KL divergence by mixing a uniform distribution. Theoretically, we confirm that both algorithms achieve a regret bound of $\widetilde{O}(\sqrt{\frac{C_{T}T}{1-β}})$ against a $(\frac{1-e^{-c}}{c})$-approximation to the best comparator in hindsight, where $C_{T}$ is the deviation of maximizer sequence, $β$ is the spectral gap of the network and $c$ is the joint curvature of submodular objectives. This result significantly improves the $(\frac{1}{1+c})$-approximation provided by the state-of-the-art OSG algorithm. Finally, we demonstrate the effectiveness of our proposed algorithms through simulation-based multi-target tracking.
Chinese: 本文提出了MA-OSMA算法及其无投影版本MA-OSEA,通过多线性扩展和共识技术显著提升了动态环境下多智能体协同优化子模函数的性能,在近似保证和通信要求方面均优于现有OSG算法。
English: This paper introduces two novel algorithms, MA-OSMA and its projection-free variant MA-OSEA, which significantly enhance multi-agent coordination for submodular maximization in dynamic environments by improving approximation guarantees and reducing communication constraints compared to existing methods like OSG.

Authors:Haoyu Wang, Zeyu Qin, Li Shen, Xueqian Wang, Dacheng Tao, Minhao Cheng
Title: Safety Reasoning with Guidelines
Abstract:
Training safe LLMs remains a critical challenge. The most widely used method, Refusal Training (RT), struggles to generalize against various Out-of-Distribution (OOD) jailbreaking attacks. Although various advanced methods have been proposed to address this issue, we instead question whether OOD attacks inherently surpass the capability of vanilla RT. Evaluations using Best-of-N (BoN) reveal significant safety improvements as N increases, indicating models possess adequate latent safety knowledge but RT fails to consistently elicit it under OOD scenarios. Further domain adaptation analysis reveals that direct RT causes reliance on superficial shortcuts, resulting in non-generalizable representation mappings. Inspired by our findings, we propose training model to perform safety reasoning for each query. Specifically, we synthesize reasoning supervision aligned with specified guidelines that reflect diverse perspectives on safety knowledge. This encourages model to engage in deeper reasoning, explicitly eliciting and utilizing latent safety knowledge for each query. Extensive experiments show that our method significantly improves model generalization against OOD attacks.
中文: 本研究通过揭示模型具备潜在安全知识,挑战了分布外越狱攻击必然超越基础拒绝训练的假设,并提出一种安全推理方法,通过合成对齐的监督指导来显著提升模型对此类攻击的泛化能力。
English: This study challenges the assumption that Out-of-Distribution jailbreaking attacks inherently surpass vanilla Refusal Training by revealing that models possess latent safety knowledge, and proposes a safety reasoning method that synthesizes aligned supervision to significantly enhance generalization against such attacks.

Authors:Yuheng Chen, Pengfei Cao, Kang Liu, Jun Zhao
Title: The Knowledge Microscope: Features as Better Analytical Lenses than Neurons
Abstract:
Previous studies primarily utilize MLP neurons as units of analysis for understanding the mechanisms of factual knowledge in Language Models (LMs); however, neurons suffer from polysemanticity, leading to limited knowledge expression and poor interpretability. In this paper, we first conduct preliminary experiments to validate that Sparse Autoencoders (SAE) can effectively decompose neurons into features, which serve as alternative analytical units. With this established, our core findings reveal three key advantages of features over neurons: (1) Features exhibit stronger influence on knowledge expression and superior interpretability. (2) Features demonstrate enhanced monosemanticity, showing distinct activation patterns between related and unrelated facts. (3) Features achieve better privacy protection than neurons, demonstrated through our proposed FeatureEdit method, which significantly outperforms existing neuron-based approaches in erasing privacy-sensitive information from LMs.Code and dataset will be available.
中文摘要:本研究证明稀疏自编码器可将神经元分解为特征,这些特征在知识表达、可解释性、单义性和通过FeatureEdit方法实现的隐私保护方面均优于神经元。
English Summary: This study demonstrates that Sparse Autoencoders can decompose neurons into features, which surpass neurons in knowledge expression, interpretability, monosemanticity, and privacy protection through the proposed FeatureEdit method.

Authors:Huanxuan Liao, Shizhu He, Yupu Hao, Jun Zhao, Kang Liu
Title: DATA: Decomposed Attention-based Task Adaptation for Rehearsal-Free Continual Learning
Abstract:
Continual learning (CL) is essential for Large Language Models (LLMs) to adapt to evolving real-world demands, yet they are susceptible to catastrophic forgetting (CF). While traditional CF solutions rely on expensive data rehearsal, recent rehearsal-free methods employ model-based and regularization-based strategies to address this issue. However, these approaches often neglect the model's plasticity, which is crucial to achieving optimal performance on newly learned tasks. Consequently, a key challenge in CL is striking a balance between preserving plasticity and mitigating CF. To tackle this challenge, we propose the $\textbf{D}$ecomposed $\textbf{A}$ttention-based $\textbf{T}$ask $\textbf{A}$daptation (DATA), which explicitly decouples and learns both task-specific and task-shared knowledge using high-rank and low-rank task adapters (e.g., LoRAs). For new tasks, DATA dynamically adjusts the weights of adapters of different ranks based on their relevance and distinction from previous tasks, allowing the model to acquire new task-specific skills while effectively retaining previously learned knowledge. Specifically, we implement a decomposed component weighting strategy comprising learnable components that collectively generate attention-based weights, allowing the model to integrate and utilize diverse knowledge from each DATA. Extensive experiments on three widely used benchmarks demonstrate that our proposed method achieves state-of-the-art performance. Notably, our approach significantly enhances model plasticity and mitigates CF by extending learnable components and employing stochastic restoration during training iterations.
中文摘要:提出的分解注意力任务适应(DATA)方法通过基于任务相关性动态调整高秩与低秩任务适配器的权重,在持续学习中有效平衡模型可塑性并缓解灾难性遗忘,在多个基准测试中实现了最优性能。
English Summary: The proposed Decomposed Attention-based Task Adaptation (DATA) method effectively balances plasticity and mitigates catastrophic forgetting in continual learning by dynamically weighting high-rank and low-rank task adapters based on task relevance, achieving state-of-the-art performance across benchmarks.

Authors:Kun Luo, Zheng Liu, Peitian Zhang, Hongjin Qian, Jun Zhao, Kang Liu
Title: Does RAG Really Perform Bad For Long-Context Processing?
Abstract:
The efficient processing of long context poses a serious challenge for large language models (LLMs). Recently, retrieval-augmented generation (RAG) has emerged as a promising strategy for this problem, as it enables LLMs to make selective use of the long context for efficient computation. However, existing RAG approaches lag behind other long-context processing methods due to inherent limitations on inaccurate retrieval and fragmented contexts. To address these challenges, we introduce RetroLM, a novel RAG framework for long-context processing. Unlike traditional methods, RetroLM employs KV-level retrieval augmentation, where it partitions the LLM's KV cache into contiguous pages and retrieves the most crucial ones for efficient computation. This approach enhances robustness to retrieval inaccuracy, facilitates effective utilization of fragmented contexts, and saves the cost from repeated computation. Building on this framework, we further develop a specialized retriever for precise retrieval of critical pages and conduct unsupervised post-training to optimize the model's ability to leverage retrieved information. We conduct comprehensive evaluations with a variety of benchmarks, including LongBench, InfiniteBench, and RULER, where RetroLM significantly outperforms existing long-context LLMs and efficient long-context processing methods, particularly in tasks requiring intensive reasoning or extremely long-context comprehension.
中文: RetroLM提出了一种新颖的检索增强生成框架,通过KV级检索机制提升大语言模型的长文本处理能力,在增强鲁棒性和效率的同时,在复杂推理和超长文本理解任务中显著优于现有方法。
English: RetroLM introduces a novel retrieval-augmented generation framework that enhances long-context processing in LLMs by employing KV-level retrieval, improving robustness and efficiency while outperforming existing methods in reasoning and comprehension tasks.

Authors:Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen
Title: Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models
Abstract:
Visual instruction tuning has become the predominant technology in eliciting the multimodal task-solving capabilities of large vision-language models (LVLMs). Despite the success, as visual instructions require images as the input, it would leave the gap in inheriting the task-solving capabilities from the backbone LLMs, and make it costly to collect a large-scale dataset. To address it, we propose ViFT, a visual instruction-free fine-tuning framework for LVLMs. In ViFT, we only require the text-only instructions and image caption data during training, to separately learn the task-solving and visual perception abilities. During inference, we extract and combine the representations of the text and image inputs, for fusing the two abilities to fulfill multimodal tasks. Experimental results demonstrate that ViFT can achieve state-of-the-art performance on several visual reasoning and visual instruction following benchmarks, with rather less training data. Our code and data will be publicly released.
中文摘要:提出的ViFT框架通过纯文本指令和图像描述训练大型视觉语言模型,以较少训练数据实现了多模态任务处理能力的先进性能。
English Summary: The proposed ViFT framework enables large vision-language models to acquire multimodal task-solving capabilities through text-only instructions and image captions, achieving state-of-the-art performance with reduced training data.

Authors:Yinqiu Liu, Ruichen Zhang, Jiacheng Wang, Dusit Niyato, Xianbin Wang, Dong In Kim, Hongyang Du
Title: Intelligent Mobile AI-Generated Content Services via Interactive Prompt Engineering and Dynamic Service Provisioning
Abstract:
Due to massive computational demands of large generative models, AI-Generated Content (AIGC) can organize collaborative Mobile AIGC Service Providers (MASPs) at network edges to provide ubiquitous and customized content generation for resource-constrained users. However, such a paradigm faces two significant challenges: 1) raw prompts (i.e., the task description from users) often lead to poor generation quality due to users' lack of experience with specific AIGC models, and 2) static service provisioning fails to efficiently utilize computational and communication resources given the heterogeneity of AIGC tasks. To address these challenges, we propose an intelligent mobile AIGC service scheme. Firstly, we develop an interactive prompt engineering mechanism that leverages a Large Language Model (LLM) to generate customized prompt corpora and employs Inverse Reinforcement Learning (IRL) for policy imitation through small-scale expert demonstrations. Secondly, we formulate a dynamic mobile AIGC service provisioning problem that jointly optimizes the number of inference trials and transmission power allocation. Then, we propose the Diffusion-Enhanced Deep Deterministic Policy Gradient (D3PG) algorithm to solve the problem. By incorporating the diffusion process into Deep Reinforcement Learning (DRL) architecture, the environment exploration capability can be improved, thus adapting to varying mobile AIGC scenarios. Extensive experimental results demonstrate that our prompt engineering approach improves single-round generation success probability by 6.3 times, while D3PG increases the user service experience by 67.8% compared to baseline DRL approaches.
中文摘要:针对移动AI生成内容服务的挑战,本研究提出了一种结合大型语言模型和逆强化学习的交互式提示工程机制,以及采用扩散增强深度强化学习算法优化的动态服务配置方案,显著提升了单轮生成成功率和用户服务体验。
English Summary: To enhance mobile AI-generated content services, this study introduces an interactive prompt engineering mechanism using large language models and inverse reinforcement learning, alongside a dynamic service provisioning approach optimized by a novel diffusion-enhanced deep reinforcement learning algorithm that significantly improves generation success rates and user experience.

Authors:Xin Zhou, Yiwen Guo, Ruotian Ma, Tao Gui, Qi Zhang, Xuanjing Huang
Title: Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models
Abstract:
Aligning Large Language Models (LLMs) with human preferences is crucial for their deployment in real-world applications. Recent advancements in Self-Rewarding Language Models suggest that an LLM can use its internal reward models (such as LLM-as-a-Judge) \cite{yuanself} to generate preference data, improving alignment performance without costly human annotation. However, we find that different internal reward models within the same LLM often generate inconsistent preferences. This inconsistency raises concerns about the reliability of self-generated preference data, hinders overall alignment performance, and highlights the need for further research to ensure reliable and coherent alignment with human preferences. To address this limitation, we propose Self-Consistent Internal Rewards (SCIR), a novel framework designed to enhance consistency among internal reward models during training. In each training step, we collect preference predictions from multiple pre-defined internal reward models and enforce consistency and confidence through an inconsistency penalty mechanism, thereby improving the reliability of these internal reward models. We selectively use data with consistent predictions for preference optimization, ensuring the quality of the preference data. By employing self-consistent internal rewards, our method significantly improves the alignment performance and reward modeling capability of LLMs, outperforming baseline methods by a notable margin.
中文摘要:SCIR框架通过不一致性惩罚机制提升内部奖励模型的一致性,从而显著提高大语言模型的对齐性能,明显优于基准方法。
English Summary: The SCIR framework enhances LLM alignment by improving consistency among internal reward models through an inconsistency penalty mechanism, significantly boosting performance over baseline methods.

Authors:Changhao Jiang, Ming Zhang, Junjie Ye, Xiaoran Fan, Yifei Cao, Jiajun Sun, Zhiheng Xi, Shihan Dou, Yi Dong, Yujiong Shen, Jingqi Tong, Baoyu Fan, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Qi Zhang, Tao Gui, Xuanjing Huang
Title: Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training
Abstract:
The GPT-4 technical report highlights the possibility of predicting model performance on downstream tasks using only pre-training signals, though detailed methodologies are absent. Such predictive capabilities are essential for resource-efficient pre-training and the construction of task-aligned datasets. In this paper, we aim to predict performance in closed-book question answering (QA), a vital downstream task that directly reflects a model's internalized knowledge without the help of external tools. We address three primary challenges: (1) limited access to and understanding of pre-training corpora, (2) limitations of current evaluation methods for pre-trained models, and (3) limitations of frequency-based metrics in predicting model performance. In response, we conduct large-scale retrieval and semantic analysis across the pre-training corpora of 21 publicly available and 3 custom-trained large language models. We then develop a multi-template QA evaluation framework incorporating paraphrased question variants. Building on these foundations, we propose Size-dependent Mutual Information (SMI), an information-theoretic metric that linearly correlates pre-training data characteristics, model size, and QA accuracy, without requiring additional training. Experimental results show that SMI outperforms co-occurrence-based baselines, achieving $R^2 > 0.75$ on models with over one billion parameters. Theoretical analysis further suggests an upper bound of around 80% QA accuracy under optimal pre-training, reflecting intrinsic memory limitations and motivating the use of retrieval or few-shot methods in later stages.
Chinese: 本研究提出规模依赖互信息(SMI)这一新指标,通过关联预训练数据特征与模型规模来有效预测闭卷问答性能,实验成果显著并揭示了模型固有的记忆局限性。
English: This study introduces Size-dependent Mutual Information (SMI), a novel metric that effectively predicts closed-book question answering performance by correlating pre-training data characteristics with model size, achieving strong experimental results and revealing inherent memory limitations.

Authors:Wangtao Sun, Haotian Xu, Huanxuan Liao, Xuanqing Yu, Zhongtao Jiang, Shizhu He, Jun Zhao, Kang Liu
Title: Shuttle Between the Instructions and the Parameters of Large Language Models
Abstract:
The interaction with Large Language Models (LLMs) through instructions has been extensively investigated in the research community. While instructions have been widely used as the guidelines for task solving, this paper further notices that both instructions and parameters are the compression of task data. Therefore, they could be strongly correlated and can be learned to predict one from the other. This paper proposes a novel neural network framework, SHIP (\textbf{Sh}uttle between the \textbf{I}nstructions and the \textbf{P}arameters), to model and learn the mutual mappings between the instructions and the parameters of LLMs. We verify that SHIP can effectively map one of the instructions/parameters to the other by evaluating it on the tasks of instruction deduction and induction. The results show that SHIP performs better than existing baseline methods in terms of deductive capabilities while significantly surpassing them in inductive capabilities. Moreover, SHIP can effectively combine the two mapping processes to perform excellent inductive reasoning. The code and data for this paper are released at https://anonymous.4open.science/r/Shuttle-Between-Instructions-Parameters/.
中文摘要:本文提出SHIP神经网络框架,通过建模大语言模型中指令与参数间的相互映射关系,在指令推导与归纳任务上展现出优于基线方法的演绎和归纳能力。
English Summary: This paper introduces SHIP, a neural network framework that models the mutual mappings between instructions and parameters in Large Language Models, demonstrating superior performance in both deductive and inductive tasks compared to baseline methods.

Authors:Ahmed Heakl, Sara Ghaboura, Omkar Thawkar, Fahad Shahbaz Khan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan
Title: AIN: The Arabic INclusive Large Multimodal Model
Abstract:
Amid the swift progress of large language models (LLMs) and their evolution into large multimodal models (LMMs), significant strides have been made in high-resource languages such as English and Chinese. While Arabic LLMs have seen notable progress, Arabic LMMs remain largely unexplored, often narrowly focusing on a few specific aspects of the language and visual understanding. To bridge this gap, we introduce AIN-the Arabic Inclusive Multimodal Model-designed to excel across diverse domains. AIN is an English-Arabic bilingual LMM designed to excel in English and Arabic, leveraging carefully constructed 3.6 million high-quality Arabic-English multimodal data samples. AIN demonstrates state-of-the-art Arabic performance, while also possessing strong English-language visual capabilities. On the recent CAMEL-Bench benchmark comprising 38 sub-domains including, multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, our AIN demonstrates strong performance with the 7B model outperforming GPT-4o by an absolute gain of 3.4% averaged over eight domains and 38 sub-domains. AIN's superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools across diverse applications.
中文: AIN作为阿拉伯语-英语双语多模态模型,在38个领域实现顶尖性能,以3.4%优势超越GPT-4o,有效填补了阿拉伯语多模态人工智能的研究空白。
English: The AIN model is a bilingual Arabic-English multimodal system that achieves state-of-the-art performance across 38 domains, outperforming GPT-4o by 3.4% while addressing the underdevelopment of Arabic multimodal AI.

Authors:Jinghao Feng, Qiaoyu Zheng, Chaoyi Wu, Ziheng Zhao, Ya Zhang, Yanfeng Wang, Weidi Xie
Title: M^3Builder: A Multi-Agent System for Automated Machine Learning in Medical Imaging
Abstract:
Agentic AI systems have gained significant attention for their ability to autonomously perform complex tasks. However, their reliance on well-prepared tools limits their applicability in the medical domain, which requires to train specialized models. In this paper, we make three contributions: (i) We present M3Builder, a novel multi-agent system designed to automate machine learning (ML) in medical imaging. At its core, M3Builder employs four specialized agents that collaborate to tackle complex, multi-step medical ML workflows, from automated data processing and environment configuration to self-contained auto debugging and model training. These agents operate within a medical imaging ML workspace, a structured environment designed to provide agents with free-text descriptions of datasets, training codes, and interaction tools, enabling seamless communication and task execution. (ii) To evaluate progress in automated medical imaging ML, we propose M3Bench, a benchmark comprising four general tasks on 14 training datasets, across five anatomies and three imaging modalities, covering both 2D and 3D data. (iii) We experiment with seven state-of-the-art large language models serving as agent cores for our system, such as Claude series, GPT-4o, and DeepSeek-V3. Compared to existing ML agentic designs, M3Builder shows superior performance on completing ML tasks in medical imaging, achieving a 94.29% success rate using Claude-3.7-Sonnet as the agent core, showing huge potential towards fully automated machine learning in medical imaging.
中文: 本文提出M3Builder多智能体系统,通过专业代理自动化医学影像中的机器学习流程,实现了94.29%的任务成功率,展现出在该领域实现全自动机器学习的巨大潜力。
English: This paper introduces M3Builder, a multi-agent system that automates machine learning workflows in medical imaging through specialized agents, achieving a 94.29% success rate and demonstrating strong potential for fully automated ML in this field.

Authors:Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui
Title: Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
Abstract:
Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) exhibit significant cost disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, we remove RoPE from dimensions of queries and keys that contribute less to the attention scores, for low-rank approximation, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. These carefully designed strategies enable MHA2MLA to recover performance using only a small fraction (0.3% to 0.6%) of the data, significantly reducing inference costs while seamlessly integrating with compression techniques such as KV cache quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 0.5% drop in LongBench performance.
中文: DeepSeek提出的多头潜在注意力(MLA)通过将KV缓存压缩为潜在向量实现高效推理,而MHA2MLA微调方法仅需少量数据即可使大语言模型从MHA迁移至MLA,在保持性能的同时显著降低推理成本。
English: DeepSeek's Multi-head Latent Attention (MLA) compresses the KV cache into a latent vector for efficient inference, and the proposed MHA2MLA fine-tuning method enables LLMs to transition from MHA to MLA using minimal data while maintaining performance and significantly reducing costs.

Authors:Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui
Title: Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
Abstract:
Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) exhibit significant cost disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, we remove RoPE from dimensions of queries and keys that contribute less to the attention scores, for low-rank approximation, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. These carefully designed strategies enable MHA2MLA to recover performance using only a small fraction (0.3% to 0.6%) of the data, significantly reducing inference costs while seamlessly integrating with compression techniques such as KV cache quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 0.5% drop in LongBench performance.
中文: DeepSeek提出的多头潜在注意力(MLA)通过将KV缓存压缩为潜在向量实现高效推理,而MHA2MLA微调方法仅需少量数据即可使大语言模型从MHA迁移至MLA,在保持性能的同时显著降低推理成本。
English: DeepSeek's Multi-head Latent Attention (MLA) compresses the KV cache into a latent vector for efficient inference, and the proposed MHA2MLA fine-tuning method enables LLMs to transition from MHA to MLA using minimal data while maintaining performance and significantly reducing costs.

Authors:Haicheng Wang, Chen Ju, Weixiong Lin, Chaofan Ma, Shuai Xiao, Ya Zhang, Yanfeng Wang
Title: Contrast-Unity for Partially-Supervised Temporal Sentence Grounding
Abstract:
Temporal sentence grounding aims to detect event timestamps described by the natural language query from given untrimmed videos. The existing fully-supervised setting achieves great results but requires expensive annotation costs; while the weakly-supervised setting adopts cheap labels but performs poorly. To pursue high performance with less annotation costs, this paper introduces an intermediate partially-supervised setting, i.e., only short-clip is available during training. To make full use of partial labels, we specially design one contrast-unity framework, with the two-stage goal of implicit-explicit progressive grounding. In the implicit stage, we align event-query representations at fine granularity using comprehensive quadruple contrastive learning: event-query gather, event-background separation, intra-cluster compactness and inter-cluster separability. Then, high-quality representations bring acceptable grounding pseudo-labels. In the explicit stage, to explicitly optimize grounding objectives, we train one fully-supervised model using obtained pseudo-labels for grounding refinement and denoising. Extensive experiments and thoroughly ablations on Charades-STA and ActivityNet Captions demonstrate the significance of partial supervision, as well as our superior performance.
中文: 本文提出了一种部分监督的时间语句定位框架,通过短片段标注在降低标注成本的同时实现高性能,采用对比统一方法进行隐式表征对齐和显式伪标签优化。
English: This paper introduces a partially-supervised framework for temporal sentence grounding that uses short-clip annotations to achieve high performance with reduced labeling costs, employing a contrast-unity approach with implicit representation alignment and explicit pseudo-label refinement.

Authors:Kento Kawaharazuka, Manabu Nishiura, Shinsuke Nakashima, Yasunori Toshimitsu, Yusuke Omura, Yuya Koga, Yuki Asano, Kei Okada, Koji Kawasaki, Masayuki Inaba
Title: Stability Recognition with Active Vibration for Bracing Behaviors and Motion Extensions Using Environment in Musculoskeletal Humanoids
Abstract:
Although robots with flexible bodies are superior in terms of the contact and adaptability, it is difficult to control them precisely. On the other hand, human beings make use of the surrounding environments to stabilize their bodies and control their movements. In this study, we propose a method for the bracing motion and extension of the range of motion using the environment for the musculoskeletal humanoid. Here, it is necessary to recognize the stability of the body when contacting the environment, and we develop a method to measure it by using the change in sensor values of the body when actively vibrating a part of the body. Experiments are conducted using the musculoskeletal humanoid Musashi, and the effectiveness of this method is confirmed.
中文: 本研究提出一种方法,使肌肉骨骼类人机器人能通过接触环境表面来增强稳定性和扩展运动范围,利用主动振动身体部位时传感器数值的变化来测量稳定性,并在人形机器人Musashi上的实验验证了该方法的有效性。
English: This study proposes a method for musculoskeletal humanoids to enhance stability and extend their range of motion by bracing against environmental surfaces, using active body vibrations to measure stability through sensor value changes, with experiments on the humanoid Musashi confirming its effectiveness.

Authors:Kento Kawaharazuka, Kei Tsuzuki, Moritaka Onitsuka, Yuya Koga, Yusuke Omura, Yuki Asano, Kei Okada, Koji Kawasaki, Masayuki Inaba
Title: Reflex-based Motion Strategy of Musculoskeletal Humanoids under Environmental Contact Using Muscle Relaxation Control
Abstract:
The musculoskeletal humanoid can move well under environmental contact thanks to its body softness. However, there are few studies that actively make use of the environment to rest its flexible musculoskeletal body. Also, its complex musculoskeletal structure is difficult to modelize and high internal muscle tension sometimes occurs. To solve these problems, we develop a muscle relaxation control which can minimize the muscle tension by actively using the environment and inhibit useless internal muscle tension. We apply this control to some basic movements, the motion of resting the arms on the desk, and handle operation, and verify its effectiveness.
中文: 本研究提出了一种肌肉松弛控制方法,通过主动利用环境接触来最小化肌肉骨骼仿人机器人的肌肉张力,有效解决了复杂建模和内部张力的问题。
English: This study introduces a muscle relaxation control method that minimizes muscle tension in musculoskeletal humanoids by actively utilizing environmental contact, effectively addressing issues of complex modeling and internal tension.

Authors:Kento Kawaharazuka, Naoki Hiraoka, Yuya Koga, Manabu Nishiura, Yusuke Omura, Yuki Asano, Kei Okada, Koji Kawasaki, Masayuki Inaba
Title: Online Learning of Danger Avoidance for Complex Structures of Musculoskeletal Humanoids and Its Applications
Abstract:
The complex structure of musculoskeletal humanoids makes it difficult to model them, and the inter-body interference and high internal muscle force are unavoidable. Although various safety mechanisms have been developed to solve this problem, it is important not only to deal with the dangers when they occur but also to prevent them from happening. In this study, we propose a method to learn a network outputting danger probability corresponding to the muscle length online so that the robot can gradually prevent dangers from occurring. Applications of this network for control are also described. The method is applied to the musculoskeletal humanoid, Musashi, and its effectiveness is verified.
中文: 本研究提出了一种在线学习方法,通过神经网络根据肌肉长度预测危险概率,使肌肉骨骼类人机器人如Musashi能够主动预防而非仅应对风险,并通过应用验证了其有效性。
English: This study proposes an online method for learning a network that predicts danger probability based on muscle length, enabling musculoskeletal humanoids like Musashi to proactively prevent risks rather than just responding to them, with its effectiveness validated through application.

Authors:Kento Kawaharazuka, Yuya Koga, Kei Tsuzuki, Moritaka Onitsuka, Yuki Asano, Kei Okada, Koji Kawasaki, Masayuki Inaba
Title: Applications of Stretch Reflex for the Upper Limb of Musculoskeletal Humanoids: Protective Behavior, Postural Stability, and Active Induction
Abstract:
The musculoskeletal humanoid has various biomimetic benefits, and it is important that we can embed and evaluate human reflexes in the actual robot. Although stretch reflex has been implemented in lower limbs of musculoskeletal humanoids, we apply it to the upper limb to discover its useful applications. We consider the implementation of stretch reflex in the actual robot, its active/passive applications, and the change in behavior according to the difference of parameters.
中文: 本研究在肌肉骨骼仿人机器人的上肢中实现拉伸反射,以探索其实际应用,并考察其实现方式、主动与被动用途以及参数差异引起的行为变化。
English: This study implements stretch reflex in the upper limbs of a musculoskeletal humanoid to explore its practical applications, examining its implementation, active and passive uses, and behavioral changes based on parameter variations.

Authors:Kento Kawaharazuka, Yuya Koga, Kei Tsuzuki, Moritaka Onitsuka, Yuki Asano, Kei Okada, Koji Kawasaki, Masayuki Inaba
Title: Exceeding the Maximum Speed Limit of the Joint Angle for the Redundant Tendon-driven Structures of Musculoskeletal Humanoids
Abstract:
The musculoskeletal humanoid has various biomimetic benefits, and the redundant muscle arrangement is one of its most important characteristics. This redundancy can achieve fail-safe redundant actuation and variable stiffness control. However, there is a problem that the maximum joint angle velocity is limited by the slowest muscle among the redundant muscles. In this study, we propose two methods that can exceed the limited maximum joint angle velocity, and verify the effectiveness with actual robot experiments.
中文: 本研究针对肌肉骨骼机器人冗余肌肉中最慢肌肉限制关节最大角速度的问题,提出了两种突破该限制的方法,并通过实际机器人实验验证了其有效性。
English: This study proposes two methods to overcome the limitation of maximum joint angle velocity caused by the slowest muscle in redundant muscle arrangements of musculoskeletal humanoids, validating their effectiveness through actual robot experiments.

Authors:Kento Kawaharazuka, Yasunori Toshimitsu, Manabu Nishiura, Yuya Koga, Yusuke Omura, Yuki Asano, Kei Okada, Koji Kawasaki, Masayuki Inaba
Title: Design Optimization of Musculoskeletal Humanoids with Maximization of Redundancy to Compensate for Muscle Rupture
Abstract:
Musculoskeletal humanoids have various biomimetic advantages, and the redundant muscle arrangement allowing for variable stiffness control is one of the most important. In this study, we focus on one feature of the redundancy, which enables the humanoid to keep moving even if one of its muscles breaks, an advantage that has not been dealt with in many studies. In order to make the most of this advantage, the design of muscle arrangement is optimized by considering the maximization of minimum available torque that can be exerted when one muscle breaks. This method is applied to the elbow of a musculoskeletal humanoid Musashi with simulations, the design policy is extracted from the optimization results, and its effectiveness is confirmed with the actual robot.
中文: 本研究优化了肌肉骨骼人形机器人的肌肉布局,以在单块肌肉失效时最大化可用扭矩,并通过仿真和实际机器人Musashi的肘部实验验证了该方法的有效性。
English: This study optimizes muscle arrangement in musculoskeletal humanoids to maximize torque availability after muscle failure, validating the approach through simulations and real-world testing on the Musashi robot's elbow.

Authors:Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li
Title: DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance
Abstract:
To alleviate memory burden during inference of large language models (LLMs), numerous studies have focused on compressing the KV cache by exploring aspects such as attention sparsity. These techniques are often designed with a pre-defined KV budget; however, as the optimal budget varies by different input lengths and task types, the existence of a fixed budget could result in inconsistent performance accepting inputs of diverse domains. To address this limitation, we propose a new KV cache compression objective: to always ensure the full-cache performance regardless of specific inputs, while maximizing KV cache pruning as much as possible. To achieve this goal, we introduce a novel KV cache compression method dubbed DBudgetKV, which features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance, then halting the pruning process. Empirical evaluation spanning diverse context lengths, task types, and model sizes suggests that our method achieves lossless KV pruning effectively and robustly, exceeding 25% compression ratio on average. Furthermore, our method is easy to integrate within LLM inference, not only optimizing memory space, but also showing reduced inference time compared to existing methods.
中文摘要:针对大语言模型中固定KV缓存预算的局限,我们提出DBudgetKV压缩方法,通过性能监测机制在可能影响性能时停止压缩,确保全缓存性能的同时实现最大化剪枝,平均压缩率超25%且降低推理耗时。
English Summary: To address the limitations of fixed KV cache budgets in large language models, we propose DBudgetKV, a compression method that ensures full-cache performance while maximizing pruning by halting compression when performance may degrade, achieving over 25% average compression with reduced inference time.

Authors:Bowen Ping, Jiali Zeng, Fandong Meng, Shuo Wang, Jie Zhou, Shanghang Zhang
Title: LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information
Abstract:
Long-form generation is crucial for academic writing papers and repo-level code generation. Despite this, current models, including GPT-4o, still exhibit unsatisfactory performance. Existing methods that utilize preference learning with outcome supervision often fail to provide detailed feedback for extended contexts. This shortcoming can lead to content that does not fully satisfy query requirements, resulting in issues like length deviations, and diminished quality. In this paper, we propose enhancing long-form generation by incorporating process supervision. We employ Monte Carlo Tree Search to gather stepwise preference pairs, utilizing a global memory pool to maintain consistency. To address the issue of suboptimal candidate selection, we integrate external critiques to refine and improve the quality of the preference pairs. Finally, we apply step-level DPO using the collected stepwise preference pairs. Experimental results show that our method improves length and quality on long-form generation benchmarks, with almost lossless performance on general benchmarks across various model backbones.
中文摘要: 本文提出通过过程监督增强长文本生成,采用蒙特卡洛树搜索结合全局记忆池和外部评估来优化逐步偏好对,实验表明该方法在长文本基准上提升了生成长度和质量,同时在不同模型架构上保持通用基准的近乎无损性能。
English Summary: This paper proposes enhancing long-form generation through process supervision using Monte Carlo Tree Search with global memory and external critiques to improve stepwise preference pairs, demonstrating improved length and quality on specialized benchmarks while maintaining general performance.

Authors:Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Jie Zhou
Title: DeepRAG: Thinking to Retrieve Step by Step for Large Language Models
Abstract:
Large Language Models (LLMs) have shown remarkable reasoning capabilities, while their practical applications are limited by severe factual hallucinations due to limitations in the timeliness, accuracy, and comprehensiveness of their parametric knowledge. Meanwhile, enhancing retrieval-augmented generation (RAG) with reasoning remains challenging due to ineffective task decomposition and redundant retrieval, which can introduce noise and degrade response quality. In this paper, we propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP), enabling reasonable and adaptive retrieval. By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step. Experiments show that DeepRAG improves retrieval efficiency and boosts answer accuracy by 26.4%, demonstrating its effectiveness in enhancing retrieval-augmented reasoning.
中文: DeepRAG提出了一种将检索增强推理建模为马尔可夫决策过程的框架,通过自适应检索和查询分解提高了效率,并使答案准确率提升了26.4%。
English: DeepRAG introduces a framework that models retrieval-augmented reasoning as a Markov Decision Process, enabling adaptive retrieval and query decomposition to improve efficiency and boost answer accuracy by 26.4%.

Authors:Yan Yu, Wengang Zhou, Yaodong Yang, Wanxuan Lu, Yingyan Hou, Houqiang Li
Title: Model Evolution Framework with Genetic Algorithm for Multi-Task Reinforcement Learning
Abstract:
Multi-task reinforcement learning employs a single policy to complete various tasks, aiming to develop an agent with generalizability across different scenarios. Given the shared characteristics of tasks, the agent's learning efficiency can be enhanced through parameter sharing. Existing approaches typically use a routing network to generate specific routes for each task and reconstruct a set of modules into diverse models to complete multiple tasks simultaneously. However, due to the inherent difference between tasks, it is crucial to allocate resources based on task difficulty, which is constrained by the model's structure. To this end, we propose a Model Evolution framework with Genetic Algorithm (MEGA), which enables the model to evolve during training according to the difficulty of the tasks. When the current model is insufficient for certain tasks, the framework will automatically incorporate additional modules, enhancing the model's capabilities. Moreover, to adapt to our model evolution framework, we introduce a genotype module-level model, using binary sequences as genotype policies for model reconstruction, while leveraging a non-gradient genetic algorithm to optimize these genotype policies. Unlike routing networks with fixed output dimensions, our approach allows for the dynamic adjustment of the genotype policy length, enabling it to accommodate models with a varying number of modules. We conducted experiments on various robotics manipulation tasks in the Meta-World benchmark. Our state-of-the-art performance demonstrated the effectiveness of the MEGA framework. We will release our source code to the public.
中文:MEGA框架采用遗传算法驱动的模型进化方法,当任务超出当前能力时动态增强多任务强化学习策略,在机器人操作基准测试中实现了最优性能。
English: The MEGA framework introduces a genetic algorithm-driven model evolution approach that dynamically enhances a multi-task reinforcement learning policy by adding modules when tasks exceed current capabilities, achieving state-of-the-art performance in robotics manipulation benchmarks.

Authors:Jianfeng Cai, Jinhua Zhu, Ruopei Sun, Yue Wang, Li Li, Wengang Zhou, Houqiang Li
Title: Disentangling Length Bias In Preference Learning Via Response-Conditioned Modeling
Abstract:
Reinforcement Learning from Human Feedback (RLHF) has achieved considerable success in aligning large language models (LLMs) by modeling human preferences with a learnable reward model and employing a reinforcement learning algorithm to maximize the reward model's scores. However, these reward models are susceptible to exploitation through various superficial confounding factors, with length bias emerging as a particularly significant concern. Moreover, while the pronounced impact of length bias on preference modeling suggests that LLMs possess an inherent sensitivity to length perception, our preliminary investigations reveal that fine-tuned LLMs consistently struggle to adhere to explicit length instructions. To address these two limitations, we propose a novel framework wherein the reward model explicitly differentiates between human semantic preferences and response length requirements. Specifically, we introduce a $\textbf{R}$esponse-$\textbf{c}$onditioned $\textbf{B}$radley-$\textbf{T}$erry (Rc-BT) model that enhances the model's capability in length bias mitigating and length instruction following, through training on our augmented dataset. Furthermore, we propose the Rc-RM and Rc-DPO algorithm to leverage the Rc-BT model for reward modeling and direct policy optimization (DPO) of LLMs, simultaneously mitigating length bias and promoting adherence to length instructions. Extensive experiments across various foundational models and datasets demonstrate the effectiveness and generalizability of our approach.
中文: 该框架通过响应条件Bradley-Terry模型及配套算法,在人类反馈强化学习中明确区分语义偏好与长度要求,有效缓解长度偏差并提升模型对长度指令的遵循能力,在多模型和数据集上验证了其有效性。
English: The proposed framework introduces a Response-conditioned Bradley-Terry model and corresponding algorithms to address length bias in reinforcement learning from human feedback by explicitly separating semantic preferences from length requirements, improving both bias mitigation and instruction adherence across models and datasets.

Authors:Yi Jing, Zijun Yao, Hongzhu Guo, Lingxu Ran, Xiaozhi Wang, Lei Hou, Juanzi Li
Title: LinguaLens: Towards Interpreting Linguistic Mechanisms of Large Language Models via Sparse Auto-Encoder
Abstract:
Large language models (LLMs) demonstrate exceptional performance on tasks requiring complex linguistic abilities, such as reference disambiguation and metaphor recognition/generation. Although LLMs possess impressive capabilities, their internal mechanisms for processing and representing linguistic knowledge remain largely opaque. Prior research on linguistic mechanisms is limited by coarse granularity, limited analysis scale, and narrow focus. In this study, we propose LinguaLens, a systematic and comprehensive framework for analyzing the linguistic mechanisms of large language models, based on Sparse Auto-Encoders (SAEs). We extract a broad set of Chinese and English linguistic features across four dimensions (morphology, syntax, semantics, and pragmatics). By employing counterfactual methods, we construct a large-scale counterfactual dataset of linguistic features for mechanism analysis. Our findings reveal intrinsic representations of linguistic knowledge in LLMs, uncover patterns of cross-layer and cross-lingual distribution, and demonstrate the potential to control model outputs. This work provides a systematic suite of resources and methods for studying linguistic mechanisms, offers strong evidence that LLMs possess genuine linguistic knowledge, and lays the foundation for more interpretable and controllable language modeling in future research.
中文: 大语言模型通过内在表征和跨语言分布模式展现出真正的语言学知识,LinguaLens框架基于稀疏自编码器和反事实分析方法,为研究语言机制提供了系统资源并奠定了可解释建模的基础。
English: Large language models demonstrate genuine linguistic knowledge through intrinsic representations and cross-lingual patterns, as revealed by LinguaLens—a systematic framework using sparse auto-encoders and counterfactual analysis across multiple linguistic dimensions.

Authors:Ruichen Zhang, Shunpu Tang, Yinqiu Liu, Dusit Niyato, Zehui Xiong, Sumei Sun, Shiwen Mao, Zhu Han
Title: Toward Agentic AI: Generative Information Retrieval Inspired Intelligent Communications and Networking
Abstract:
The increasing complexity and scale of modern telecommunications networks demand intelligent automation to enhance efficiency, adaptability, and resilience. Agentic AI has emerged as a key paradigm for intelligent communications and networking, enabling AI-driven agents to perceive, reason, decide, and act within dynamic networking environments. However, effective decision-making in telecom applications, such as network planning, management, and resource allocation, requires integrating retrieval mechanisms that support multi-hop reasoning, historical cross-referencing, and compliance with evolving 3GPP standards. This article presents a forward-looking perspective on generative information retrieval-inspired intelligent communications and networking, emphasizing the role of knowledge acquisition, processing, and retrieval in agentic AI for telecom systems. We first provide a comprehensive review of generative information retrieval strategies, including traditional retrieval, hybrid retrieval, semantic retrieval, knowledge-based retrieval, and agentic contextual retrieval. We then analyze their advantages, limitations, and suitability for various networking scenarios. Next, we present a survey about their applications in communications and networking. Additionally, we introduce an agentic contextual retrieval framework to enhance telecom-specific planning by integrating multi-source retrieval, structured reasoning, and self-reflective validation. Experimental results demonstrate that our framework significantly improves answer accuracy, explanation consistency, and retrieval efficiency compared to traditional and semantic retrieval methods. Finally, we outline future research directions.
中文: 本文探讨了生成式信息检索如何通过先进检索策略及新型框架增强电信领域智能体AI,提升网络任务决策的准确性与效率。
English: This article explores how generative information retrieval enhances agentic AI in telecommunications by improving decision-making through advanced retrieval strategies and a proposed framework that boosts accuracy and efficiency in network tasks.

Authors:Siwei Tu, Ben Fei, Weidong Yang, Fenghua Ling, Hao Chen, Zili Liu, Kun Chen, Hang Fan, Wanli Ouyang, Lei Bai
Title: Satellite Observations Guided Diffusion Model for Accurate Meteorological States at Arbitrary Resolution
Abstract:
Accurate acquisition of surface meteorological conditions at arbitrary locations holds significant importance for weather forecasting and climate simulation. Due to the fact that meteorological states derived from satellite observations are often provided in the form of low-resolution grid fields, the direct application of spatial interpolation to obtain meteorological states for specific locations often results in significant discrepancies when compared to actual observations. Existing downscaling methods for acquiring meteorological state information at higher resolutions commonly overlook the correlation with satellite observations. To bridge the gap, we propose Satellite-observations Guided Diffusion Model (SGD), a conditional diffusion model pre-trained on ERA5 reanalysis data with satellite observations (GridSat) as conditions, which is employed for sampling downscaled meteorological states through a zero-shot guided sampling strategy and patch-based methods. During the training process, we propose to fuse the information from GridSat satellite observations into ERA5 maps via the attention mechanism, enabling SGD to generate atmospheric states that align more accurately with actual conditions. In the sampling, we employed optimizable convolutional kernels to simulate the upscale process, thereby generating high-resolution ERA5 maps using low-resolution ERA5 maps as well as observations from weather stations as guidance. Moreover, our devised patch-based method promotes SGD to generate meteorological states at arbitrary resolutions. Experiments demonstrate SGD fulfills accurate meteorological states downscaling to 6.25km.
中文摘要:本文提出的卫星观测引导扩散模型(SGD)通过注意力机制和分块方法将卫星观测数据与ERA5再分析数据相融合,成功将气象数据降尺度至6.25公里分辨率,相比传统方法能更精确地反映实际气象状况。
English Summary: The proposed Satellite-observations Guided Diffusion Model (SGD) effectively downscales meteorological data to 6.25km resolution by integrating satellite observations with ERA5 reanalysis data through attention mechanisms and patch-based methods, achieving greater alignment with actual conditions than traditional approaches.

Authors:Wanghan Xu, Xiaoyu Yue, Zidong Wang, Yao Teng, Wenlong Zhang, Xihui Liu, Luping Zhou, Wanli Ouyang, Lei Bai
Title: Exploring Representation-Aligned Latent Space for Better Generation
Abstract:
Generative models serve as powerful tools for modeling the real world, with mainstream diffusion models, particularly those based on the latent diffusion model paradigm, achieving remarkable progress across various tasks, such as image and video synthesis. Latent diffusion models are typically trained using Variational Autoencoders (VAEs), interacting with VAE latents rather than the real samples. While this generative paradigm speeds up training and inference, the quality of the generated outputs is limited by the latents' quality. Traditional VAE latents are often seen as spatial compression in pixel space and lack explicit semantic representations, which are essential for modeling the real world. In this paper, we introduce ReaLS (Representation-Aligned Latent Space), which integrates semantic priors to improve generation performance. Extensive experiments show that fundamental DiT and SiT trained on ReaLS can achieve a 15% improvement in FID metric. Furthermore, the enhanced semantic latent space enables more perceptual downstream tasks, such as segmentation and depth estimation.
Chinese: ReaLS框架通过将语义先验融入潜在空间,提升了潜在扩散模型的性能,使FID指标改善15%,并增强了分割和深度估计等下游任务的感知能力。
English: The ReaLS framework enhances latent diffusion models by incorporating semantic priors into the latent space, leading to a 15% improvement in FID scores and enabling better performance in downstream tasks like segmentation and depth estimation.

Authors:Yixing Fan, Qiang Yan, Wenshan Wang, Jiafeng Guo, Ruqing Zhang, Xueqi Cheng
Title: TrustRAG: An Information Assistant with Retrieval Augmented Generation
Abstract:
\Ac{RAG} has emerged as a crucial technique for enhancing large models with real-time and domain-specific knowledge. While numerous improvements and open-source tools have been proposed to refine the \ac{RAG} framework for accuracy, relatively little attention has been given to improving the trustworthiness of generated results. To address this gap, we introduce TrustRAG, a novel framework that enhances \ac{RAG} from three perspectives: indexing, retrieval, and generation. Specifically, in the indexing stage, we propose a semantic-enhanced chunking strategy that incorporates hierarchical indexing to supplement each chunk with contextual information, ensuring semantic completeness. In the retrieval stage, we introduce a utility-based filtering mechanism to identify high-quality information, supporting answer generation while reducing input length. In the generation stage, we propose fine-grained citation enhancement, which detects opinion-bearing sentences in responses and infers citation relationships at the sentence-level, thereby improving citation accuracy. We open-source the TrustRAG framework and provide a demonstration studio designed for excerpt-based question answering tasks \footnote{https://huggingface.co/spaces/golaxy/TrustRAG}. Based on these, we aim to help researchers: 1) systematically enhancing the trustworthiness of \ac{RAG} systems and (2) developing their own \ac{RAG} systems with more reliable outputs.
Chinese: TrustRAG是一个新颖的框架,通过语义分块、效用过滤和细粒度引用的方法,在索引、检索和生成三个阶段提升RAG系统的可信度,旨在支持可靠输出和系统性开发。
English: TrustRAG is a novel framework that enhances the trustworthiness of RAG systems by improving indexing, retrieval, and generation stages with semantic chunking, utility filtering, and fine-grained citation, aiming to support reliable outputs and systematic development.

Authors:Wanqing Cui, Keping Bi, Jiafeng Guo, Xueqi Cheng
Title: Estimating Commonsense Plausibility through Semantic Shifts
Abstract:
Commonsense plausibility estimation is critical for evaluating language models (LMs), yet existing generative approaches--reliant on likelihoods or verbalized judgments--struggle with fine-grained discrimination. In this paper, we propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts when augmenting sentences with commonsense-related information. Plausible augmentations induce minimal shifts in semantics, while implausible ones result in substantial deviations. Evaluations on two types of fine-grained commonsense plausibility estimation tasks across different backbones, including LLMs and vision-language models (VLMs), show that ComPaSS consistently outperforms baselines. It demonstrates the advantage of discriminative approaches over generative methods in fine-grained commonsense plausibility evaluation. Experiments also show that (1) VLMs yield superior performance to LMs, when integrated with ComPaSS, on vision-grounded commonsense tasks. (2) contrastive pre-training sharpens backbone models' ability to capture semantic nuances, thereby further enhancing ComPaSS.
Chinese: 本文提出ComPaSS判别式框架,通过测量常识增强句子的语义偏移来量化常识合理性,在细粒度任务中优于生成式方法,并验证了视觉语言模型与对比预训练对提升语义细微差异捕捉能力的强化作用。
English: The paper introduces ComPaSS, a discriminative framework that assesses commonsense plausibility by measuring semantic shifts from augmented sentences, demonstrating superior performance over generative methods in fine-grained tasks and highlighting the benefits of vision-language models and contrastive pre-training.

Authors:Shiyu Ni, Keping Bi, Jiafeng Guo, Lulu Yu, Baolong Bi, Xueqi Cheng
Title: Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception
Abstract:
Large language models (LLMs) exhibit impressive performance across diverse tasks but often struggle to accurately gauge their knowledge boundaries, leading to confident yet incorrect responses. This paper explores leveraging LLMs' internal states to enhance their perception of knowledge boundaries from efficiency and risk perspectives. We investigate whether LLMs can estimate their confidence using internal states before response generation, potentially saving computational resources. Our experiments on datasets like Natural Questions, HotpotQA, and MMLU reveal that LLMs demonstrate significant pre-generation perception, which is further refined post-generation, with perception gaps remaining stable across varying conditions. To mitigate risks in critical domains, we introduce Confidence Consistency-based Calibration ($C^3$), which assesses confidence consistency through question reformulation. $C^3$ significantly improves LLMs' ability to recognize their knowledge gaps, enhancing the unknown perception rate by 5.6% on NQ and 4.9% on HotpotQA. Our findings suggest that pre-generation confidence estimation can optimize efficiency, while $C^3$ effectively controls output risks, advancing the reliability of LLMs in practical applications.
中文摘要:本研究通过利用大语言模型的内部状态,在生成回答前进行置信度估计以提高效率,并提出一种校准方法增强其识别知识边界的能力,从而降低关键应用中的风险。
English Summary: This study demonstrates that large language models can estimate their confidence before generating responses using internal states, improving efficiency, and introduces a calibration method that enhances their ability to recognize knowledge boundaries, thereby reducing risks in critical applications.

Authors:Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, Ying Zhang, Wenyu Liu, Qian Zhang, Xinggang Wang
Title: RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
Abstract:
Existing end-to-end autonomous driving (AD) algorithms typically follow the Imitation Learning (IL) paradigm, which faces challenges such as causal confusion and the open-loop gap. In this work, we establish a 3DGS-based closed-loop Reinforcement Learning (RL) training paradigm. By leveraging 3DGS techniques, we construct a photorealistic digital replica of the real physical world, enabling the AD policy to extensively explore the state space and learn to handle out-of-distribution scenarios through large-scale trial and error. To enhance safety, we design specialized rewards that guide the policy to effectively respond to safety-critical events and understand real-world causal relationships. For better alignment with human driving behavior, IL is incorporated into RL training as a regularization term. We introduce a closed-loop evaluation benchmark consisting of diverse, previously unseen 3DGS environments. Compared to IL-based methods, RAD achieves stronger performance in most closed-loop metrics, especially 3x lower collision rate. Abundant closed-loop results are presented at https://hgao-cv.github.io/RAD.
中文: 本文提出了一种基于3DGS的闭环强化学习自动驾驶框架,通过在逼真数字环境中进行安全探索来克服模仿学习的局限性,实现了碰撞率降低三倍的显著效果。
English: This paper introduces a 3DGS-based closed-loop reinforcement learning framework for autonomous driving that overcomes imitation learning limitations by enabling safe exploration in photorealistic environments, achieving a threefold reduction in collision rates.

Authors:Zhuoqun Li, Haiyang Yu, Xuanang Chen, Hongyu Lin, Yaojie Lu, Fei Huang, Xianpei Han, Yongbin Li, Le Sun
Title: DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking
Abstract:
Designing solutions for complex engineering challenges is crucial in human production activities. However, previous research in the retrieval-augmented generation (RAG) field has not sufficiently addressed tasks related to the design of complex engineering solutions. To fill this gap, we introduce a new benchmark, SolutionBench, to evaluate a system's ability to generate complete and feasible solutions for engineering problems with multiple complex constraints. To further advance the design of complex engineering solutions, we propose a novel system, SolutionRAG, that leverages the tree-based exploration and bi-point thinking mechanism to generate reliable solutions. Extensive experimental results demonstrate that SolutionRAG achieves state-of-the-art (SOTA) performance on the SolutionBench, highlighting its potential to enhance the automation and reliability of complex engineering solution design in real-world applications.
中文摘要:作者提出了用于评估复杂工程解决方案生成的基准SolutionBench,并开发了采用树状探索和双点思维机制的新系统SolutionRAG,该系统在基准测试中实现了最优性能。
English Summary: The authors introduce SolutionBench, a benchmark for evaluating complex engineering solution generation, and propose SolutionRAG, a novel system using tree-based exploration and bi-point thinking that achieves state-of-the-art performance.

Authors:Juntao Tan, Liangwei Yang, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Tulika Manoj Awalgaonkar, Jianguo Zhang, Weiran Yao, Ming Zhu, Shirley Kokane, Silvio Savarese, Huan Wang, Caiming Xiong, Shelby Heinecke
Title: PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data
Abstract:
Personalization is critical in AI assistants, particularly in the context of private AI models that work with individual users. A key scenario in this domain involves enabling AI models to access and interpret a user's private data (e.g., conversation history, user-AI interactions, app usage) to understand personal details such as biographical information, preferences, and social connections. However, due to the sensitive nature of such data, there are no publicly available datasets that allow us to assess an AI model's ability to understand users through direct access to personal information. To address this gap, we introduce a synthetic data generation pipeline that creates diverse, realistic user profiles and private documents simulating human activities. Leveraging this synthetic data, we present PersonaBench, a benchmark designed to evaluate AI models' performance in understanding personal information derived from simulated private user data. We evaluate Retrieval-Augmented Generation (RAG) pipelines using questions directly related to a user's personal information, supported by the relevant private documents provided to the models. Our results reveal that current retrieval-augmented AI models struggle to answer private questions by extracting personal information from user documents, highlighting the need for improved methodologies to enhance personalization capabilities in AI.
中文摘要:为解决缺乏评估AI模型从私人用户数据中理解个人信息能力的公开数据集问题,研究人员开发了合成数据生成流程和PersonaBench基准测试,发现当前检索增强模型在从用户文档中提取个人细节方面存在困难。
English Summary: To address the lack of public datasets for evaluating AI models' ability to understand personal information from private user data, researchers developed a synthetic data generation pipeline and PersonaBench benchmark, revealing that current retrieval-augmented models struggle with extracting personal details from user documents.

Authors:Jingtao Zhan, Jiahao Zhao, Jiayu Li, Yiqun Liu, Bo Zhang, Qingyao Ai, Jiaxin Mao, Hongning Wang, Min Zhang, Shaoping Ma
Title: Evaluating Intelligence via Trial and Error
Abstract:
Intelligence is a crucial trait for species to find solutions within a limited number of trial-and-error attempts. Building on this idea, we introduce Survival Game as a framework to evaluate intelligence based on the number of failed attempts in a trial-and-error process. Fewer failures indicate higher intelligence. When the expectation and variance of failure counts are both finite, it signals the ability to consistently find solutions to new challenges, which we define as the Autonomous Level of intelligence. Using Survival Game, we comprehensively evaluate existing AI systems. Our results show that while AI systems achieve the Autonomous Level in simple tasks, they are still far from it in more complex tasks, such as vision, search, recommendation, and language. While scaling current AI technologies might help, this would come at an astronomical cost. Projections suggest that achieving the Autonomous Level for general tasks would require $10^{26}$ parameters. To put this into perspective, loading such a massive model requires so many H100 GPUs that their total value is $10^{7}$ times that of Apple Inc.'s market value. Even with Moore's Law, supporting such a parameter scale would take $70$ years. This staggering cost highlights the complexity of human tasks and the inadequacies of current AI technologies. To further investigate this phenomenon, we conduct a theoretical analysis of Survival Game and its experimental results. Our findings suggest that human tasks possess a criticality property. As a result, Autonomous Level requires a deep understanding of the task's underlying mechanisms. Current AI systems, however, do not fully grasp these mechanisms and instead rely on superficial mimicry, making it difficult for them to reach an autonomous level. We believe Survival Game can not only guide the future development of AI but also offer profound insights into human intelligence.
中文摘要:生存游戏框架通过试错过程中的失败次数评估智能水平,研究表明当前人工智能系统仅在简单任务中实现自主性,而在复杂领域因依赖表面模仿而非深层任务理解仍远未达标。
English Summary: The Survival Game framework evaluates intelligence by failure counts in trial-and-error processes, revealing that current AI systems achieve autonomy only in simple tasks but fall short in complex domains due to their reliance on superficial mimicry rather than deep task understanding.

Authors:Jiachen Zhu, Congmin Zheng, Jianghao Lin, Kounianhua Du, Ying Wen, Yong Yu, Jun Wang, Weinan Zhang
Title: Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning
Abstract:
While large language models (LLMs) have significantly advanced mathematical reasoning, Process Reward Models (PRMs) have been developed to evaluate the logical validity of reasoning steps. However, PRMs still struggle with out-of-distribution (OOD) challenges. This paper identifies key OOD issues, including step OOD, caused by differences in reasoning patterns across model types and sizes, and question OOD, which arises from dataset shifts between training data and real-world problems. To address these issues, we introduce Retrieval-Augmented Process Reward Model (RetrievalPRM), a novel framework designed to tackle these OOD issues. By utilizing a two-stage retrieval-enhanced mechanism, RetrievalPRM retrieves semantically similar questions and steps as a warmup, enhancing PRM's ability to evaluate target steps and improving generalization and reasoning consistency across different models and problem types. Our extensive experiments demonstrate that RetrievalPRM outperforms existing baselines across multiple real-world datasets. Our open-source contributions include a retrieval-enhanced dataset, a tuning framework for PRM training, and the RetrievalPRM model, establishing a new standard for PRM performance.
中文: 本文提出RetrievalPRM框架,通过两阶段检索机制增强过程奖励模型处理分布外问题的能力,在多个真实数据集上的实验表明其优于现有基线,并建立了新的性能标准。
English: This paper introduces RetrievalPRM, a novel framework that addresses out-of-distribution challenges in Process Reward Models by using a two-stage retrieval mechanism to enhance reasoning consistency and generalization across different models and problem types, outperforming existing baselines in experiments.

Authors:Yunjia Xi, Muyan Weng, Wen Chen, Chao Yi, Dian Chen, Gaoyang Guo, Mao Zhang, Jian Wu, Yuning Jiang, Qingwen Liu, Yong Yu, Weinan Zhang
Title: Bursting Filter Bubble: Enhancing Serendipity Recommendations with Aligned Large Language Models
Abstract:
Recommender systems (RSs) often suffer from the feedback loop phenomenon, e.g., RSs are trained on data biased by their recommendations. This leads to the filter bubble effect that reinforces homogeneous content and reduces user satisfaction. To this end, serendipity recommendations, which offer unexpected yet relevant items, are proposed. Recently, large language models (LLMs) have shown potential in serendipity prediction due to their extensive world knowledge and reasoning capabilities. However, they still face challenges in aligning serendipity judgments with human assessments, handling long user behavior sequences, and meeting the latency requirements of industrial RSs. To address these issues, we propose SERAL (Serendipity Recommendations with Aligned Large Language Models), a framework comprising three stages: (1) Cognition Profile Generation to compress user behavior into multi-level profiles; (2) SerenGPT Alignment to align serendipity judgments with human preferences using enriched training data; and (3) Nearline Adaptation to integrate SerenGPT into industrial RSs pipelines efficiently. Online experiments demonstrate that SERAL improves exposure ratio (PVR), clicks, and transactions of serendipitous items by 5.7%, 29.56%, and 27.6%, enhancing user experience without much impact on overall revenue. Now, it has been fully deployed in the "Guess What You Like" of the Taobao App homepage.
中文:为解决推荐系统中的过滤气泡效应,SERAL框架利用大语言模型通过行为压缩、人类偏好对齐和高效工业集成来生成意外相关推荐,在实际应用中显著提升了用户互动指标。
English: To address the filter bubble effect in recommender systems, the SERAL framework leverages large language models to generate serendipitous recommendations through behavior compression, human-aligned judgments, and efficient industrial integration, significantly improving user engagement metrics in real-world deployment.

Authors:Jingxiao Chen, Xinyao Li, Jiahang Cao, Zhengbang Zhu, Wentao Dong, Minghuan Liu, Ying Wen, Yong Yu, Liqing Zhang, Weinan Zhang
Title: RHINO: Learning Real-Time Humanoid-Human-Object Interaction from Human Demonstrations
Abstract:
Humanoid robots have shown success in locomotion and manipulation. Despite these basic abilities, humanoids are still required to quickly understand human instructions and react based on human interaction signals to become valuable assistants in human daily life. Unfortunately, most existing works only focus on multi-stage interactions, treating each task separately, and neglecting real-time feedback. In this work, we aim to empower humanoid robots with real-time reaction abilities to achieve various tasks, allowing human to interrupt robots at any time, and making robots respond to humans immediately. To support such abilities, we propose a general humanoid-human-object interaction framework, named RHINO, i.e., Real-time Humanoid-human Interaction and Object manipulation. RHINO provides a unified view of reactive motion, instruction-based manipulation, and safety concerns, over multiple human signal modalities, such as languages, images, and motions. RHINO is a hierarchical learning framework, enabling humanoids to learn reaction skills from human-human-object demonstrations and teleoperation data. In particular, it decouples the interaction process into two levels: 1) a high-level planner inferring human intentions from real-time human behaviors; and 2) a low-level controller achieving reactive motion behaviors and object manipulation skills based on the predicted intentions. We evaluate the proposed framework on a real humanoid robot and demonstrate its effectiveness, flexibility, and safety in various scenarios.
中文摘要:本研究提出RHINO分层框架,通过解析多模态人类信号并将交互过程解耦为高层意图推断与底层运动控制,使人形机器人能够执行实时反应任务。
English Summary: This work introduces RHINO, a hierarchical framework enabling humanoid robots to perform real-time reactive tasks by interpreting multimodal human signals and decoupling interactions into high-level intention inference and low-level motion control.

Authors:Kounianhua Du, Hanjing Wang, Jianxing Liu, Jizheng Chen, Xinyi Dai, Yasheng Wang, Ruiming Tang, Yong Yu, Jun Wang, Weinan Zhang
Title: Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in various domains, particularly in system 1 tasks, yet the intricacies of their problem-solving mechanisms in system 2 tasks are not sufficiently explored. Recent research on System2-to-System1 methods surge, exploring the System 2 reasoning knowledge via inference-time computation and compressing the explored knowledge into System 1 process. In this paper, we focus on code generation, which is a representative System 2 task, and identify two primary challenges: (1) the complex hidden reasoning processes and (2) the heterogeneous data distributions that complicate the exploration and training of robust LLM solvers. To tackle these issues, we propose a novel BDC framework that explores insightful System 2 knowledge of LLMs using a MC-Tree-Of-Agents algorithm with mutual \textbf{B}oosting, \textbf{D}isentangles the heterogeneous training data for composable LoRA-experts, and obtain \textbf{C}ustomized problem solver for each data instance with an input-aware hypernetwork to weight over the LoRA-experts, offering effectiveness, flexibility, and robustness. This framework leverages multiple LLMs through mutual verification and boosting, integrated into a Monte-Carlo Tree Search process enhanced by reflection-based pruning and refinement. Additionally, we introduce the DisenLora algorithm, which clusters heterogeneous data to fine-tune LLMs into composable Lora experts, enabling the adaptive generation of customized problem solvers through an input-aware hypernetwork. This work lays the groundwork for advancing LLM capabilities in complex reasoning tasks, offering a novel System2-to-System1 solution.
中文: 本文提出新颖的BDC框架,通过相互增强算法探索系统2推理知识,并解构异构数据以生成定制化问题求解器,从而提升大语言模型在复杂代码生成任务中的表现。
English: This paper introduces a novel BDC framework that enhances large language models' performance in complex code generation by exploring System 2 reasoning through mutual boosting algorithms and disentangling heterogeneous data for customized problem solvers.

Authors:Hanxing Ding, Shuchang Tao, Liang Pang, Zihao Wei, Jinyang Gao, Bolin Ding, Huawei Shen, Xueqi Cheng
Title: ToolCoder: A Systematic Code-Empowered Tool Learning Framework for Large Language Models
Abstract:
Tool learning has emerged as a crucial capability for large language models (LLMs) to solve complex real-world tasks through interaction with external tools. Existing approaches face significant challenges, including reliance on hand-crafted prompts, difficulty in multi-step planning, and lack of precise error diagnosis and reflection mechanisms. We propose ToolCoder, a novel framework that reformulates tool learning as a code generation task. Inspired by software engineering principles, ToolCoder transforms natural language queries into structured Python function scaffold and systematically breaks down tasks with descriptive comments, enabling LLMs to leverage coding paradigms for complex reasoning and planning. It then generates and executes function implementations to obtain final responses. Additionally, ToolCoder stores successfully executed functions in a repository to promote code reuse, while leveraging error traceback mechanisms for systematic debugging, optimizing both execution efficiency and robustness. Experiments demonstrate that ToolCoder achieves superior performance in task completion accuracy and execution reliability compared to existing approaches, establishing the effectiveness of code-centric approaches in tool learning.
中文: ToolCoder是一种创新框架,将工具学习重新定义为代码生成任务,借鉴软件工程原理提升复杂推理、规划和错误处理能力,在任务完成准确性和执行可靠性方面优于现有方法。
English: ToolCoder is a novel framework that reformulates tool learning as a code generation task, leveraging software engineering principles to enhance complex reasoning, planning, and error handling, achieving superior performance in task completion accuracy and execution reliability compared to existing methods.

Authors:Jingcheng Deng, Zhongtao Jiang, Liang Pang, Liwei Chen, Kun Xu, Zihao Wei, Huawei Shen, Xueqi Cheng
Title: Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment
Abstract:
A new trend uses LLMs as dense text encoders via contrastive learning. However, since LLM embeddings predict the probability distribution of the next token, they are inherently generative and distributive, conflicting with contrastive learning, which requires embeddings to capture full-text semantics and align via cosine similarity. This discrepancy hinders the full utilization of LLMs' pre-training capabilities, resulting in inefficient learning. In response to this issue, we propose AutoRegEmbed, a new contrastive learning method built on embedding conditional probability distributions, which integrates two core tasks: information compression and conditional distribution alignment. The information compression task encodes text into the embedding space, ensuring that the embedding vectors capture global semantics. The conditional distribution alignment task focuses on aligning text embeddings with positive samples embeddings by leveraging the conditional distribution of embeddings while simultaneously reducing the likelihood of generating negative samples from text embeddings, thereby achieving embedding alignment and uniformity. Experimental results demonstrate that our method significantly outperforms traditional contrastive learning approaches and achieves performance comparable to state-of-the-art models when using the same amount of data.
中文摘要:AutoRegEmbed是一种基于嵌入条件概率分布的新型对比学习方法,通过整合信息压缩和条件分布对齐两项核心任务,有效解决了大语言模型生成式嵌入与对比学习的不兼容问题,在相同数据量下显著超越传统方法并达到先进模型性能。
English Summary: AutoRegEmbed is a novel contrastive learning method that addresses the incompatibility between LLMs' generative embeddings and contrastive learning by integrating information compression and conditional distribution alignment, achieving superior performance over traditional approaches and matching state-of-the-art results with equivalent data.

Authors:Hanxing Ding, Shuchang Tao, Liang Pang, Zihao Wei, Liwei Chen, Kun Xu, Huawei Shen, Xueqi Cheng
Title: Revisiting Robust RAG: Do We Still Need Complex Robust Training in the Era of Powerful LLMs?
Abstract:
Retrieval-augmented generation (RAG) systems often suffer from performance degradation when encountering noisy or irrelevant documents, driving researchers to develop sophisticated training strategies to enhance their robustness against such retrieval noise. However, as large language models (LLMs) continue to advance, the necessity of these complex training methods is increasingly questioned. In this paper, we systematically investigate whether complex robust training strategies remain necessary as model capacity grows. Through comprehensive experiments spanning multiple model architectures and parameter scales, we evaluate various document selection methods and adversarial training techniques across diverse datasets. Our extensive experiments consistently demonstrate that as models become more powerful, the performance gains brought by complex robust training methods drop off dramatically. We delve into the rationale and find that more powerful models inherently exhibit superior confidence calibration, better generalization across datasets (even when trained with randomly selected documents), and optimal attention mechanisms learned with simpler strategies. Our findings suggest that RAG systems can benefit from simpler architectures and training strategies as models become more powerful, enabling more scalable applications with minimal complexity.
中文: 随着语言模型能力的增强,检索增强生成系统中复杂鲁棒训练方法的边际效益显著降低,更强大的模型凭借其更好的置信度校准和泛化能力,仅通过简单训练即可达到相当甚至更优的性能。
English: As language models grow more powerful, the marginal benefits of complex robust training methods in retrieval-augmented generation systems substantially diminish, with larger models achieving comparable performance using simpler approaches due to their inherent capabilities like better confidence calibration and generalization.

Authors:Hanxing Ding, Shuchang Tao, Liang Pang, Zihao Wei, Liwei Chen, Kun Xu, Huawei Shen, Xueqi Cheng
Title: On the Diminishing Returns of Complex Robust RAG Training in the Era of Powerful LLMs
Abstract:
Retrieval-augmented generation (RAG) systems traditionally employ sophisticated training strategies to enhance robustness against retrieval noise. In this work, we investigate a critical question: does the benefit of these complex robust training methods diminish as language models become more powerful? Through systematic evaluation across multiple model scales and question-answering datasets, our analysis reveals a consistent trend: \emph{the marginal robustness benefit of sophisticated training strategies decreases substantially as model capacity increases.} While smaller models show significant performance improvements from complex document selection and adversarial objectives, more capable models achieve comparable or even superior performance with simpler training approaches. Further investigation demonstrates that stronger models naturally exhibit better confidence calibration, cross-dataset generalization capability, and more effective attention patterns, even under simple training regimes. These findings suggest that as foundation models evolve, the engineering effort invested in complex robust training may yield diminishing returns, indicating that simplified RAG pipelines could suffice for powerful models while maintaining competitive performance.
中文: 随着语言模型能力的增强,检索增强生成系统中复杂鲁棒训练方法的边际效益显著降低,更强大的模型凭借其更好的置信度校准和泛化能力,仅通过简单训练即可达到相当甚至更优的性能。
English: As language models grow more powerful, the marginal benefits of complex robust training methods in retrieval-augmented generation systems substantially diminish, with larger models achieving comparable performance using simpler approaches due to their inherent capabilities like better confidence calibration and generalization.

Authors:Wanli Yang, Fei Sun, Jiajun Tan, Xinyu Ma, Qi Cao, Dawei Yin, Huawei Shen, Xueqi Cheng
Title: The Mirage of Model Editing: Revisiting Evaluation in the Wild
Abstract:
Despite near-perfect results reported in the literature, the effectiveness of model editing in real-world applications remains unclear. To bridge this gap, we introduce QAEdit, a new benchmark aligned with widely used question answering (QA) datasets, and WILD, a task-agnostic evaluation framework designed to better reflect real-world usage of model editing. Our single editing experiments show that current editing methods perform substantially worse than previously reported (38.5% vs. 96.8%). We demonstrate that it stems from issues in the synthetic evaluation practices of prior work. Among them, the most severe is the use of teacher forcing during testing, which leaks both content and length of the ground truth, leading to overestimated performance. Furthermore, we simulate practical deployment by sequential editing, revealing that current approaches fail drastically with only 1000 edits. This work calls for a shift in model editing research toward rigorous evaluation and the development of robust, scalable methods that can reliably update knowledge in LLMs for real-world use.
Chinese: 现有模型编辑方法在真实场景中表现远低于预期(成功率38.5% vs 报告值96.8%),主要源于评估方法的缺陷和较差的扩展性,亟需建立更严谨的评估标准及开发鲁棒的编辑技术。
English: Current model editing methods significantly underperform in real-world scenarios, with a 38.5% success rate versus the reported 96.8%, due to flawed evaluation practices and poor scalability, necessitating more rigorous benchmarks and robust techniques.

Authors:Haowen Gao, Liang Pang, Shicheng Xu, Leigang Qu, Tat-Seng Chua, Huawei Shen, Xueqi Cheng
Title: Generative Ghost: Investigating Ranking Bias Hidden in AI-Generated Videos
Abstract:
With the rapid development of AI-generated content (AIGC), the creation of high-quality AI-generated videos has become faster and easier, resulting in the Internet being flooded with all kinds of video content. However, the impact of these videos on the content ecosystem remains largely unexplored. Video information retrieval remains a fundamental approach for accessing video content. Building on the observation that retrieval models often favor AI-generated content in ad-hoc and image retrieval tasks, we investigate whether similar biases emerge in the context of challenging video retrieval, where temporal and visual factors may further influence model behavior. To explore this, we first construct a comprehensive benchmark dataset containing both real and AI-generated videos, along with a set of fair and rigorous metrics to assess bias. This benchmark consists of 13,000 videos generated by two state-of-the-art open-source video generation models. We meticulously design a suite of rigorous metrics to accurately measure this preference, accounting for potential biases arising from the limited frame rate and suboptimal quality of AIGC videos. We then applied three off-the-shelf video retrieval models to perform retrieval tasks on this hybrid dataset. Our findings reveal a clear preference for AI-generated videos in retrieval. Further investigation shows that incorporating AI-generated videos into the training set of retrieval models exacerbates this bias. Unlike the preference observed in image modalities, we find that video retrieval bias arises from both unseen visual and temporal information, making the root causes of video bias a complex interplay of these two factors. To mitigate this bias, we fine-tune the retrieval models using a contrastive learning approach. The results of this study highlight the potential implications of AI-generated videos on retrieval systems.
中文: 研究表明,视频检索模型因视觉和时间因素明显偏向AI生成内容,通过对比学习微调可缓解此偏差。
English: The study reveals that video retrieval models exhibit a clear bias favoring AI-generated content due to both visual and temporal factors, which can be mitigated through contrastive learning fine-tuning.

Authors:Zenghao Duan, Wenbin Duan, Zhiyi Yin, Yinghan Shen, Shaoling Jing, Jie Zhang, Huawei Shen, Xueqi Cheng
Title: Related Knowledge Perturbation Matters: Rethinking Multiple Pieces of Knowledge Editing in Same-Subject
Abstract:
Knowledge editing has become a promising approach for efficiently and precisely updating knowledge embedded in large language models (LLMs). In this work, we focus on Same-Subject Editing, which involves modifying multiple attributes of a single entity to ensure comprehensive and consistent updates to entity-centric knowledge. Through preliminary observation, we identify a significant challenge: Current state-of-the-art editing methods struggle when tasked with editing multiple related knowledge pieces for the same subject. To address the lack of relevant editing data for identical subjects in traditional benchmarks, we introduce the $\text{S}^2\text{RKE}$(Same-Subject Related Knowledge Editing) benchmark. Our extensive experiments reveal that only mainstream locate-then-edit methods, such as ROME and MEMIT, exhibit "related knowledge perturbation," where subsequent edits interfere with earlier ones. Further analysis reveals that these methods over-rely on subject information, neglecting other critical factors, resulting in reduced editing effectiveness.
中文: 大语言模型中的知识编辑在更新单个实体的多个属性时面临挑战,为此引入的S²RKE基准测试表明,现有方法如ROME和MEMIT因过度依赖主体信息而产生相关知识干扰问题。
English: Knowledge editing in large language models faces challenges in updating multiple attributes of a single entity, leading to the introduction of the S²RKE benchmark which reveals that current methods like ROME and MEMIT suffer from related knowledge perturbation due to over-reliance on subject information.

Authors:Bo Pang, Hanze Dong, Jiacheng Xu, Silvio Savarese, Yingbo Zhou, Caiming Xiong
Title: BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation
Abstract:
Large language models (LLMs), such as o1 from OpenAI, have demonstrated remarkable reasoning capabilities. o1 generates a long chain-of-thought (LongCoT) before answering a question. LongCoT allows LLMs to analyze problems, devise plans, reflect, and backtrack effectively. These actions empower LLM to solve complex problems. After the release of o1, many teams have attempted to replicate its LongCoT and reasoning capabilities. In terms of methods, they primarily rely on knowledge distillation with data from existing models with LongCoT capacities (e.g., OpenAI-o1, Qwen-QwQ, DeepSeek-R1-Preview), leaving significant uncertainties on systematically developing such reasoning abilities. In terms of data domains, these works focus narrowly on math while a few others include coding, limiting their generalizability. This paper introduces a novel approach to enable LLM's LongCoT capacity without distillation from o1-like models or expensive human annotations, where we bootstrap LongCoT (BOLT) from a standard instruct model. BOLT involves three stages: 1) LongCoT data bootstrapping with in-context learning on a standard instruct model; 2) LongCoT supervised finetuning; 3) online training to further refine LongCoT capacities. In BOLT, only a few in-context examples need to be constructed during the bootstrapping stage; in our experiments, we created 10 examples, demonstrating the feasibility of this approach. We use Llama-3.1-70B-Instruct to bootstrap LongCoT and apply our method to various model scales (7B, 8B, 70B). We achieve impressive performance on a variety of benchmarks, Arena-Hard, MT-Bench, WildBench, ZebraLogic, MATH500, which evaluate diverse task-solving and reasoning capabilities.
中文: 本文提出BOLT方法,无需依赖先进模型蒸馏或昂贵人工标注,通过三阶段训练自举实现大语言模型的长思维链推理能力,在多项基准测试中展现出卓越性能。
English: This paper introduces BOLT, a novel method that bootstraps long chain-of-thought reasoning in LLMs without relying on distillation from advanced models or costly human annotations, achieving strong performance across multiple benchmarks through a three-stage training process.

Authors:Yingxuan Yang, Bo Huang, Siyuan Qi, Chao Feng, Haoyi Hu, Yuxuan Zhu, Jinbo Hu, Haoran Zhao, Ziyi He, Xiao Liu, Zongyu Wang, Lin Qiu, Xuezhi Cao, Xunliang Cai, Yong Yu, Weinan Zhang
Title: Who's the MVP? A Game-Theoretic Evaluation Benchmark for Modular Attribution in LLM Agents
Abstract:
Large Language Model (LLM) agents frameworks often employ modular architectures, incorporating components such as planning, reasoning, action execution, and reflection to tackle complex tasks. However, quantifying the contribution of each module to overall system performance remains a significant challenge, impeding optimization and interpretability. To address this, we introduce CapaBench (Capability-level Assessment Benchmark), an evaluation framework grounded in cooperative game theory's Shapley Value, which systematically measures the marginal impact of individual modules and their interactions within an agent's architecture. By replacing default modules with test variants across all possible combinations, CapaBench provides a principle method for attributing performance contributions. Key contributions include: (1) We are the first to propose a Shapley Value-based methodology for quantifying the contributions of capabilities in LLM agents; (2) Modules with high Shapley Values consistently lead to predictable performance gains when combined, enabling targeted optimization; and (3) We build a multi-round dataset of over 1,500 entries spanning diverse domains and practical task scenarios, enabling comprehensive evaluation of agent capabilities. CapaBench bridges the gap between component-level evaluation and holistic system assessment, providing actionable insights for optimizing modular LLM agents and advancing their deployment in complex, real-world scenarios.
Chinese: CapaBench基于沙普利值提出评估框架,系统量化LLM智能体中各模块及其交互的边际贡献,为优化模块化架构和复杂场景部署提供可操作的评估方法。
English: CapaBench introduces a Shapley Value-based framework to systematically quantify the contributions of individual modules and their interactions in LLM agents, enabling targeted optimization and comprehensive evaluation across diverse tasks.

Authors:Kaike Zhang, Qi Cao, Yunfan Wu, Fei Sun, Huawei Shen, Xueqi Cheng
Title: Personalized Denoising Implicit Feedback for Robust Recommender System
Abstract:
While implicit feedback is foundational to modern recommender systems, factors such as human error, uncertainty, and ambiguity in user behavior inevitably introduce significant noise into this feedback, adversely affecting the accuracy and robustness of recommendations. To address this issue, existing methods typically aim to reduce the training weight of noisy feedback or discard it entirely, based on the observation that noisy interactions often exhibit higher losses in the overall loss distribution. However, we identify two key issues: (1) there is a significant overlap between normal and noisy interactions in the overall loss distribution, and (2) this overlap becomes even more pronounced when transitioning from pointwise loss functions (e.g., BCE loss) to pairwise loss functions (e.g., BPR loss). This overlap leads traditional methods to misclassify noisy interactions as normal, and vice versa. To tackle these challenges, we further investigate the loss overlap and find that for a given user, there is a clear distinction between normal and noisy interactions in the user's personal loss distribution. Based on this insight, we propose a resampling strategy to Denoise using the user's Personal Loss distribution, named PLD, which reduces the probability of noisy interactions being optimized. Specifically, during each optimization iteration, we create a candidate item pool for each user and resample the items from this pool based on the user's personal loss distribution, prioritizing normal interactions. Additionally, we conduct a theoretical analysis to validate PLD's effectiveness and suggest ways to further enhance its performance. Extensive experiments conducted on three datasets with varying noise ratios demonstrate PLD's efficacy and robustness.
中文摘要:本文提出PLD方法,通过利用用户个人损失分布来区分并优先处理正常交互而非噪声数据,从而在不同噪声水平下有效提升推荐系统的准确性与鲁棒性。
English Summary: This paper introduces PLD, a novel resampling strategy that leverages users' personal loss distributions to distinguish and prioritize normal interactions over noisy ones, thereby enhancing recommendation accuracy and robustness across various noise levels.

Authors:Xinning Zhou, Chengyang Ying, Yao Feng, Hang Su, Jun Zhu
Title: Self-Consistent Model-based Adaptation for Visual Reinforcement Learning
Abstract:
Visual reinforcement learning agents typically face serious performance declines in real-world applications caused by visual distractions. Existing methods rely on fine-tuning the policy's representations with hand-crafted augmentations. In this work, we propose Self-Consistent Model-based Adaptation (SCMA), a novel method that fosters robust adaptation without modifying the policy. By transferring cluttered observations to clean ones with a denoising model, SCMA can mitigate distractions for various policies as a plug-and-play enhancement. To optimize the denoising model in an unsupervised manner, we derive an unsupervised distribution matching objective with a theoretical analysis of its optimality. We further present a practical algorithm to optimize the objective by estimating the distribution of clean observations with a pre-trained world model. Extensive experiments on multiple visual generalization benchmarks and real robot data demonstrate that SCMA effectively boosts performance across various distractions and exhibits better sample efficiency.
中文: SCMA作为一种即插即用的方法,通过无监督去噪模型将杂乱观测转换为清晰观测,无需修改策略即可增强视觉强化学习智能体对干扰的鲁棒性。
English: SCMA is a plug-and-play method that enhances visual reinforcement learning agents' robustness to distractions by transferring cluttered observations to clean ones using an unsupervised denoising model, without requiring policy modifications.

Authors:Chengyang Ying, Huayu Chen, Xinning Zhou, Zhongkai Hao, Hang Su, Jun Zhu
Title: Exploratory Diffusion Model for Unsupervised Reinforcement Learning
Abstract:
Unsupervised reinforcement learning (URL) aims to pre-train agents by exploring diverse states or skills in reward-free environments, facilitating efficient adaptation to downstream tasks. As the agent cannot access extrinsic rewards during unsupervised exploration, existing methods design intrinsic rewards to model the explored data and encourage further exploration. However, the explored data are always heterogeneous, posing the requirements of powerful representation abilities for both intrinsic reward models and pre-trained policies. In this work, we propose the Exploratory Diffusion Model (ExDM), which leverages the strong expressive ability of diffusion models to fit the explored data, simultaneously boosting exploration and providing an efficient initialization for downstream tasks. Specifically, ExDM can accurately estimate the distribution of collected data in the replay buffer with the diffusion model and introduces the score-based intrinsic reward, encouraging the agent to explore less-visited states. After obtaining the pre-trained policies, ExDM enables rapid adaptation to downstream tasks. In detail, we provide theoretical analyses and practical algorithms for fine-tuning diffusion policies, addressing key challenges such as training instability and computational complexity caused by multi-step sampling. Extensive experiments demonstrate that ExDM outperforms existing SOTA baselines in efficient unsupervised exploration and fast fine-tuning downstream tasks, especially in structurally complicated environments.
Chinese Summary: 探索性扩散模型(ExDM)利用扩散模型的强大表达能力来拟合探索数据,通过基于分数的内在奖励促进探索,并为下游任务提供快速适应的预训练策略。
English Summary: The Exploratory Diffusion Model (ExDM) leverages diffusion models to enhance unsupervised reinforcement learning by accurately modeling explored data and introducing score-based intrinsic rewards, enabling efficient exploration and rapid adaptation to downstream tasks.

Authors:Boxiong Wang, Hui Kang, Jiahui Li, Geng Sun, Zemin Sun, Jiacheng Wang, Dusit Niyato
Title: UAV-assisted Joint Mobile Edge Computing and Data Collection via Matching-enabled Deep Reinforcement Learning
Abstract:
Unmanned aerial vehicle (UAV)-assisted mobile edge computing (MEC) and data collection (DC) have been popular research issues. Different from existing works that consider MEC and DC scenarios separately, this paper investigates a multi-UAV-assisted joint MEC-DC system. Specifically, we formulate a joint optimization problem to minimize the MEC latency and maximize the collected data volume. This problem can be classified as a non-convex mixed integer programming problem that exhibits long-term optimization and dynamics. Thus, we propose a deep reinforcement learning-based approach that jointly optimizes the UAV movement, user transmit power, and user association in real time to solve the problem efficiently. Specifically, we reformulate the optimization problem into an action space-reduced Markov decision process (MDP) and optimize the user association by using a two-phase matching-based association (TMA) strategy. Subsequently, we propose a soft actor-critic (SAC)-based approach that integrates the proposed TMA strategy (SAC-TMA) to solve the formulated joint optimization problem collaboratively. Simulation results demonstrate that the proposed SAC-TMA is able to coordinate the two subsystems and can effectively reduce the system latency and improve the data collection volume compared with other benchmark algorithms.
中文摘要:本文提出一种基于深度强化学习的多无人机系统,通过协调无人机移动、传输功率和用户关联,联合优化移动边缘计算延迟与数据收集量,相比基准算法显著降低了系统延迟并提高了数据收集量。
English Summary: This paper introduces a deep reinforcement learning-based approach for a multi-UAV system that jointly optimizes mobile edge computing latency and data collection volume, demonstrating superior performance through coordinated UAV movement, transmit power control, and user association.

Authors:Zihao Li, Ruixiang Tang, Lu Cheng, Shuaiqiang Wang, Dawei Yin, Mengnan Du
Title: DBR: Divergence-Based Regularization for Debiasing Natural Language Understanding Models
Abstract:
Pre-trained language models (PLMs) have achieved impressive results on various natural language processing tasks. However, recent research has revealed that these models often rely on superficial features and shortcuts instead of developing a genuine understanding of language, especially for natural language understanding (NLU) tasks. Consequently, the models struggle to generalize to out-of-domain data. In this work, we propose Divergence Based Regularization (DBR) to mitigate this shortcut learning behavior. Our method measures the divergence between the output distributions for original examples and examples where shortcut tokens have been masked. This process prevents the model's predictions from being overly influenced by shortcut features or biases. We evaluate our model on three NLU tasks and find that it improves out-of-domain performance with little loss of in-domain accuracy. Our results demonstrate that reducing the reliance on shortcuts and superficial features can enhance the generalization ability of large pre-trained language models.
中文: 本研究提出的基于差异的正则化方法通过测量原始样本与屏蔽捷径标记样本的输出分布差异,有效减少了预训练语言模型对表面特征的依赖,在保持域内性能的同时显著提升了跨域泛化能力。
English: The proposed Divergence Based Regularization (DBR) method reduces shortcut learning in pre-trained language models by measuring output divergence between original and shortcut-masked examples, thereby improving out-of-domain generalization with minimal impact on in-domain performance.

Authors:Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha
Title: Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
Abstract:
Large Reasoning Models (LRMs) have recently demonstrated impressive performances across diverse domains. However, how the safety of Large Language Models (LLMs) benefits from enhanced reasoning capabilities against jailbreak queries remains unexplored. To bridge this gap, in this paper, we propose Reasoning-to-Defend (R2D), a novel training paradigm that integrates a safety-aware reasoning mechanism into LLMs' generation process. This enables self-evaluation at each step of the reasoning process, forming safety pivot tokens as indicators of the safety status of responses. Furthermore, in order to improve the accuracy of predicting pivot tokens, we propose Contrastive Pivot Optimization (CPO), which enhances the model's perception of the safety status of given dialogues. LLMs dynamically adjust their response strategies during reasoning, significantly enhancing their safety capabilities defending jailbreak attacks. Extensive experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety, while maintaining the original performances. This highlights the substantial potential of safety-aware reasoning in improving robustness of LRMs and LLMs against various jailbreaks.
中文摘要:本文提出“推理防御”(R2D)训练范式,通过将安全感知推理机制融入大语言模型的生成过程,使其在推理步骤中自我评估并动态调整响应策略,实验证明该方法能有效抵御越狱攻击且不损害模型原有性能。
English Summary: The paper introduces Reasoning-to-Defend (R2D), a training method that enhances LLM safety against jailbreak attacks by incorporating safety-aware reasoning and self-evaluation during generation, with experiments showing improved robustness without compromising performance.

Authors:Yudi Zhang, Lu Wang, Meng Fang, Yali Du, Chenghua Huang, Jun Wang, Qingwei Lin, Mykola Pechenizkiy, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
Title: Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?
Abstract:
Distilling large language models (LLMs) typically involves transferring the teacher model's responses through supervised fine-tuning (SFT). However, this approach neglects the potential to distill both data (output content) and reward signals (quality evaluations). Extracting reliable reward signals directly from teacher models is challenging, as LLMs are optimized for generation rather than evaluation, often resulting in biased or inconsistent assessments. To address this limitation, we propose a novel distillation pipeline that transfers both responses and rewards. Our method generates pseudo-rewards through a self-supervised mechanism that leverages the inherent structure of both teacher and student responses, enabling reward learning without explicit external evaluation. The reward model subsequently guides reinforcement learning (RL), allowing iterative refinement of the student model after an SFT warm-up phase. Experiments on GSM8K and MMLU-PRO demonstrate that our method consistently outperforms traditional SFT-based approaches, enabling student models to surpass the performance of their teachers. This work highlights the potential for scalable, efficient distillation through structured self-supervised reward learning, reducing dependence on external reward supervision.
中文摘要:本文提出一种新颖的蒸馏方法,通过传输教师模型响应和自监督伪奖励信号来指导学生模型优化,在超越传统监督微调方法的同时实现了学生模型性能反超教师模型。
English Summary: This paper introduces a novel distillation method that transfers both teacher model responses and self-supervised pseudo-rewards to guide student model refinement, outperforming traditional supervised fine-tuning approaches and enabling students to surpass teacher performance.

Authors:Jiani Zheng, Lu Wang, Fangkai Yang, Chaoyun Zhang, Lingrui Mei, Wenjie Yin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
Title: VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
Abstract:
Training Vision-Language Models (VLMs) for Graphical User Interfaces (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples value estimation from policy optimization by leveraging a pretrained Value Environment Model (VEM). VEM predicts state-action values directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and enhances resilience to UI changes by focusing on semantic reasoning (e.g., Does this action advance the user's goal?). The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities and (2) guiding policy exploration with frozen VEM signals, enabling layout-agnostic GUI automation. Evaluated on Android-in-the-Wild benchmarks, VEM achieves state-of-the-art performance in both offline and online settings, outperforming environment-free baselines significantly and matching environment-based approaches without interaction costs. Importantly, VEM demonstrates that semantic-aware value estimation can achieve comparable performance with online-trained methods.
中文摘要:本文提出了一种无需环境的强化学习框架,通过预训练的价值环境模型直接从离线数据预测动作价值,实现了与界面布局无关的图形界面自动化,在避免环境交互成本的同时,在Android基准测试中取得了最优性能。
English Summary: This paper introduces an environment-free reinforcement learning framework using a pretrained Value Environment Model (VEM) that predicts action values directly from offline data, enabling layout-agnostic GUI automation while avoiding costly environmental interactions and achieving state-of-the-art performance on Android benchmarks.

Authors:Chenghua Huang, Lu Wang, Fangkai Yang, Pu Zhao, Zhixu Li, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
Title: Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance
Abstract:
Proximal Policy Optimization (PPO)-based Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. This approach increases computational complexity and instability due to actor-critic interdependence. Additionally, PPO lacks access to true environment rewards in LLM tasks, limiting its adaptability. Under such conditions, pretraining a value model or a reward model becomes equivalent, as both provide fixed supervisory signals without new ground-truth feedback. To address these issues, we propose \textbf{Decoupled Value Policy Optimization (DVPO)}, a lean framework that replaces traditional reward modeling with a pretrained \emph{global value model (GVM)}. The GVM is conditioned on policy trajectories and predicts token-level return-to-go estimates. By decoupling value model from policy training (via frozen GVM-driven RL objectives), DVPO eliminates actor-critic interdependence, reducing GPU memory usage by 40\% and training time by 35\% compared to conventional RLHF. Experiments across benchmarks show DVPO outperforms efficient RLHF methods (e.g., DPO) while matching state-of-the-art PPO in performance.
中文摘要:提出的解耦价值策略优化(DVPO)框架使用预训练的全局价值模型替代传统奖励建模,消除了行动者-评论者相互依赖,在保持PPO性能的同时将GPU内存降低40%、训练时间减少35%。
English Summary: The proposed Decoupled Value Policy Optimization (DVPO) framework replaces traditional reward modeling with a pretrained global value model to eliminate actor-critic interdependence, reducing computational costs by 40% in GPU memory and 35% in training time while matching PPO performance.

Authors:Lingxiang Hu, Shurun Yuan, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
Title: MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf
Abstract:
In contemporary workplaces, meetings are essential for exchanging ideas and ensuring team alignment but often face challenges such as time consumption, scheduling conflicts, and inefficient participation. Recent advancements in Large Language Models (LLMs) have demonstrated their strong capabilities in natural language generation and reasoning, prompting the question: can LLMs effectively delegate participants in meetings? To explore this, we develop a prototype LLM-powered meeting delegate system and create a comprehensive benchmark using real meeting transcripts. Our evaluation reveals that GPT-4/4o maintain balanced performance between active and cautious engagement strategies. In contrast, Gemini 1.5 Pro tends to be more cautious, while Gemini 1.5 Flash and Llama3-8B/70B display more active tendencies. Overall, about 60\% of responses address at least one key point from the ground-truth. However, improvements are needed to reduce irrelevant or repetitive content and enhance tolerance for transcription errors commonly found in real-world settings. Additionally, we implement the system in practical settings and collect real-world feedback from demos. Our findings underscore the potential and challenges of utilizing LLMs as meeting delegates, offering valuable insights into their practical application for alleviating the burden of meetings.
中文: 本研究探讨利用大型语言模型作为会议代理,发现尽管GPT-4等模型能平衡参与策略并在约60%的回复中触及关键点,但仍需改进以减少无关内容并提升对现实场景中转录错误的容错能力。
English: This study explores using Large Language Models (LLMs) as meeting delegates, finding that while models like GPT-4 balance engagement strategies and address key points in about 60% of responses, they require improvements to reduce irrelevant content and handle transcription errors for real-world use.

Authors:Xueru Wen, Jie Lou, Zichao Li, Yaojie Lu, Xing Yu, Yuqiu Ji, Guohai Xu, Hongyu Lin, Ben He, Xianpei Han, Le Sun, Debing Zhang
Title: Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch
Abstract:
Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. However, most RM research is centered on English and relies heavily on synthetic resources, which leads to limited and less reliable datasets and benchmarks for Chinese. To address this gap, we introduce CheemsBench, a fully human-annotated RM evaluation benchmark within Chinese contexts, and CheemsPreference, a large-scale and diverse preference dataset annotated through human-machine collaboration to support Chinese RM training. We systematically evaluate open-source discriminative and generative RMs on CheemsBench and observe significant limitations in their ability to capture human preferences in Chinese scenarios. Additionally, based on CheemsPreference, we construct an RM that achieves state-of-the-art performance on CheemsBench, demonstrating the necessity of human supervision in RM training. Our findings reveal that scaled AI-generated data struggles to fully capture human preferences, emphasizing the importance of high-quality human supervision in RM development.
中文: 为解决中文奖励模型资源匮乏的问题,我们推出了人工标注的CheemsBench评估基准和通过人机协作构建的CheemsPreference数据集,研究表明仅靠人工智能生成的数据难以准确反映人类偏好,并强调了高质量人工监督在模型训练中的关键作用。
English: CheemsBench and CheemsPreference are introduced as a human-annotated benchmark and a collaborative human-machine dataset to address the scarcity of reliable Chinese reward model resources, revealing that AI-generated data alone inadequately captures human preferences and highlighting the critical need for human supervision in training effective models.

Authors:Sicheng Xie, Haidong Cao, Zejia Weng, Zhen Xing, Haoran Chen, Shiwei Shen, Jiaqi Leng, Zuxuan Wu, Yu-Gang Jiang
Title: Human2Robot: Learning Robot Actions from Paired Human-Robot Videos
Abstract:
Distilling knowledge from human demonstrations is a promising way for robots to learn and act. Existing methods, which often rely on coarsely-aligned video pairs, are typically constrained to learning global or task-level features. As a result, they tend to neglect the fine-grained frame-level dynamics required for complex manipulation and generalization to novel tasks. We posit that this limitation stems from a vicious circle of inadequate datasets and the methods they inspire. To break this cycle, we propose a paradigm shift that treats fine-grained human-robot alignment as a conditional video generation problem. To this end, we first introduce H&R, a novel third-person dataset containing 2,600 episodes of precisely synchronized human and robot motions, collected using a VR teleoperation system. We then present Human2Robot, a framework designed to leverage this data. Human2Robot employs a Video Prediction Model to learn a rich and implicit representation of robot dynamics by generating robot videos from human input, which in turn guides a decoupled action decoder. Our real-world experiments demonstrate that this approach not only achieves high performance on seen tasks but also exhibits significant one-shot generalization to novel positions, objects, instances, and even new task categories.
中文摘要:本研究提出了一种将精细人机对齐视为条件视频生成问题的新范式,通过同步数据集和视频预测模型,使机器人在复杂任务中实现卓越性能及对新任务的泛化能力。
English Summary: This study introduces a novel approach to robot learning by treating fine-grained human-robot alignment as a conditional video generation problem, using a synchronized dataset and a Video Prediction Model to achieve superior performance and generalization in complex tasks.

Authors:Liang Wang, Shaozhen Liu, Yu Rong, Deli Zhao, Qiang Liu, Shu Wu, Liang Wang
Title: MolSpectra: Pre-training 3D Molecular Representation with Multi-modal Energy Spectra
Abstract:
Establishing the relationship between 3D structures and the energy states of molecular systems has proven to be a promising approach for learning 3D molecular representations. However, existing methods are limited to modeling the molecular energy states from classical mechanics. This limitation results in a significant oversight of quantum mechanical effects, such as quantized (discrete) energy level structures, which offer a more accurate estimation of molecular energy and can be experimentally measured through energy spectra. In this paper, we propose to utilize the energy spectra to enhance the pre-training of 3D molecular representations (MolSpectra), thereby infusing the knowledge of quantum mechanics into the molecular representations. Specifically, we propose SpecFormer, a multi-spectrum encoder for encoding molecular spectra via masked patch reconstruction. By further aligning outputs from the 3D encoder and spectrum encoder using a contrastive objective, we enhance the 3D encoder's understanding of molecules. Evaluations on public benchmarks reveal that our pre-trained representations surpass existing methods in predicting molecular properties and modeling dynamics.
中文摘要:本文提出MolSpectra方法,通过SpecFormer编码器和对比学习将量子力学能谱融入三维分子表征,在分子性质预测和动力学建模方面优于现有方法。
English Summary: This paper introduces MolSpectra, a method that enhances 3D molecular representations by incorporating quantum mechanical energy spectra through SpecFormer and contrastive learning, outperforming existing approaches in molecular property prediction and dynamics modeling.

Authors:Ruiying Peng, Kaiyuan Li, Weichen Zhang, Chen Gao, Xinlei Chen, Yong Li
Title: Understanding and Evaluating Hallucinations in 3D Visual Language Models
Abstract:
Recently, 3D-LLMs, which combine point-cloud encoders with large models, have been proposed to tackle complex tasks in embodied intelligence and scene understanding. In addition to showing promising results on 3D tasks, we found that they are significantly affected by hallucinations. For instance, they may generate objects that do not exist in the scene or produce incorrect relationships between objects. To investigate this issue, this work presents the first systematic study of hallucinations in 3D-LLMs. We begin by quickly evaluating hallucinations in several representative 3D-LLMs and reveal that they are all significantly affected by hallucinations. We then define hallucinations in 3D scenes and, through a detailed analysis of datasets, uncover the underlying causes of these hallucinations. We find three main causes: (1) Uneven frequency distribution of objects in the dataset. (2) Strong correlations between objects. (3) Limited diversity in object attributes. Additionally, we propose new evaluation metrics for hallucinations, including Random Point Cloud Pair and Opposite Question Evaluations, to assess whether the model generates responses based on visual information and aligns it with the text's meaning.
中文: 近期结合点云编码器与大模型的3D-LLMs在具身智能任务中展现潜力,但存在严重幻觉问题,如生成场景中不存在的物体或错误关系;本研究首次系统性地定义了3D幻觉,通过新评估指标揭示其根源在于数据集物体分布不均、关联性强及属性多样性不足。
English: Recent 3D-LLMs combining point-cloud encoders with large models show promise in embodied intelligence but suffer from significant hallucinations, such as generating nonexistent objects or incorrect relationships, prompting the first systematic study to define, evaluate, and identify causes like dataset imbalances and limited diversity.

Authors:Yu Meng, Kaiyuan Li, Chenran Huang, Chen Gao, Xinlei Chen, Yong Li, Xiaoping Zhang
Title: PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a range of multimodal tasks. However, their inference efficiency is constrained by the large number of visual tokens processed during decoding. To address this challenge, we propose Per-Layer Per-Head Vision Token Pruning (PLPHP), a two-level fine-grained pruning method including Layer-Level Retention Rate Allocation and Head-Level Vision Token Pruning. Motivated by the Vision Token Re-attention phenomenon across decoder layers, we dynamically adjust token retention rates layer by layer. Layers that exhibit stronger attention to visual information preserve more vision tokens, while layers with lower vision attention are aggressively pruned. Furthermore, PLPHP applies pruning at the attention head level, enabling different heads within the same layer to independently retain critical context. Experiments on multiple benchmarks demonstrate that PLPHP delivers an 18% faster decoding speed and reduces the Key-Value Cache (KV Cache) size by over 50%, all at the cost of 0.46% average performance drop, while also achieving notable performance improvements in multi-image tasks. These results highlight the effectiveness of fine-grained token pruning and contribute to advancing the efficiency and scalability of LVLMs. Our source code will be made publicly available.
中文摘要:提出的PLPHP方法通过细粒度视觉令牌剪枝提升大型视觉语言模型效率,在解码速度提升18%和KV缓存减少超50%的同时,仅造成0.46%的平均性能损失。
English Summary: The proposed PLPHP method enhances Large Vision-Language Models' efficiency through fine-grained token pruning, achieving 18% faster decoding and over 50% KV Cache reduction with minimal performance impact.

Authors:Haisong Gong, Jing Li, Junfei Wu, Qiang Liu, Shu Wu, Liang Wang
Title: STRIVE: Structured Reasoning for Self-Improvement in Claim Verification
Abstract:
Claim verification is the task of determining whether a claim is supported or refuted by evidence. Self-improvement methods, where reasoning chains are generated and those leading to correct results are selected for training, have succeeded in tasks like mathematical problem solving. However, in claim verification, this approach struggles. Low-quality reasoning chains may falsely match binary truth labels, introducing faulty reasoning into the self-improvement process and ultimately degrading performance. To address this, we propose STRIVE: Structured Reasoning for Self-Improved Verification. Our method introduces a structured reasoning design with Claim Decomposition, Entity Analysis, and Evidence Grounding Verification. These components improve reasoning quality, reduce errors, and provide additional supervision signals for self-improvement. STRIVE begins with a warm-up phase, where the base model is fine-tuned on a small number of annotated examples to learn the structured reasoning design. It is then applied to generate reasoning chains for all training examples, selecting only those that are correct and structurally sound for subsequent self-improvement training. We demonstrate that STRIVE achieves significant improvements over baseline models, with a 31.4% performance gain over the base model and 20.7% over Chain of Thought on the HOVER datasets, highlighting its effectiveness.
中文:STRIVE通过引入包含主张分解、实体分析和证据验证的结构化推理方法,提升了自我改进过程中的推理质量和监督效果,在主张验证任务上相比基线模型取得了显著性能提升。
English: STRIVE introduces a structured reasoning approach with claim decomposition, entity analysis, and evidence grounding to enhance reasoning quality and supervision in self-improvement for claim verification, achieving significant performance gains over baseline models.

Authors:Wenrui Xu, Dalin Lyu, Weihang Wang, Jie Feng, Chen Gao, Yong Li
Title: Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics
Abstract:
The Theory of Multiple Intelligences underscores the hierarchical nature of cognitive capabilities. To advance Spatial Artificial Intelligence, we pioneer a psychometric framework defining five Basic Spatial Abilities (BSAs) in Visual Language Models (VLMs): Spatial Perception, Spatial Relation, Spatial Orientation, Mental Rotation, and Spatial Visualization. Benchmarking 13 mainstream VLMs through nine validated psychometric experiments reveals significant gaps versus humans (average score 24.95 vs. 68.38), with three key findings: 1) VLMs mirror human hierarchies (strongest in 2D orientation, weakest in 3D rotation) with independent BSAs (Pearson's r<0.4); 2) Smaller models such as Qwen2-VL-7B surpass larger counterparts, with Qwen leading (30.82) and InternVL2 lagging (19.6); 3) Interventions like chain-of-thought (0.100 accuracy gain) and 5-shot training (0.259 improvement) show limits from architectural constraints. Identified barriers include weak geometry encoding and missing dynamic simulation. By linking psychometric BSAs to VLM capabilities, we provide a diagnostic toolkit for spatial intelligence evaluation, methodological foundations for embodied AI development, and a cognitive science-informed roadmap for achieving human-like spatial intelligence.
中文摘要:本研究通过构建包含五种基本空间能力的心理测量框架评估视觉语言模型,发现其空间推理能力与人类存在显著差距,并识别出几何编码薄弱等关键瓶颈。
English Summary: This study establishes a psychometric framework with five Basic Spatial Abilities to evaluate Visual Language Models, revealing significant performance gaps compared to humans and identifying key limitations in spatial reasoning capabilities.

Authors:Jinghua Piao, Yuwei Yan, Jun Zhang, Nian Li, Junbo Yan, Xiaochong Lan, Zhihong Lu, Zhiheng Zheng, Jing Yi Wang, Di Zhou, Chen Gao, Fengli Xu, Fang Zhang, Ke Rong, Jun Su, Yong Li
Title: AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society
Abstract:
Understanding human behavior and society is a central focus in social sciences, with the rise of generative social science marking a significant paradigmatic shift. By leveraging bottom-up simulations, it replaces costly and logistically challenging traditional experiments with scalable, replicable, and systematic computational approaches for studying complex social dynamics. Recent advances in large language models (LLMs) have further transformed this research paradigm, enabling the creation of human-like generative social agents and realistic simulacra of society. In this paper, we propose AgentSociety, a large-scale social simulator that integrates LLM-driven agents, a realistic societal environment, and a powerful large-scale simulation engine. Based on the proposed simulator, we generate social lives for over 10k agents, simulating their 5 million interactions both among agents and between agents and their environment. Furthermore, we explore the potential of AgentSociety as a testbed for computational social experiments, focusing on four key social issues: polarization, the spread of inflammatory messages, the effects of universal basic income policies, and the impact of external shocks such as hurricanes. These four issues serve as valuable cases for assessing AgentSociety's support for typical research methods -- such as surveys, interviews, and interventions -- as well as for investigating the patterns, causes, and underlying mechanisms of social issues. The alignment between AgentSociety's outcomes and real-world experimental results not only demonstrates its ability to capture human behaviors and their underlying mechanisms, but also underscores its potential as an important platform for social scientists and policymakers.
中文:AgentSociety是一个基于大语言模型的大规模社会模拟器,通过模拟关键社会问题与现实结果的高度一致性,验证了其捕捉人类行为机制的能力,为社会科学研究提供了重要平台。
English: AgentSociety is a large-scale social simulator that uses LLM-driven agents to model complex social dynamics, demonstrating its effectiveness through realistic simulations of key societal issues and alignment with real-world outcomes.

Authors:Asen Nachkov, Danda Pani Paudel, Jan-Nico Zaech, Davide Scaramuzza, Luc Van Gool
Title: Dream to Drive: Model-Based Vehicle Control Using Analytic World Models
Abstract:
Differentiable simulators have recently shown great promise for training autonomous vehicle controllers. Being able to backpropagate through them, they can be placed into an end-to-end training loop where their known dynamics turn into useful priors for the policy to learn, removing the typical black box assumption of the environment. So far, these systems have only been used to train policies. However, this is not the end of the story in terms of what they can offer. Here, for the first time, we use them to train world models. Specifically, we present three new task setups that allow us to learn next state predictors, optimal planners, and optimal inverse states. Unlike analytic policy gradients (APG), which requires the gradient of the next simulator state with respect to the current actions, our proposed setups rely on the gradient of the next state with respect to the current state. We call this approach Analytic World Models (AWMs) and showcase its applications, including how to use it for planning in the Waymax simulator. Apart from pushing the limits of what is possible with such simulators, we offer an improved training recipe that increases performance on the large-scale Waymo Open Motion dataset by up to 12% compared to baselines at essentially no additional cost.
Chinese: 可微分模拟器现被用于通过一种名为解析世界模型(AWM)的新方法训练世界模型,该方法能够学习下一状态预测器、最优规划器和逆状态,在Waymo开放运动数据集上的性能提升高达12%,且无需额外成本。
English: Differentiable simulators are now being used to train world models through a new approach called Analytic World Models (AWMs), which enables learning next state predictors, optimal planners, and inverse states, improving performance on the Waymo Open Motion dataset by up to 12% without extra cost.

Authors:Zhitao He, Zijun Liu, Peng Li, Yi R. Fung, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Title: Advancing Language Multi-Agent Learning with Credit Re-Assignment for Interactive Environment Generalization
Abstract:
LLM-based agents have made significant advancements in interactive environments, such as mobile operations and web browsing, and other domains beyond computer using. Current multi-agent systems universally excel in performance, compared to single agents, but struggle with generalization across environments due to predefined roles and inadequate strategies for generalizing language agents. The challenge of achieving both strong performance and good generalization has hindered the progress of multi-agent systems for interactive environments. To address these issues, we propose CollabUIAgents, a multi-agent reinforcement learning framework with a novel multi-agent credit re-assignment (CR) strategy, assigning process rewards with LLMs rather than environment-specific rewards and learning with synthesized preference data, in order to foster generalizable, collaborative behaviors among the role-free agents' policies. Empirical results show that our framework improves both performance and cross-environment generalizability of multi-agent systems. Moreover, our 7B-parameter system achieves results on par with or exceed strong closed-source models, and the LLM that guides the CR. We also provide insights in using granular CR rewards effectively for environment generalization, and accommodating trained LLMs in multi-agent systems.
中文:提出的CollabUIAgents框架采用多智能体强化学习,通过基于大语言模型的新型信用再分配策略,在交互任务中提升了性能与跨环境泛化能力,并以较小模型实现了与大型模型相媲美的效果。
English: The proposed CollabUIAgents framework employs multi-agent reinforcement learning with a novel credit re-assignment strategy using LLMs to enhance both performance and cross-environment generalization in interactive tasks, achieving competitive results with smaller models.

Authors:Kechi Zhang, Ge Li, Jia Li, Yihong Dong, Jia Li, Zhi Jin
Title: Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points
Abstract:
Code generation models have shown significant potential for automating programming tasks. However, the challenge of generating accurate and reliable code persists due to the highly complex and long-reasoning nature of the task. Even state-of-the-art models often fail in code generation due to small errors, which can drastically affect the overall functionality of code. Our study identifies that current models tend to produce errors concentrated at specific error-prone points, which significantly impacts the accuracy of the generated code. To address this issue, we introduce Focused-DPO, a framework that enhances code generation by directing preference optimization towards these critical error-prone areas. This approach builds on Direct Preference Optimization, emphasizing accuracy in parts prone to errors. Additionally, we develop a method called Error-Point Identification, which constructs a dataset that targets these problematic points without requiring costly human annotations. Our experiments on benchmarks such as HumanEval(+), MBPP(+), and LiveCodeBench demonstrate that Focused-DPO significantly improves the precision and reliability of code generation, reducing common errors and enhancing overall code quality. By focusing on error-prone points, Focused-DPO advances the accuracy and functionality of model-generated code.
中文摘要:Focused-DPO框架通过针对关键易错点进行偏好优化和自动错误识别,有效提升了代码生成的准确性和可靠性,在多个基准测试中表现显著。
English Summary: Focused-DPO enhances code generation accuracy by targeting critical error-prone points through optimized preference learning and automated error identification, significantly improving reliability across multiple benchmarks.

Authors:Tianyuan Zou, Yang Liu, Peng Li, Yufei Xiong, Jianqing Zhang, Jingjing Liu, Xiaozhou Ye, Ye Ouyang, Ya-Qin Zhang
Title: Contrastive Private Data Synthesis via Weighted Multi-PLM Fusion
Abstract:
Substantial quantity and high quality are the golden rules of making a good training dataset with sample privacy protection equally important. Generating synthetic samples that resemble high-quality private data while ensuring Differential Privacy (DP), a formal privacy guarantee, promises scalability and practicality. However, existing methods relying on pre-trained models for data synthesis %that avoid fine-tuning large pre-trained generative models often struggle in data-deficient scenarios, suffering from limited sample size, inevitable generation noise and existing pre-trained model bias. To address these challenges, we propose a novel contrAstive private data Synthesis via Weighted multiple Pre-trained language models (PLM) framework, named as WASP. WASP utilizes limited private samples for more accurate private data distribution estimation via a Top-Q voting mechanism, and leverages low-quality synthetic samples for contrastive generation via collaboration among dynamically weighted multiple pre-trained models.Extensive experiments on 6 well-developed datasets with 6 open-source and 3 closed-source PLMs demonstrate the superiority of WASP in improving model performance over diverse downstream tasks. Code is available at https://anonymous.4open.science/r/WASP.
中文: 该摘要提出WASP框架,通过加权多个预训练模型和对比生成方法,在数据稀缺情况下改进私有数据合成,有效提升下游任务性能。
English: The abstract introduces WASP, a framework that enhances private data synthesis by using weighted multiple pre-trained models and a contrastive approach to overcome limitations in data-deficient scenarios, demonstrating superior performance across various tasks.

Authors:Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, Yuanxing Zhang, Zhexu Wang, Weixun Wang, Yancheng He, Ken Deng, Wangchunshu Zhou, Wenhao Huang, Zhaoxiang Zhang
Title: CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
Abstract:
The critique capacity of Large Language Models (LLMs) is essential for reasoning abilities, which can provide necessary suggestions (e.g., detailed analysis and constructive feedback). Therefore, how to evaluate the critique capacity of LLMs has drawn great attention and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1). Focusing on diverse reasoning tasks in general domains and insufficient evaluation on code tasks (e.g., only covering code generation task), where the difficulty of queries is relatively easy (e.g., the code queries of CriticBench are from Humaneval and MBPP). (2). Lacking comprehensive evaluation from different dimensions. To address these limitations, we introduce a holistic code critique benchmark for LLMs called CodeCriticBench. Specifically, our CodeCriticBench includes two mainstream code tasks (i.e., code generation and code QA) with different difficulties. Besides, the evaluation protocols include basic critique evaluation and advanced critique evaluation for different characteristics, where fine-grained evaluation checklists are well-designed for advanced settings. Finally, we conduct extensive experimental results of existing LLMs, which show the effectiveness of CodeCriticBench.
中文:CodeCriticBench通过引入针对不同难度代码任务的综合评估协议,解决了现有大语言模型批判能力基准的不足,实验结果验证了其有效性。
English: The proposed CodeCriticBench addresses limitations in existing LLM critique benchmarks by introducing comprehensive evaluation protocols for code tasks of varying difficulty, with experimental results demonstrating its effectiveness.

Authors:Yifu Ding, Wentao Jiang, Shunyu Liu, Yongcheng Jing, Jinyang Guo, Yingjie Wang, Jing Zhang, Zengmao Wang, Ziwei Liu, Bo Du, Xianglong Liu, Dacheng Tao
Title: Dynamic Parallel Tree Search for Efficient LLM Reasoning
Abstract:
Tree of Thoughts (ToT) enhances Large Language Model (LLM) reasoning by structuring problem-solving as a spanning tree. However, recent methods focus on search accuracy while overlooking computational efficiency. The challenges of accelerating the ToT lie in the frequent switching of reasoning focus, and the redundant exploration of suboptimal solutions. To alleviate this dilemma, we propose Dynamic Parallel Tree Search (DPTS), a novel parallelism framework that aims to dynamically optimize the reasoning path in inference. It includes the Parallelism Streamline in the generation phase to build up a flexible and adaptive parallelism with arbitrary paths by fine-grained cache management and alignment. Meanwhile, the Search and Transition Mechanism filters potential candidates to dynamically maintain the reasoning focus on more possible solutions and have less redundancy. Experiments on Qwen-2.5 and Llama-3 with Math500 and GSM8K datasets show that DPTS significantly improves efficiency by 2-4x on average while maintaining or even surpassing existing reasoning algorithms in accuracy, making ToT-based reasoning more scalable and computationally efficient.
中文:提出的动态并行树搜索(DPTS)框架通过并行路径生成和动态候选筛选优化了思维树推理的计算效率,在基准测试中实现2-4倍加速的同时保持了推理精度。
English: The proposed Dynamic Parallel Tree Search (DPTS) framework enhances Tree of Thoughts reasoning by optimizing computational efficiency through parallel path generation and dynamic candidate filtering, achieving 2-4x speed improvements while maintaining accuracy on benchmark tests.

Authors:P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixin Deng, Shawn Gavin, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, David Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tyshawn Hsing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Tianyang Pang, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Shanghaoran Quan, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jinyang Zhang, Xiyue Zhang, Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang
Title: SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
Abstract:
Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields-particularly in light industry, agriculture, and service-oriented disciplines-remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.
中文: 大语言模型在主流学科表现优异,但在众多专业领域评估不足,为此我们开发了SuperGPQA综合基准,通过人机协作机制揭示了当前模型与通用人工智能的显著差距,并为大规模研究提供了方法论指导。
English: Large language models show strong performance in mainstream disciplines but lack evaluation in over 200 specialized fields, prompting the creation of SuperGPQA—a comprehensive benchmark revealing significant performance gaps and providing methodological insights through human-LLM collaboration.

Authors:Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Weixiao Zhou, Yunhong Lu, Tongliang Li, Wenhao Huang, Zhoujun Li
Title: SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
Abstract:
The increasing application of multi-modal large language models (MLLMs) across various sectors have spotlighted the essence of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g. common and domain-specific knowledge). In this work, we introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate the factuality ability of MLLMs to answer natural language short questions. SimpleVQA is characterized by six key features: it covers multiple tasks and multiple scenarios, ensures high quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate. Our approach involves categorizing visual question-answering items into 9 different tasks around objective events or common knowledge and situating these within 9 topics. Rigorous quality control processes are implemented to guarantee high-quality, concise, and clear answers, facilitating evaluation with minimal variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a comprehensive assessment of leading 18 MLLMs and 8 text-only LLMs, delving into their image comprehension and text generation abilities by identifying and analyzing error cases.
中文: 本文提出了SimpleVQA,首个全面评估多模态大语言模型事实性能力的基准,涵盖多任务多场景的高质量挑战性问题,采用严格质量控制和LLM评分系统,对18个多模态模型和8个纯文本模型进行了图像理解与文本生成能力的系统评估。
English: This paper introduces SimpleVQA, a comprehensive multi-modal benchmark designed to evaluate the factuality of multi-modal large language models (MLLMs) through diverse tasks and scenarios, featuring rigorous quality control and an LLM-as-a-judge scoring system to assess 18 MLLMs and 8 text-only LLMs.

Authors:Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
Title: Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
Abstract:
Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. We also deliver a comprehensive cookbook on how to develop the multilingual model in an efficient manner, including five key aspects: data curation, pre-training, post-training, model customization and evaluation. We hope that Sailor2 model (Apache 2.0 license) will drive language development in the SEA region, and Sailor2 cookbook will inspire researchers to build more inclusive LLMs for other under-served languages.
中文: Sailor2是面向东南亚语言的先进多语言模型系列,在性能上可与GPT-4o媲美,并附带完整开发指南以促进包容性语言AI发展。
English: Sailor2 is a family of advanced multilingual models for Southeast Asian languages, achieving competitive performance against GPT-4o and including a comprehensive development guide to foster inclusive language AI.

Authors:Junyang Wang, Haiyang Xu, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Jitao Sang
Title: Mobile-Agent-V: A Video-Guided Approach for Effortless and Efficient Operational Knowledge Injection in Mobile Automation
Abstract:
The exponential rise in mobile device usage necessitates streamlined automation for effective task management, yet many AI frameworks fall short due to inadequate operational expertise. While manually written knowledge can bridge this gap, it is often burdensome and inefficient. We introduce Mobile-Agent-V, an innovative framework that utilizes video as a guiding tool to effortlessly and efficiently inject operational knowledge into mobile automation processes. By deriving knowledge directly from video content, Mobile-Agent-V eliminates manual intervention, significantly reducing the effort and time required for knowledge acquisition. To rigorously evaluate this approach, we propose Mobile-Knowledge, a benchmark tailored to assess the impact of external knowledge on mobile agent performance. Our experimental findings demonstrate that Mobile-Agent-V enhances performance by 36% compared to existing methods, underscoring its effortless and efficient advantages in mobile automation.
Chinese: Mobile-Agent-V 提出了一种以视频为引导的框架,能够直接从视频内容中获取操作知识,无需人工干预,并将移动自动化性能相比现有方法提升了 36%。
English: Mobile-Agent-V introduces a video-guided framework that automatically derives operational knowledge from videos, eliminating manual input and boosting mobile automation performance by 36% over current methods.

Authors:Quanjun Zhang, Chunrong Fang, Yi Zheng, Yaxin Zhang, Yuan Zhao, Rubing Huang, Jianyi Zhou, Yun Yang, Tao Zheng, Zhenyu Chen
Title: Improving Deep Assertion Generation via Fine-Tuning Retrieval-Augmented Pre-trained Language Models
Abstract:
Unit testing validates the correctness of the units of the software system under test and serves as the cornerstone in improving software quality and reliability. To reduce manual efforts in writing unit tests, some techniques have been proposed to automatically generate test assertions, with recent integration-based approaches considered state-of-the-art. Despite being promising, such integration-based approaches face several limitations, including reliance on lexical matching for assertion retrieval and a limited training corpus for assertion generation. This paper proposes a novel retrieval-augmented deep assertion generation approach, namely RetriGen, based on a hybrid retriever and a pre-trained language model (PLM)-based generator. Given a focal-test, RetriGen first builds a hybrid assertion retriever to search for the most relevant Test-Assert Pair from external codebases. The retrieval process considers lexical similarity and semantical similarity via a token-based and an embedding-based retriever, respectively. RetriGen then treats assertion generation as a sequence-to-sequence task and designs a PLM-based assertion generator to predict a correct assertion. We conduct extensive experiments to evaluate RetriGen against six state-of-the-art approaches across two large-scale datasets and two metrics. The results demonstrate that RetriGen achieves 57.66% accuracy and 73.24% CodeBLEU, outperforming all baselines with average improvements of 50.66% and 14.14%, respectively.
中文: 本文提出RetriGen,一种结合混合检索器和预训练语言模型的新方法,用于自动生成单元测试断言,在准确性和CodeBLEU指标上显著优于现有技术。
English: This paper introduces RetriGen, a novel retrieval-augmented approach that combines a hybrid retriever and a pre-trained language model to automatically generate unit test assertions, significantly outperforming existing methods in accuracy and CodeBLEU scores.

Authors:Weisong Sun, Yuchen Chen, Mengzhe Yuan, Chunrong Fang, Zhenpeng Chen, Chong Wang, Yang Liu, Baowen Xu, Zhenyu Chen
Title: Show Me Your Code! Kill Code Poisoning: A Lightweight Method Based on Code Naturalness
Abstract:
Neural code models (NCMs) have demonstrated extraordinary capabilities in code intelligence tasks. Meanwhile, the security of NCMs and NCMs-based systems has garnered increasing attention. In particular, NCMs are often trained on large-scale data from potentially untrustworthy sources, providing attackers with the opportunity to manipulate them by inserting crafted samples into the data. This type of attack is called a code poisoning attack (also known as a backdoor attack). It allows attackers to implant backdoors in NCMs and thus control model behavior, which poses a significant security threat. However, there is still a lack of effective techniques for detecting various complex code poisoning attacks. In this paper, we propose an innovative and lightweight technique for code poisoning detection named KillBadCode. KillBadCode is designed based on our insight that code poisoning disrupts the naturalness of code. Specifically, KillBadCode first builds a code language model (CodeLM) on a lightweight $n$-gram language model. Then, given poisoned data, KillBadCode utilizes CodeLM to identify those tokens in (poisoned) code snippets that will make the code snippets more natural after being deleted as trigger tokens. Considering that the removal of some normal tokens in a single sample might also enhance code naturalness, leading to a high false positive rate (FPR), we aggregate the cumulative improvement of each token across all samples. Finally, KillBadCode purifies the poisoned data by removing all poisoned samples containing the identified trigger tokens. The experimental results on two code poisoning attacks and four code intelligence tasks demonstrate that KillBadCode significantly outperforms four baselines. More importantly, KillBadCode is very efficient, with a minimum time consumption of only 5 minutes, and is 25 times faster than the best baseline on average.
中文: 神经代码模型面临通过污染训练数据实施的代码投毒攻击威胁,而新型轻量级检测技术KillBadCode通过分析代码自然性扰动,能有效识别并清除被投毒样本,在检测效率和性能上显著优于现有方法。
English: Neural code models face security threats from code poisoning attacks that manipulate training data, but the proposed lightweight detection technique KillBadCode effectively identifies and removes poisoned samples by analyzing code naturalness disruptions, demonstrating superior efficiency and performance over existing methods.

Authors:Quanjun Zhang, Chunrong Fang, Yi Zheng, Ruixiang Qian, Shengcheng Yu, Yuan Zhao, Jianyi Zhou, Yun Yang, Tao Zheng, Zhenyu Chen
Title: Improving Retrieval-Augmented Deep Assertion Generation via Joint Training
Abstract:
Unit testing attempts to validate the correctness of basic units of the software system under test and has a crucial role in software development and testing. Very recent work proposes a retrieve-and-edit approach to generate unit test oracles, i.e., assertions. Despite being promising, it is still far from perfect due to some limitations, such as splitting assertion retrieval and generation into two separate components without benefiting each other. In this paper, we propose AG-RAG, a retrieval-augmented automated assertion generation approach that leverages external codebases and joint training to address various technical limitations of prior work. Inspired by the plastic surgery hypothesis, AG-RAG attempts to combine relevant unit tests and advanced pre-trained language models (PLMs) with retrieval-augmented fine-tuning. AG-RAG builds a dense retriever to search for relevant test-assert pairs (TAPs) with semantic matching and a retrieval-augmented generator to synthesize accurate assertions with the focal-test and retrieved TAPs as input. Besides, AG-RAG leverages a code-aware language model CodeT5 as the cornerstone to facilitate both assertion retrieval and generation tasks. Furthermore, the retriever is optimized in conjunction with the generator as a whole pipeline with a joint training strategy. This unified design fully adapts both components specifically for retrieving more useful TAPs, thereby generating accurate assertions. We extensively evaluate AG-RAG against six state-of-the-art AG approaches on two benchmarks and three metrics. Experimental results show that AG-RAG significantly outperforms previous AG approaches on all benchmarks and metrics, e.g., improving the most recent baseline EditAS by 20.82% and 26.98% in terms of accuracy. AG-RAG also correctly generates 1739 and 2866 unique assertions that all baselines fail to generate, 3.45X and 9.20X more than EditAS.
中文: 本文提出AG-RAG方法,通过检索增强的自动断言生成技术,结合联合训练策略利用外部代码库,有效解决了先前单元测试生成方法的局限性,在准确性和独特断言生成数量上均显著优于现有最佳方法。
English: This paper introduces AG-RAG, a retrieval-augmented automated assertion generation approach that integrates joint training and leverages external codebases to overcome limitations in prior unit test generation methods, demonstrating significant improvements in accuracy and unique assertion generation over existing techniques.

Authors:Ziqing Yang, Yixin Wu, Rui Wen, Michael Backes, Yang Zhang
Title: Peering Behind the Shield: Guardrail Identification in Large Language Models
Abstract:
Human-AI conversations have gained increasing attention since the era of large language models. Consequently, more techniques, such as input/output guardrails and safety alignment, are proposed to prevent potential misuse of such Human-AI conversations. However, the ability to identify these guardrails has significant implications, both for adversarial exploitation and for auditing purposes by red team operators. In this work, we propose a novel method, AP-Test, which identifies the presence of a candidate guardrail by leveraging guardrail-specific adversarial prompts to query the AI agent. Extensive experiments of four candidate guardrails under diverse scenarios showcase the effectiveness of our method. The ablation study further illustrates the importance of the components we designed, such as the loss terms.
中文: AP-Test方法通过对抗性提示有效检测AI防护栏,在多种场景下展现出高效性,并验证了其设计组件的重要性。
English: The AP-Test method effectively detects AI guardrails using adversarial prompts, demonstrating high accuracy across various scenarios and highlighting the importance of its designed components.

Authors:Yixin Wu, Ziqing Yang, Yun Shen, Michael Backes, Yang Zhang
Title: Synthetic Artifact Auditing: Tracing LLM-Generated Synthetic Data Usage in Downstream Applications
Abstract:
Large language models (LLMs) have facilitated the generation of high-quality, cost-effective synthetic data for developing downstream models and conducting statistical analyses in various domains. However, the increased reliance on synthetic data may pose potential negative impacts. Numerous studies have demonstrated that LLM-generated synthetic data can perpetuate and even amplify societal biases and stereotypes, and produce erroneous outputs known as ``hallucinations'' that deviate from factual knowledge. In this paper, we aim to audit artifacts, such as classifiers, generators, or statistical plots, to identify those trained on or derived from synthetic data and raise user awareness, thereby reducing unexpected consequences and risks in downstream applications. To this end, we take the first step to introduce synthetic artifact auditing to assess whether a given artifact is derived from LLM-generated synthetic data. We then propose an auditing framework with three methods including metric-based auditing, tuning-based auditing, and classification-based auditing. These methods operate without requiring the artifact owner to disclose proprietary training details. We evaluate our auditing framework on three text classification tasks, two text summarization tasks, and two data visualization tasks across three training scenarios. Our evaluation demonstrates the effectiveness of all proposed auditing methods across all these tasks. For instance, black-box metric-based auditing can achieve an average accuracy of $0.868 \pm 0.071$ for auditing classifiers and $0.880 \pm 0.052$ for auditing generators using only 200 random queries across three scenarios. We hope our research will enhance model transparency and regulatory compliance, ensuring the ethical and responsible use of synthetic data.
中文:本文提出合成制品审计框架,用于检测模型或输出是否源自大语言模型生成的合成数据,旨在减轻偏见放大和幻觉等风险,确保数据使用的伦理合规。
English: This paper introduces a synthetic artifact auditing framework to detect if models or outputs are derived from LLM-generated synthetic data, aiming to mitigate risks like bias amplification and hallucinations while ensuring ethical data usage.

Authors:Tianqi Zhang, Zheng Wu, Yuxin Chen, Yixiao Wang, Boyuan Liang, Scott Moura, Masayoshi Tomizuka, Mingyu Ding, Wei Zhan
Title: Physics-Aware Robotic Palletization with Online Masking Inference
Abstract:
The efficient planning of stacking boxes, especially in the online setting where the sequence of item arrivals is unpredictable, remains a critical challenge in modern warehouse and logistics management. Existing solutions often address box size variations, but overlook their intrinsic and physical properties, such as density and rigidity, which are crucial for real-world applications. We use reinforcement learning (RL) to solve this problem by employing action space masking to direct the RL policy toward valid actions. Unlike previous methods that rely on heuristic stability assessments which are difficult to assess in physical scenarios, our framework utilizes online learning to dynamically train the action space mask, eliminating the need for manual heuristic design. Extensive experiments demonstrate that our proposed method outperforms existing state-of-the-arts. Furthermore, we deploy our learned task planner in a real-world robotic palletizer, validating its practical applicability in operational settings.
中文: 我们采用强化学习和动作空间掩码的方法,通过动态训练掩码来整合密度和刚性等物理属性,有效解决了在线箱体堆叠难题,其性能优于现有技术,并在实际机器人应用中验证了实用性。
English: Our reinforcement learning approach with action space masking efficiently addresses the online box stacking challenge by dynamically training the mask to incorporate physical properties like density and rigidity, outperforming existing methods and proving effective in real-world robotic applications.

Authors:Jinluan Yang, Dingnan Jin, Anke Tang, Li Shen, Didi Zhu, Zhengyu Chen, Ziyu Zhao, Daixin Wang, Qing Cui, Zhiqiang Zhang, Jun Zhou, Fei Wu, Kun Kuang
Title: Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging
Abstract:
Achieving balanced alignment of large language models (LLMs) in terms of Helpfulness, Honesty, and Harmlessness (3H optimization) constitutes a cornerstone of responsible AI. Existing methods like data mixture strategies face limitations, including heavy reliance on expert knowledge and conflicting optimization signals. While model merging offers parameter-level conflict-resolution strategies through integrating specialized models' parameters, its potential for 3H optimization remains underexplored. This paper systematically compares the effectiveness of model merging and data mixture methods in constructing 3H-aligned LLMs for the first time, revealing previously overlooked collaborative and conflict relationships among the 3H dimensions and discussing the advantages and drawbacks of data mixture (\textit{data-level}) and model merging (\textit{parameter-level}) methods in mitigating the conflict for balanced 3H optimization. Specially, we propose a novel \textbf{R}eweighting \textbf{E}nhanced task \textbf{S}ingular \textbf{M}erging method, \textbf{RESM}, through outlier weighting and sparsity-aware rank selection strategies to address the challenges of preference noise accumulation and layer sparsity adaptation inherent in 3H-aligned LLM merging. Extensive evaluations can verify the effectiveness and robustness of RESM compared to previous data mixture (2\%-5\% gain) and model merging (1\%-3\% gain) methods in achieving balanced LLM alignment. We release our models through \href{https://huggingface.co/Jinluan}{3H\_Merging} for further investigations.
中文摘要:本文提出RESM方法,通过处理偏好噪声和层级稀疏性来优化大型语言模型在有益性、诚实性和无害性上的平衡对齐,相比现有数据混合和模型合并方法分别提升2%-5%和1%-3%的性能。
English Summary: This paper introduces RESM, a novel model merging method that enhances balanced alignment of large language models across Helpfulness, Honesty, and Harmlessness by addressing preference noise and layer sparsity, outperforming existing data mixture and model merging approaches with 2%-5% and 1%-3% gains respectively.

Authors:Fen Liu, Shenghai Yuan, Wei Meng, Rong Su, Lihua Xie
Title: Non-cooperative Stochastic Target Encirclement by Anti-synchronization Control via Range-only Measurement
Abstract:
This paper investigates the stochastic moving target encirclement problem in a realistic setting. In contrast to typical assumptions in related works, the target in our work is non-cooperative and capable of escaping the circle containment by boosting its speed to maximum for a short duration. Considering the extreme environment, such as GPS denial, weight limit, and lack of ground guidance, two agents can only rely on their onboard single-modality perception tools to measure the distances to the target. The distance measurement allows for creating a position estimator by providing a target position-dependent variable. Furthermore, the construction of the unique distributed anti-synchronization controller (DASC) can guarantee that the two agents track and encircle the target swiftly. The convergence of the estimator and controller is rigorously evaluated using the Lyapunov technique. A real-world UAV-based experiment is conducted to illustrate the performance of the proposed methodology in addition to a simulated Matlab numerical sample. Our video demonstration can be found in the URL https://youtu.be/JXu1gib99yQ.
本文针对随机移动目标包围问题,提出了一种分布式反同步控制器和位置估计器,在GPS拒止环境下通过李雅普诺夫分析和真实无人机实验验证了其有效性。
This paper addresses the stochastic moving target encirclement problem using a distributed anti-synchronization controller and a position estimator, validated through Lyapunov analysis and real-world UAV experiments under GPS-denied conditions.

Authors:Hongyi Chen, Jingtao Ding, Jianhai Shu, Xinchun Yu, Xiaojun Liang, Yong Li, Xiao-Ping Zhang
Title: Sample-efficient diffusion-based control of complex nonlinear systems
Abstract:
Complex nonlinear system control faces challenges in achieving sample-efficient, reliable performance. While diffusion-based methods have demonstrated advantages over classical and reinforcement learning approaches in long-term control performance, they are limited by sample efficiency. This paper presents SEDC (Sample-Efficient Diffusion-based Control), a novel diffusion-based control framework addressing three core challenges: high-dimensional state-action spaces, nonlinear system dynamics, and the gap between non-optimal training data and near-optimal control solutions. Through three innovations - Decoupled State Diffusion, Dual-Mode Decomposition, and Guided Self-finetuning - SEDC achieves 39.5\%-49.4\% better control accuracy than baselines while using only 10\% of the training samples, as validated across three complex nonlinear dynamic systems. Our approach represents a significant advancement in sample-efficient control of complex nonlinear systems. The implementation of the code can be found at https://anonymous.4open.science/r/DIFOCON-C019.
中文: 本文提出的SEDC框架通过解耦状态扩散、双模分解和引导自微调三项创新,在仅使用10%训练样本的情况下,将控制精度提升39.5%-49.4%,有效解决了高维空间、非线性动态及数据与控制间差距三大核心难题。
English: This paper introduces SEDC, a diffusion-based control framework that enhances sample efficiency and achieves 39.5%-49.4% higher control accuracy using only 10% of training samples through three innovations addressing high-dimensional spaces, nonlinear dynamics, and data-to-control gaps.

Authors:Zhi Sheng, Yuan Yuan, Yudi Zhang, Jingtao Ding, Yong Li
Title: Collaborative Deterministic-Probabilistic Forecasting for Diverse Spatiotemporal Systems
Abstract:
Probabilistic forecasting is crucial for real-world spatiotemporal systems, such as climate, energy, and urban environments, where quantifying uncertainty is essential for informed, risk-aware decision-making. While diffusion models have shown promise in capturing complex data distributions, their application to spatiotemporal forecasting remains limited due to complex spatiotemporal dynamics and high computational demands. we propose CoST, a general forecasting framework that collaborates deterministic and diffusion models for diverse spatiotemporal systems. CoST formulates a mean-residual decomposition strategy: it leverages a powerful deterministic model to capture the conditional mean and a lightweight diffusion model to learn residual uncertainties. This collaborative formulation simplifies learning objectives, improves accuracy and efficiency, and generalizes across diverse spatiotemporal systems. To address spatial heterogeneity, we further design a scale-aware diffusion mechanism to guide the diffusion process. Extensive experiments across ten real-world datasets from climate, energy, communication, and urban systems show that CoST achieves 25\% performance gains over state-of-the-art baselines, while significantly reducing computational cost.
中文: CoST框架通过确定性模型与轻量扩散模型的协作,采用均值-残差分解策略处理时空预测问题,在十个真实数据集上实现25%的性能提升并显著降低计算成本。
English: CoST is a collaborative spatiotemporal forecasting framework that combines deterministic models for capturing conditional means with lightweight diffusion models for learning residual uncertainties, achieving 25% performance gains and reduced computational costs across diverse real-world systems.

Authors:Alex Jinpeng Wang, Dongxing Mao, Jiawei Zhang, Weiming Han, Zhuobai Dong, Linjie Li, Yiqi Lin, Zhengyuan Yang, Libo Qin, Fuwei Zhang, Lijuan Wang, Min Li
Title: TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Abstract:
Text-conditioned image generation has gained significant attention in recent years and are processing increasingly longer and comprehensive text prompt. In everyday life, dense and intricate text appears in contexts like advertisements, infographics, and signage, where the integration of both text and visuals is essential for conveying complex information. However, despite these advances, the generation of images containing long-form text remains a persistent challenge, largely due to the limitations of existing datasets, which often focus on shorter and simpler text. To address this gap, we introduce TextAtlas5M, a novel dataset specifically designed to evaluate long-text rendering in text-conditioned image generation. Our dataset consists of 5 million long-text generated and collected images across diverse data types, enabling comprehensive evaluation of large-scale generative models on long-text image generation. We further curate 3000 human-improved test set TextAtlasEval across 3 data domains, establishing one of the most extensive benchmarks for text-conditioned generation. Evaluations suggest that the TextAtlasEval benchmarks present significant challenges even for the most advanced proprietary models (e.g. GPT4o with DallE-3), while their open-source counterparts show an even larger performance gap. These evidences position TextAtlas5M as a valuable dataset for training and evaluating future-generation text-conditioned image generation models.
中文: TextAtlas5M数据集专为解决长文本图像生成的难题而设计,包含500万张多样化图像和精选测试集,对现有先进模型构成显著挑战,是未来文本条件图像生成模型训练与评估的重要资源。
English: The TextAtlas5M dataset is introduced to address the challenge of generating images with long-form text, providing 5 million diverse images and a curated test set that poses significant difficulties for current advanced models.

Authors:Meet Udeshi, Minghao Shao, Haoran Xi, Nanda Rani, Kimberly Milner, Venkata Sai Charan Putrevu, Brendan Dolan-Gavitt, Sandeep Kumar Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Ramesh Karri, Muhammad Shafique
Title: D-CIPHER: Dynamic Collaborative Intelligent Multi-Agent System with Planner and Heterogeneous Executors for Offensive Security
Abstract:
Large Language Models (LLMs) have been used in cybersecurity such as autonomous security analysis or penetration testing. Capture the Flag (CTF) challenges serve as benchmarks to assess automated task-planning abilities of LLM agents for cybersecurity. Early attempts to apply LLMs for solving CTF challenges used single-agent systems, where feedback was restricted to a single reasoning-action loop. This approach was inadequate for complex CTF tasks. Inspired by real-world CTF competitions, where teams of experts collaborate, we introduce the D-CIPHER LLM multi-agent framework for collaborative CTF solving. D-CIPHER integrates agents with distinct roles with dynamic feedback loops to enhance reasoning on complex tasks. It introduces the Planner-Executor agent system, consisting of a Planner agent for overall problem-solving along with multiple heterogeneous Executor agents for individual tasks, facilitating efficient allocation of responsibilities among the agents. Additionally, D-CIPHER incorporates an Auto-prompter agent to improve problem-solving by auto-generating a highly relevant initial prompt. We evaluate D-CIPHER on multiple CTF benchmarks and LLM models via comprehensive studies to highlight the impact of our enhancements. Additionally, we manually map the CTFs in NYU CTF Bench to MITRE ATT&CK techniques that apply for a comprehensive evaluation of D-CIPHER's offensive security capability. D-CIPHER achieves state-of-the-art performance on three benchmarks: 22.0% on NYU CTF Bench, 22.5% on Cybench, and 44.0% on HackTheBox, which is 2.5% to 8.5% better than previous work. D-CIPHER solves 65% more ATT&CK techniques compared to previous work, demonstrating stronger offensive capability.
中文:D-CIPHER框架采用多智能体协作系统,通过动态反馈循环和角色分工显著提升大语言模型在网络安全夺旗挑战中的表现,在多项基准测试中创下最优成绩。
English: The D-CIPHER framework introduces a multi-agent system with specialized roles and dynamic feedback loops to enhance LLM performance in cybersecurity CTF challenges, achieving state-of-the-art results across multiple benchmarks.

Authors:Suqin Yuan, Lei Feng, Bo Han, Tongliang Liu
Title: Enhancing Sample Selection Against Label Noise by Cutting Mislabeled Easy Examples
Abstract:
Sample selection is a prevalent approach in learning with noisy labels, aiming to identify confident samples for training. Although existing sample selection methods have achieved decent results by reducing the noise rate of the selected subset, they often overlook that not all mislabeled examples harm the model's performance equally. In this paper, we demonstrate that mislabeled examples correctly predicted by the model early in the training process are particularly harmful to model performance. We refer to these examples as Mislabeled Easy Examples (MEEs). To address this, we propose Early Cutting, which introduces a recalibration step that employs the model's later training state to re-select the confident subset identified early in training, thereby avoiding misleading confidence from early learning and effectively filtering out MEEs. Experiments on the CIFAR, WebVision, and full ImageNet-1k datasets demonstrate that our method effectively improves sample selection and model performance by reducing MEEs.
中文: 样本选择方法在处理含噪声标签的学习中常忽视误标样本危害程度的差异,尤其早期被正确预测的误标易例(MEEs)会损害模型性能,因此提出Early Cutting方法,利用后期训练状态重新校准早期识别的置信子集,有效过滤MEEs,在CIFAR和ImageNet-1k等数据集上验证了其有效性。
English: Sample selection methods in learning with noisy labels often overlook the varying harm of mislabeled examples, particularly Mislabeled Easy Examples (MEEs) that are correctly predicted early but damage performance, prompting the proposed Early Cutting method to recalibrate confident subsets using later training states and effectively filter out MEEs, as validated on datasets like CIFAR and ImageNet-1k.

Authors:Suqin Yuan, Runqi Lin, Lei Feng, Bo Han, Tongliang Liu
Title: Instance-dependent Early Stopping
Abstract:
In machine learning practice, early stopping has been widely used to regularize models and can save computational costs by halting the training process when the model's performance on a validation set stops improving. However, conventional early stopping applies the same stopping criterion to all instances without considering their individual learning statuses, which leads to redundant computations on instances that are already well-learned. To further improve the efficiency, we propose an Instance-dependent Early Stopping (IES) method that adapts the early stopping mechanism from the entire training set to the instance level, based on the core principle that once the model has mastered an instance, the training on it should stop. IES considers an instance as mastered if the second-order differences of its loss value remain within a small range around zero. This offers a more consistent measure of an instance's learning status compared with directly using the loss value, and thus allows for a unified threshold to determine when an instance can be excluded from further backpropagation. We show that excluding mastered instances from backpropagation can increase the gradient norms, thereby accelerating the decrease of the training loss and speeding up the training process. Extensive experiments on benchmarks demonstrate that IES method can reduce backpropagation instances by 10%-50% while maintaining or even slightly improving the test accuracy and transfer learning performance of a model.
中文: 提出的实例相关早停方法通过根据单个样本损失稳定情况自适应停止反向传播,在保持模型性能的同时将计算量减少10%-50%,从而显著提升训练效率。
English: The proposed Instance-dependent Early Stopping (IES) method improves training efficiency by adaptively halting backpropagation for individual instances once their loss stabilization indicates mastery, reducing computations by 10%-50% while preserving model performance.

Authors:Siqi Shen, Yu Liu, Daniel Biggs, Omar Hafez, Jiandong Yu, Wentao Zhang, Bin Cui, Jiulong Shan
Title: Transfer learning in Scalable Graph Neural Network for Improved Physical Simulation
Abstract:
In recent years, Graph Neural Network (GNN) based models have shown promising results in simulating physics of complex systems. However, training dedicated graph network based physics simulators can be costly, as most models are confined to fully supervised training, which requires extensive data generated from traditional physics simulators. To date, how transfer learning could improve the model performance and training efficiency has remained unexplored. In this work, we introduce a pre-training and transfer learning paradigm for graph network simulators. We propose the scalable graph U-net (SGUNET). Incorporating an innovative depth-first search (DFS) pooling, the SGUNET is adaptable to different mesh sizes and resolutions for various simulation tasks. To enable the transfer learning between differently configured SGUNETs, we propose a set of mapping functions to align the parameters between the pre-trained model and the target model. An extra normalization term is also added into the loss to constrain the difference between the pre-trained weights and target model weights for better generalization performance. To pre-train our physics simulator we created a dataset which includes 20,000 physical simulations of randomly selected 3D shapes from the open source A Big CAD (ABC) dataset. We show that our proposed transfer learning methods allow the model to perform even better when fine-tuned with small amounts of training data than when it is trained from scratch with full extensive dataset. On the 2D Deformable Plate benchmark dataset, our pre-trained model fine-tuned on 1/16 of the training data achieved an 11.05\% improvement in position RMSE compared to the model trained from scratch.
中文: 本研究提出了一种采用深度优先搜索池化的可扩展图U-net物理模拟器,通过迁移学习在少量微调数据下实现了优于全监督训练的性能表现。
English: This study introduces a scalable graph U-net with DFS pooling for physics simulation, enabling efficient transfer learning that outperforms full supervised training with minimal fine-tuning data.

Authors:Yu Bo, Weian Mao, Yanjun Shao, Weiqiang Bai, Peng Ye, Xinzhu Ma, Junbo Zhao, Hao Chen, Chunhua Shen
Title: Revisiting Convolution Architecture in the Realm of DNA Foundation Models
Abstract:
In recent years, a variety of methods based on Transformer and state space model (SSM) architectures have been proposed, advancing foundational DNA language models. However, there is a lack of comparison between these recent approaches and the classical architecture convolutional networks (CNNs) on foundation model benchmarks. This raises the question: are CNNs truly being surpassed by these recent approaches based on transformer and SSM architectures? In this paper, we develop a simple but well-designed CNN-based method termed ConvNova. ConvNova identifies and proposes three effective designs: 1) dilated convolutions, 2) gated convolutions, and 3) a dual-branch framework for gating mechanisms. Through extensive empirical experiments, we demonstrate that ConvNova significantly outperforms recent methods on more than half of the tasks across several foundation model benchmarks. For example, in histone-related tasks, ConvNova exceeds the second-best method by an average of 5.8%, while generally utilizing fewer parameters and enabling faster computation. In addition, the experiments observed findings that may be related to biological characteristics. This indicates that CNNs are still a strong competitor compared to Transformers and SSMs. We anticipate that this work will spark renewed interest in CNN-based methods for DNA foundation models.
中文: 本文提出的ConvNova这一精心设计的CNN方法在多个基础模型基准测试中超越了近期基于Transformer和SSM的方法,表明卷积神经网络在DNA语言模型中仍是强有力的竞争者。
English: This paper introduces ConvNova, a well-designed CNN-based method that outperforms recent Transformer and SSM approaches on multiple foundation model benchmarks, demonstrating CNNs remain strong competitors in DNA language modeling.

Authors:Canyu Zhao, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen
Title: DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
Abstract:
Our primary goal here is to create a good, generalist perception model that can tackle multiple tasks, within limits on computational resources and training data. To achieve this, we resort to text-to-image diffusion models pre-trained on billions of images. Our exhaustive evaluation metrics demonstrate that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models. We achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates the outputs of various perception tasks using color encoding; and we show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation. Unifying various perception tasks as conditional image generation enables us to fully leverage pre-trained text-to-image models. Thus, DICEPTION can be efficiently trained at a cost of orders of magnitude lower, compared to conventional models that were trained from scratch. When adapting our model to other tasks, it only requires fine-tuning on as few as 50 images and 1% of its parameters. DICEPTION provides valuable insights and a more promising solution for visual generalist models. Homepage: https://aim-uofa.github.io/Diception, Huggingface Demo: https://huggingface.co/spaces/Canyu/Diception-Demo.
中文: 本文提出了DICEPTION视觉通用模型,通过利用预训练扩散模型并保留其先验知识,能以极少数据和计算资源高效处理多种感知任务。
English: This paper introduces DICEPTION, a robust visual generalist model that efficiently tackles multiple perception tasks with minimal data and computational resources by leveraging pre-trained diffusion models while preserving their prior knowledge.

Authors:Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Yao Mu, Hongyuan Zhang, Wenqi Shao, Ping Luo
Title: Text2World: Benchmarking Large Language Models for Symbolic World Model Generation
Abstract:
Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on planning domain definition language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test-time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models. The project page is available at https://text-to-world.github.io/.
中文: 研究者提出基于PDDDL的Text2World基准,通过多维度执行指标解决大语言模型构建世界模型时的评估缺陷,发现强化学习训练的逻辑模型表现最优但仍存局限,同时提出扩展策略并为后续研究奠定基础。
English: Researchers introduce Text2World, a PDDL-based benchmark addressing evaluation challenges in LLM-generated world models, revealing that reinforcement learning-trained reasoning models perform best but still have limitations, while proposing enhancement strategies and establishing a foundation for future research.

Authors:Xiaoyuan Li, Moxin Li, Rui Men, Yichang Zhang, Keqin Bao, Wenjie Wang, Fuli Feng, Dayiheng Liu, Junyang Lin
Title: HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning
Abstract:
Large language models (LLMs) have shown remarkable capabilities in commonsense reasoning; however, some variations in questions can trigger incorrect responses. Do these models truly understand commonsense knowledge, or just memorize expression patterns? To investigate this question, we present the first extensive robustness evaluation of LLMs in commonsense reasoning. We introduce HellaSwag-Pro, a large-scale bilingual benchmark consisting of 11,200 cases, by designing and compiling seven types of question variants. To construct this benchmark, we propose a two-stage method to develop Chinese HellaSwag, a finely annotated dataset comprising 12,000 instances across 56 categories. We conduct extensive experiments on 41 representative LLMs, revealing that these LLMs are far from robust in commonsense reasoning. Furthermore, this robustness varies depending on the language in which the LLM is tested. This work establishes a high-quality evaluation benchmark, with extensive experiments offering valuable insights to the community in commonsense reasoning for LLMs.
中文: 本研究通过推出包含11,200个案例的双语基准HellaSwag-Pro,评估大语言模型的常识推理鲁棒性,发现模型在不同语言和问题变体中的表现存在显著差异。
English: This study evaluates the robustness of large language models in commonsense reasoning by introducing HellaSwag-Pro, a bilingual benchmark with 11,200 cases, revealing that models perform inconsistently across languages and question variations.

Authors:Xinyu Lin, Haihan Shi, Wenjie Wang, Fuli Feng, Qifan Wang, See-Kiong Ng, Tat-Seng Chua
Title: Order-agnostic Identifier for Large Language Model-based Generative Recommendation
Abstract:
Leveraging Large Language Models (LLMs) for generative recommendation has attracted significant research interest, where item tokenization is a critical step. It involves assigning item identifiers for LLMs to encode user history and generate the next item. Existing approaches leverage either token-sequence identifiers, representing items as discrete token sequences, or single-token identifiers, using ID or semantic embeddings. Token-sequence identifiers face issues such as the local optima problem in beam search and low generation efficiency due to step-by-step generation. In contrast, single-token identifiers fail to capture rich semantics or encode Collaborative Filtering (CF) information, resulting in suboptimal performance. To address these issues, we propose two fundamental principles for item identifier design: 1) integrating both CF and semantic information to fully capture multi-dimensional item information, and 2) designing order-agnostic identifiers without token dependency, mitigating the local optima issue and achieving simultaneous generation for generation efficiency. Accordingly, we introduce a novel set identifier paradigm for LLM-based generative recommendation, representing each item as a set of order-agnostic tokens. To implement this paradigm, we propose SETRec, which leverages CF and semantic tokenizers to obtain order-agnostic multi-dimensional tokens. To eliminate token dependency, SETRec uses a sparse attention mask for user history encoding and a query-guided generation mechanism for simultaneous token generation. We instantiate SETRec on T5 and Qwen (from 1.5B to 7B). Extensive experiments demonstrate its effectiveness under various scenarios (e.g., full ranking, warm- and cold-start ranking, and various item popularity groups). Moreover, results validate SETRec's superior efficiency and show promising scalability on cold-start items as model sizes increase.
中文摘要:本文提出SETRec,一种基于大语言模型的生成式推荐系统,通过采用结合协同过滤和语义信息的无序集合标识符解决现有项目标识方法的局限性,在多种场景下实现了更好的性能与效率。
English Summary: This paper introduces SETRec, a novel generative recommendation system using Large Language Models that addresses limitations of existing item identifier methods by employing order-agnostic set identifiers combining collaborative filtering and semantic information, achieving improved performance and efficiency across various scenarios.

Authors:Yi Fang, Wenjie Wang, Yang Zhang, Fengbin Zhu, Qifan Wang, Fuli Feng, Xiangnan He
Title: Reason4Rec: Large Language Models for Recommendation with Deliberative User Preference Alignment
Abstract:
While recent advancements in aligning Large Language Models (LLMs) with recommendation tasks have shown great potential and promising performance overall, these aligned recommendation LLMs still face challenges in complex scenarios. This is primarily due to the current alignment approach focusing on optimizing LLMs to generate user feedback directly, without incorporating deliberation. To overcome this limitation and develop more reliable LLMs for recommendations, we propose a new Deliberative Recommendation task, which incorporates explicit reasoning about user preferences as an additional alignment goal. We then introduce the Reasoning-powered Recommender framework for deliberative user preference alignment, designed to enhance reasoning capabilities by utilizing verbalized user feedback in a step-wise manner to tackle this task. The framework employs collaborative step-wise experts and tailored training strategies for each expert. Experimental results across three real-world datasets demonstrate the rationality of the deliberative task formulation and the superior performance of the proposed framework in improving both prediction accuracy and reasoning quality.
中文: 针对当前推荐大语言模型在复杂场景中的局限性,本研究提出了审议式推荐任务及推理驱动的推荐框架,通过逐步处理用户反馈来增强推理能力,在多个数据集上验证了其在预测准确性和推理质量方面的优越表现。
English: To address the limitations of current recommendation LLMs in complex scenarios, this study introduces a Deliberative Recommendation task and a Reasoning-powered Recommender framework that enhances reasoning capabilities through step-wise processing of user feedback, demonstrating superior performance in both accuracy and reasoning quality across multiple datasets.

Authors:Weixiang Zhao, Yulin Hu, Yang Deng, Jiahe Guo, Xingyu Sui, Xinyang Han, An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
Title: Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs
Abstract:
Role-playing enables large language models (LLMs) to engage users in immersive and personalized interactions, but it also introduces significant safety risks. Existing role-play fine-tuning techniques improve role adaptability but may degrade safety performance, particularly for villainous characters. In this work, we conduct the first comprehensive assessment of role-play fine-tuning risks by training 95 role-specific LLMs using RoleBench. Our experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance, with safety risks varying based on character traits. To tackle this challenge, we propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety. Extensive experiments on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and Qwen2.5-7B-Instruct demonstrate that SaRFT consistently outperforms state-of-the-art baselines under both LoRA and full-parameter fine-tuning settings. Our findings highlight the necessity of role-adaptive safety measures and provide insights into mitigating role-specific safety risks in role-playing LLMs.
Chinese: 角色扮演微调虽提升大语言模型的角色适应性,却显著降低安全性,尤其反派角色风险更高;为此提出的安全感知角色扮演微调方法(SaRFT)能在多模型中有效平衡角色扮演能力与安全防护。
English: Role-playing fine-tuning in large language models enhances character adaptability but compromises safety, particularly with villainous roles, prompting the development of Safety-Aware Role-Play Fine-Tuning (SaRFT) to effectively balance performance and security across multiple models.

Authors:Siyu Jiao, Gengwei Zhang, Yinlong Qian, Jiancheng Huang, Yao Zhao, Humphrey Shi, Lin Ma, Yunchao Wei, Zequn Jie
Title: FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction
Abstract:
This work challenges the residual prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm. FlexVAR facilitates autoregressive learning with ground-truth prediction, enabling each step to independently produce plausible images. This simple, intuitive approach swiftly learns visual distributions and makes the generation process more flexible and adaptable. Trained solely on low-resolution images ($\leq$ 256px), FlexVAR can: (1) Generate images of various resolutions and aspect ratios, even exceeding the resolution of the training images. (2) Support various image-to-image tasks, including image refinement, in/out-painting, and image expansion. (3) Adapt to various autoregressive steps, allowing for faster inference with fewer steps or enhancing image quality with more steps. Our 1.0B model outperforms its VAR counterpart on the ImageNet 256$\times$256 benchmark. Moreover, when zero-shot transfer the image generation process with 13 steps, the performance further improves to 2.08 FID, outperforming state-of-the-art autoregressive models AiM/VAR by 0.25/0.28 FID and popular diffusion models LDM/DiT by 1.52/0.19 FID, respectively. When transferring our 1.0B model to the ImageNet 512$\times$512 benchmark in a zero-shot manner, FlexVAR achieves competitive results compared to the VAR 2.3B model, which is a fully supervised model trained at 512$\times$512 resolution.
中文摘要:FlexVAR提出了一种灵活的视觉自回归范式,通过真实预测实现多分辨率图像生成和多样化任务处理,在多项基准测试中超越了现有模型的性能。
English Summary: FlexVAR introduces a flexible visual autoregressive paradigm that enables ground-truth prediction for generating diverse images across resolutions and tasks, outperforming existing models in benchmarks.

Authors:Yucheng Shi, Tianze Yang, Canyu Chen, Quanzheng Li, Tianming Liu, Xiang Li, Ninghao Liu
Title: SearchRAG: Can Search Engines Be Helpful for LLM-based Medical Question Answering?
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in general domains but often struggle with tasks requiring specialized knowledge. Conventional Retrieval-Augmented Generation (RAG) techniques typically retrieve external information from static knowledge bases, which can be outdated or incomplete, missing fine-grained clinical details essential for accurate medical question answering. In this work, we propose SearchRAG, a novel framework that overcomes these limitations by leveraging real-time search engines. Our method employs synthetic query generation to convert complex medical questions into search-engine-friendly queries and utilizes uncertainty-based knowledge selection to filter and incorporate the most relevant and informative medical knowledge into the LLM's input. Experimental results demonstrate that our method significantly improves response accuracy in medical question answering tasks, particularly for complex questions requiring detailed and up-to-date knowledge.
中文摘要:提出的SearchRAG框架通过实时搜索引擎和合成查询生成,为大型语言模型提供最新、详细的临床知识,显著提升了复杂医学问题回答的准确性。
English Summary: The proposed SearchRAG framework enhances medical question answering by using real-time search engines and synthetic query generation to provide LLMs with current, detailed clinical knowledge, significantly improving accuracy for complex queries.

Authors:Yang Zhao, Li Du, Xiao Ding, Yangou Ouyang, Hepeng Wang, Kai Xiong, Jinglong Gao, Zhouhao Sun, Dongliang Xu, Yang Qing, Dongchen Li, Bing Qin, Ting Liu
Title: Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection
Abstract:
Large language models (LLMs) have shown great potential across various industries due to their remarkable ability to generalize through instruction tuning. However, the limited availability of domain-specific data significantly hampers their performance on specialized tasks. While existing methods primarily focus on selecting training data from general datasets that are similar to the target domain, they often fail to consider the joint distribution of instructions, resulting in inefficient learning and suboptimal knowledge transfer. To address these challenges, we introduce G2IS (Gradient-based Graph Instruction Selection), a novel method that constructs a mixed gradient-based instruction graph to capture the joint distribution and interdependencies between instructions. By accounting for the relationships between instructions, G2IS improves domain adaptation efficiency. Additionally, we propose a gradient walk algorithm to refine the data selection process, enhancing both training effectiveness and efficiency. Our experiments demonstrate that G2IS outperforms traditional methods across various domain adaptation tasks, yielding significant performance gains, particularly in complex, data-scarce scenarios. These results underscore the potential of G2IS in advancing the development of large, domain-specific models.
中文摘要:G2IS是一种基于梯度的新方法,通过捕捉指令的联合分布并优化数据选择,显著提升了大型语言模型在领域适应中的性能,尤其在数据稀缺场景下表现优异。
English Summary: G2IS is a novel gradient-based method that improves domain adaptation for large language models by capturing the joint distribution of instructions and refining data selection, achieving superior performance in data-scarce scenarios.

Authors:Yujie Zhou, Jiazi Bu, Pengyang Ling, Pan Zhang, Tong Wu, Qidong Huang, Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Anyi Rao, Jiaqi Wang, Li Niu
Title: Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
Abstract:
Recent advancements in image relighting models, driven by large-scale datasets and pre-trained diffusion models, have enabled the imposition of consistent lighting. However, video relighting still lags, primarily due to the excessive training costs and the scarcity of diverse, high-quality video relighting datasets. A simple application of image relighting models on a frame-by-frame basis leads to several issues: lighting source inconsistency and relighted appearance inconsistency, resulting in flickers in the generated videos. In this work, we propose Light-A-Video, a training-free approach to achieve temporally smooth video relighting. Adapted from image relighting models, Light-A-Video introduces two key techniques to enhance lighting consistency. First, we design a Consistent Light Attention (CLA) module, which enhances cross-frame interactions within the self-attention layers of the image relight model to stabilize the generation of the background lighting source. Second, leveraging the physical principle of light transport independence, we apply linear blending between the source video's appearance and the relighted appearance, using a Progressive Light Fusion (PLF) strategy to ensure smooth temporal transitions in illumination. Experiments show that Light-A-Video improves the temporal consistency of relighted video while maintaining the relighted image quality, ensuring coherent lighting transitions across frames. Project page: https://bujiazi.github.io/light-a-video.github.io/.
中文摘要:Light-A-Video是一种无需训练的解决方案,通过引入一致性光照注意力模块和渐进式光照融合策略,有效提升视频重照明的时序连贯性,在消除闪烁的同时保持画面质量。
English Summary: Light-A-Video is a training-free method that enhances video relighting consistency by introducing a Consistent Light Attention module and Progressive Light Fusion strategy, addressing flickering issues while preserving image quality.

Authors:Xiao Yu, Yan Fang, Yao Zhao, Yunchao Wei
Title: IPSeg: Image Posterior Mitigates Semantic Drift in Class-Incremental Segmentation
Abstract:
Class incremental learning aims to enable models to learn from sequential, non-stationary data streams across different tasks without catastrophic forgetting. In class incremental semantic segmentation (CISS), the semantic content of image pixels evolves over incremental phases, known as semantic drift. In this work, we identify two critical challenges in CISS that contribute to semantic drift and degrade performance. First, we highlight the issue of separate optimization, where different parts of the model are optimized in distinct incremental stages, leading to misaligned probability scales. Second, we identify noisy semantics arising from inappropriate pseudo-labeling, which results in sub-optimal results. To address these challenges, we propose a novel and effective approach, Image Posterior and Semantics Decoupling for Segmentation (IPSeg). IPSeg introduces two key mechanisms: (1) leveraging image posterior probabilities to align optimization across stages and mitigate the effects of separate optimization, and (2) employing semantics decoupling to handle noisy semantics and tailor learning strategies for different semantics. Extensive experiments on the Pascal VOC 2012 and ADE20K datasets demonstrate that IPSeg achieves superior performance compared to state-of-the-art methods, particularly in challenging long-term incremental scenarios.
中文摘要:本研究提出IPSeg方法,通过图像后验概率对齐阶段间优化和语义解耦处理噪声语义,有效解决类别增量语义分割中的语义漂移问题,在基准数据集上实现了最优性能。
English Summary: The study introduces IPSeg, a novel approach for class incremental semantic segmentation that addresses semantic drift by aligning optimization across stages with image posterior probabilities and employing semantics decoupling to handle noisy semantics, achieving state-of-the-art results on benchmark datasets.

Authors:Yuheng Zhang, Dian Yu, Tao Ge, Linfeng Song, Zhichen Zeng, Haitao Mi, Nan Jiang, Dong Yu
Title: Improving LLM General Preference Alignment via Optimistic Online Mirror Descent
Abstract:
Reinforcement learning from human feedback (RLHF) has demonstrated remarkable effectiveness in aligning large language models (LLMs) with human preferences. Many existing alignment approaches rely on the Bradley-Terry (BT) model assumption, which assumes the existence of a ground-truth reward for each prompt-response pair. However, this assumption can be overly restrictive when modeling complex human preferences. In this paper, we drop the BT model assumption and study LLM alignment under general preferences, formulated as a two-player game. Drawing on theoretical insights from learning in games, we integrate optimistic online mirror descent into our alignment framework to approximate the Nash policy. Theoretically, we demonstrate that our approach achieves an $O(T^{-1})$ bound on the duality gap, improving upon the previous $O(T^{-1/2})$ result. More importantly, we implement our method and show through experiments that it outperforms state-of-the-art RLHF algorithms across multiple representative benchmarks.
中文: 本文提出了一种新颖的人类反馈强化学习方法,摒弃了限制性的Bradley-Terry模型假设,将大语言模型对齐构建为双人博弈框架,不仅在理论上取得突破,更在多个基准测试中超越了现有最优算法的表现。
English: This paper proposes a novel reinforcement learning from human feedback approach that abandons the restrictive Bradley-Terry model assumption and formulates LLM alignment as a two-player game, achieving both theoretical improvements and superior experimental performance over existing methods.

Authors:Ruining Deng, Tianyuan Yao, Yucheng Tang, Junlin Guo, Siqi Lu, Juming Xiong, Lining Yu, Quan Huu Cap, Pengzhou Cai, Libin Lan, Ze Zhao, Adrian Galdran, Amit Kumar, Gunjan Deotale, Dev Kumar Das, Inyoung Paik, Joonho Lee, Geongyu Lee, Yujia Chen, Wangkai Li, Zhaoyang Li, Xuege Hou, Zeyuan Wu, Shengjin Wang, Maximilian Fischer, Lars Kramer, Anghong Du, Le Zhang, Maria Sanchez Sanchez, Helena Sanchez Ulloa, David Ribalta Heredia, Carlos Perez de Arenaza Garcia, Shuoyu Xu, Bingdou He, Xinping Cheng, Tao Wang, Noemie Moreau, Katarzyna Bozek, Shubham Innani, Ujjwal Baid, Kaura Solomon Kefas, Bennett A. Landman, Yu Wang, Shilin Zhao, Mengmeng Yin, Haichun Yang, Yuankai Huo
Title: KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level
Abstract:
Chronic kidney disease (CKD) is a major global health issue, affecting over 10% of the population and causing significant mortality. While kidney biopsy remains the gold standard for CKD diagnosis and treatment, the lack of comprehensive benchmarks for kidney pathology segmentation hinders progress in the field. To address this, we organized the Kidney Pathology Image Segmentation (KPIs) Challenge, introducing a dataset that incorporates preclinical rodent models of CKD with over 10,000 annotated glomeruli from 60+ Periodic Acid Schiff (PAS)-stained whole slide images. The challenge includes two tasks, patch-level segmentation and whole slide image segmentation and detection, evaluated using the Dice Similarity Coefficient (DSC) and F1-score. By encouraging innovative segmentation methods that adapt to diverse CKD models and tissue conditions, the KPIs Challenge aims to advance kidney pathology analysis, establish new benchmarks, and enable precise, large-scale quantification for disease research and diagnosis.
中文: KPIs挑战赛通过引入包含多种慢性肾病模型和超1万个标注肾小球的数据集,旨在推动肾脏病理分割技术的创新,建立新标准以实现精准的疾病研究与诊断。
English: The KPIs Challenge introduces a comprehensive dataset with over 10,000 annotated glomeruli from diverse CKD models to advance kidney pathology segmentation through innovative methods, aiming to establish new benchmarks for precise disease analysis and diagnosis.

Authors:Juming Xiong, Muyang Li, Ruining Deng, Tianyuan Yao, Shunxing Bao, Regina N Tyree, Girish Hiremath, Yuankai Huo
Title: Enhanced Feature-based Image Stitching for Endoscopic Videos in Pediatric Eosinophilic Esophagitis
Abstract:
Video endoscopy represents a major advance in the investigation of gastrointestinal diseases. Reviewing endoscopy videos often involves frequent adjustments and reorientations to piece together a complete view, which can be both time-consuming and prone to errors. Image stitching techniques address this issue by providing a continuous and complete visualization of the examined area. However, endoscopic images, particularly those of the esophagus, present unique challenges. The smooth surface, lack of distinct feature points, and non-horizontal orientation complicate the stitching process, rendering traditional feature-based methods often ineffective for these types of images. In this paper, we propose a novel preprocessing pipeline designed to enhance endoscopic image stitching through advanced computational techniques. Our approach converts endoscopic video data into continuous 2D images by following four key steps: (1) keyframe selection, (2) image rotation adjustment to correct distortions, (3) surface unwrapping using polar coordinate transformation to generate a flat image, and (4) feature point matching enhanced by Adaptive Histogram Equalization for improved feature detection. We evaluate stitching quality through the assessment of valid feature point match pairs. Experiments conducted on 20 pediatric endoscopy videos demonstrate that our method significantly improves image alignment and stitching quality compared to traditional techniques, laying a robust foundation for more effective panoramic image creation.
中文: 本文提出了一种新颖的预处理流程,通过校正畸变和增强特征检测来改进内窥镜图像拼接,在图像对齐和质量上显著优于传统方法。
English: This paper introduces a novel preprocessing pipeline that enhances endoscopic image stitching by correcting distortions and improving feature detection, significantly outperforming traditional methods in alignment and quality.

Authors:Juming Xiong, Hou Xiong, Quan Liu, Ruining Deng, Regina N Tyree, Girish Hiremath, Yuankai Huo
Title: Expanding Training Data for Endoscopic Phenotyping of Eosinophilic Esophagitis
Abstract:
Eosinophilic esophagitis (EoE) is a chronic esophageal disorder marked by eosinophil-dominated inflammation. Diagnosing EoE usually involves endoscopic inspection of the esophageal mucosa and obtaining esophageal biopsies for histologic confirmation. Recent advances have seen AI-assisted endoscopic imaging, guided by the EREFS system, emerge as a potential alternative to reduce reliance on invasive histological assessments. Despite these advancements, significant challenges persist due to the limited availability of data for training AI models - a common issue even in the development of AI for more prevalent diseases. This study seeks to improve the performance of deep learning-based EoE phenotype classification by augmenting our training data with a diverse set of images from online platforms, public datasets, and electronic textbooks increasing our dataset from 435 to 7050 images. We utilized the Data-efficient Image Transformer for image classification and incorporated attention map visualizations to boost interpretability. The findings show that our expanded dataset and model enhancements improved diagnostic accuracy, robustness, and comprehensive analysis, enhancing patient outcomes.
中文: 本研究通过将训练数据集从435张图像扩充至7050张多样化图像,并采用数据高效图像变换器结合注意力图谱,显著提升了嗜酸粒细胞性食管炎的深度学习分类性能,从而改善了诊断准确性和患者预后。
English: This study enhances deep learning-based classification of eosinophilic esophagitis by expanding the training dataset from 435 to 7050 diverse images and using a Data-efficient Image Transformer with attention maps, resulting in improved diagnostic accuracy and patient outcomes.

Authors:Zhaoyi Li, Gangwei Jiang, Chenwang Wu, Ying Wei, Defu Lian, Enhong Chen
Title: Learning to Substitute Components for Compositional Generalization
Abstract:
Despite the rising prevalence of neural language models, recent empirical evidence suggests their deficiency in compositional generalization. One of the current de-facto solutions to this problem is compositional data augmentation, which aims to introduce additional compositional inductive bias. However, existing handcrafted augmentation strategies offer limited improvement when systematic generalization of neural language models requires multi-grained compositional bias (i.e., not limited to either lexical or structural biases alone) or when training sentences have an imbalanced difficulty distribution. To address these challenges, we first propose a novel compositional augmentation strategy called Component Substitution (CompSub), which enables multi-grained composition of substantial substructures across the entire training set. Furthermore, we introduce the Learning Component Substitution (LCS) framework. This framework empowers the learning of component substitution probabilities in CompSub in an end-to-end manner by maximizing the loss of neural language models, thereby prioritizing challenging compositions with elusive concepts and novel contexts. We extend the key ideas of CompSub and LCS to the recently emerging in-context learning scenarios of pre-trained large language models (LLMs), proposing the LCS-ICL algorithm to enhance the few-shot compositional generalization of state-of-the-art (SOTA) LLMs. Theoretically, we provide insights into why applying our algorithms to language models can improve compositional generalization performance. Empirically, our results on four standard compositional generalization benchmarks(SCAN, COGS, GeoQuery, and COGS-QL) demonstrate the superiority of CompSub, LCS, and LCS-ICL, with improvements of up to 66.5%, 10.3%, 1.4%, and 8.8%, respectively.
中文: 本文提出组件替换(CompSub)和学习组件替换(LCS)框架,通过多粒度数据增强和端到端学习替换概率来提升神经语言模型的组合泛化能力,在四个基准测试中取得了显著改进。
English: This paper introduces Component Substitution (CompSub) and the Learning Component Substitution (LCS) framework to enhance neural language models' compositional generalization through multi-grained data augmentation and end-to-end learning of substitution probabilities, achieving significant improvements on four benchmarks.

Authors:Alvaro Becerra, Roberto Daza, Ruth Cobos, Aythami Morales, Julian Fierrez
Title: M2LADS Demo: A System for Generating Multimodal Learning Analytics Dashboards
Abstract:
We present a demonstration of a web-based system called M2LADS ("System for Generating Multimodal Learning Analytics Dashboards"), designed to integrate, synchronize, visualize, and analyze multimodal data recorded during computer-based learning sessions with biosensors. This system presents a range of biometric and behavioral data on web-based dashboards, providing detailed insights into various physiological and activity-based metrics. The multimodal data visualized include electroencephalogram (EEG) data for assessing attention and brain activity, heart rate metrics, eye-tracking data to measure visual attention, webcam video recordings, and activity logs of the monitored tasks. M2LADS aims to assist data scientists in two key ways: (1) by providing a comprehensive view of participants' experiences, displaying all data categorized by the activities in which participants are engaged, and (2) by synchronizing all biosignals and videos, facilitating easier data relabeling if any activity information contains errors.
中文: M2LADS是一个基于网络的系统,可集成并可视化多模态数据,包括脑电图、心率、眼动追踪和活动日志,为研究人员提供全面洞察并促进同步数据分析。
English: M2LADS is a web-based system that integrates and visualizes multimodal data, including EEG, heart rate, eye-tracking, and activity logs, to provide comprehensive insights and facilitate synchronized data analysis for researchers.

Authors:Hao Wang, Wei Guo, Luankang Zhang, Jin Yao Chin, Yufei Ye, Huifeng Guo, Yong Liu, Defu Lian, Ruiming Tang, Enhong Chen
Title: Generative Large Recommendation Models: Emerging Trends in LLMs for Recommendation
Abstract:
In the era of information overload, recommendation systems play a pivotal role in filtering data and delivering personalized content. Recent advancements in feature interaction and user behavior modeling have significantly enhanced the recall and ranking processes of these systems. With the rise of large language models (LLMs), new opportunities have emerged to further improve recommendation systems. This tutorial explores two primary approaches for integrating LLMs: LLMs-enhanced recommendations, which leverage the reasoning capabilities of general LLMs, and generative large recommendation models, which focus on scaling and sophistication. While the former has been extensively covered in existing literature, the latter remains underexplored. This tutorial aims to fill this gap by providing a comprehensive overview of generative large recommendation models, including their recent advancements, challenges, and potential research directions. Key topics include data quality, scaling laws, user behavior mining, and efficiency in training and inference. By engaging with this tutorial, participants will gain insights into the latest developments and future opportunities in the field, aiding both academic research and practical applications. The timely nature of this exploration supports the rapid evolution of recommendation systems, offering valuable guidance for researchers and practitioners alike.
中文摘要:本教程旨在填补生成式大型推荐模型的研究空白,探讨其最新进展、数据质量与扩展性等挑战,以及未来研究方向,以推动推荐系统的发展。
English Summary: This tutorial addresses the gap in research on generative large recommendation models by examining their advancements, challenges like data quality and scaling, and future directions to enhance recommendation systems.

Authors:Jiayi Zhang, Ziheng Liu, Yiyang Zhu, Enyu Shi, Bokai Xu, Chau Yuen, Dusit Niyato, Mérouane Debbah, Shi Jin, Bo Ai, Xuemin, Shen
Title: Multi-Agent Reinforcement Learning in Wireless Distributed Networks for 6G
Abstract:
The introduction of intelligent interconnectivity between the physical and human worlds has attracted great attention for future sixth-generation (6G) networks, emphasizing massive capacity, ultra-low latency, and unparalleled reliability. Wireless distributed networks and multi-agent reinforcement learning (MARL), both of which have evolved from centralized paradigms, are two promising solutions for the great attention. Given their distinct capabilities, such as decentralization and collaborative mechanisms, integrating these two paradigms holds great promise for unleashing the full power of 6G, attracting significant research and development attention. This paper provides a comprehensive study on MARL-assisted wireless distributed networks for 6G. In particular, we introduce the basic mathematical background and evolution of wireless distributed networks and MARL, as well as demonstrate their interrelationships. Subsequently, we analyze different structures of wireless distributed networks from the perspectives of homogeneous and heterogeneous. Furthermore, we introduce the basic concepts of MARL and discuss two typical categories, including model-based and model-free. We then present critical challenges faced by MARL-assisted wireless distributed networks, providing important guidance and insights for actual implementation. We also explore an interplay between MARL-assisted wireless distributed networks and emerging techniques, such as information bottleneck and mirror learning, delivering in-depth analyses and application scenarios. Finally, we outline several compelling research directions for future MARL-assisted wireless distributed networks.
中文摘要:本文全面研究了多智能体强化学习与无线分布式网络在6G中的融合,通过应对关键挑战并探索与新兴技术的协同作用,为释放6G潜力提供重要指导和研究方向。
English Summary: This paper comprehensively studies the integration of multi-agent reinforcement learning (MARL) with wireless distributed networks to unlock 6G's potential by addressing challenges and exploring synergies with emerging technologies.

Authors:Ruiyang Ren, Yuhao Wang, Junyi Li, Jinhao Jiang, Wayne Xin Zhao, Wenjie Wang, Tat-Seng Chua
Title: Holistically Guided Monte Carlo Tree Search for Intricate Information Seeking
Abstract:
In the era of vast digital information, the sheer volume and heterogeneity of available information present significant challenges for intricate information seeking. Users frequently face multistep web search tasks that involve navigating vast and varied data sources. This complexity demands every step remains comprehensive, accurate, and relevant. However, traditional search methods often struggle to balance the need for localized precision with the broader context required for holistic understanding, leaving critical facets of intricate queries underexplored. In this paper, we introduce an LLM-based search assistant that adopts a new information seeking paradigm with holistically guided Monte Carlo tree search (HG-MCTS). We reformulate the task as a progressive information collection process with a knowledge memory and unite an adaptive checklist with multi-perspective reward modeling in MCTS. The adaptive checklist provides explicit sub-goals to guide the MCTS process toward comprehensive coverage of complex user queries. Simultaneously, our multi-perspective reward modeling offers both exploration and retrieval rewards, along with progress feedback that tracks completed and remaining sub-goals, refining the checklist as the tree search progresses. By striking a balance between localized tree expansion and global guidance, HG-MCTS reduces redundancy in search paths and ensures that all crucial aspects of an intricate query are properly addressed. Extensive experiments on real-world intricate information seeking tasks demonstrate that HG-MCTS acquires thorough knowledge collections and delivers more accurate final responses compared with existing baselines.
中文: 本文提出了一种整体引导的蒙特卡洛树搜索方法(HG-MCTS),通过自适应检查清单和多视角奖励机制,在复杂多步骤网络搜索中实现全面覆盖与精准响应,实验证明其优于现有基线方法。
English: This paper introduces a holistically guided Monte Carlo tree search (HG-MCTS) method that uses an adaptive checklist and multi-perspective rewards to comprehensively address complex multistep web searches, outperforming existing approaches by ensuring all query aspects are covered with minimal redundancy.

Authors:Cheng He, Xu Huang, Gangwei Jiang, Zhaoyi Li, Defu Lian, Hong Xie, Enhong Chen, Xijie Liang, Zengrong Zheng
Title: General Time-series Model for Universal Knowledge Representation of Multivariate Time-Series data
Abstract:
Universal knowledge representation is a central problem for multivariate time series(MTS) foundation models and yet remains open. This paper investigates this problem from the first principle and it makes four folds of contributions. First, a new empirical finding is revealed: time series with different time granularities (or corresponding frequency resolutions) exhibit distinct joint distributions in the frequency domain. This implies a crucial aspect of learning universal knowledge, one that has been overlooked by previous studies. Second, a novel Fourier knowledge attention mechanism is proposed to enable learning time granularity-aware representations from both the temporal and frequency domains. Third, an autoregressive blank infilling pre-training framework is incorporated to time series analysis for the first time, leading to a generative tasks agnostic pre-training strategy. To this end, we develop the General Time-series Model (GTM), a unified MTS foundation model that addresses the limitation of contemporary time series models, which often require token, pre-training, or model-level customizations for downstream tasks adaption. Fourth, extensive experiments show that GTM outperforms state-of-the-art (SOTA) methods across all generative tasks, including long-term forecasting, anomaly detection, and imputation.
中文摘要:本文提出了通用时间序列模型(GTM),通过傅里叶知识注意力机制和自回归预训练框架,首次实现了无需下游任务适配的统一多变量时序基础模型,在长期预测、异常检测等生成任务中全面超越现有最优方法。
English Summary: This paper introduces the General Time-series Model (GTM), a universal foundation model for multivariate time series that leverages Fourier knowledge attention and autoregressive pre-training to outperform state-of-the-art methods across multiple tasks without requiring task-specific adaptations.

Authors:Yufei Wei, Sha Lu, Wangtao Lu, Rong Xiong, Yue Wang
Title: BEV-DWPVO: BEV-based Differentiable Weighted Procrustes for Low Scale-drift Monocular Visual Odometry on Ground
Abstract:
Monocular Visual Odometry (MVO) provides a cost-effective, real-time positioning solution for autonomous vehicles. However, MVO systems face the common issue of lacking inherent scale information from monocular cameras. Traditional methods have good interpretability but can only obtain relative scale and suffer from severe scale drift in long-distance tasks. Learning-based methods under perspective view leverage large amounts of training data to acquire prior knowledge and estimate absolute scale by predicting depth values. However, their generalization ability is limited due to the need to accurately estimate the depth of each point. In contrast, we propose a novel MVO system called BEV-DWPVO. Our approach leverages the common assumption of a ground plane, using Bird's-Eye View (BEV) feature maps to represent the environment in a grid-based structure with a unified scale. This enables us to reduce the complexity of pose estimation from 6 Degrees of Freedom (DoF) to 3-DoF. Keypoints are extracted and matched within the BEV space, followed by pose estimation through a differentiable weighted Procrustes solver. The entire system is fully differentiable, supporting end-to-end training with only pose supervision and no auxiliary tasks. We validate BEV-DWPVO on the challenging long-sequence datasets NCLT, Oxford, and KITTI, achieving superior results over existing MVO methods on most evaluation metrics.
中文:提出的BEV-DWPVO系统通过鸟瞰图表示法解决单目视觉里程计的尺度不确定性问题,将位姿估计简化为三自由度并实现仅需位姿监督的端到端训练,在多个挑战性数据集上展现出优越性能。
English: The proposed BEV-DWPVO system addresses monocular visual odometry's scale ambiguity by using bird's-eye view representations to reduce pose estimation complexity and achieve end-to-end training with pose supervision alone, demonstrating superior performance on multiple challenging datasets.

Authors:Dongkun Zhang, Jiaming Liang, Ke Guo, Sha Lu, Qi Wang, Rong Xiong, Zhenwei Miao, Yue Wang
Title: CarPlanner: Consistent Auto-regressive Trajectory Planning for Large-scale Reinforcement Learning in Autonomous Driving
Abstract:
Trajectory planning is vital for autonomous driving, ensuring safe and efficient navigation in complex environments. While recent learning-based methods, particularly reinforcement learning (RL), have shown promise in specific scenarios, RL planners struggle with training inefficiencies and managing large-scale, real-world driving scenarios. In this paper, we introduce \textbf{CarPlanner}, a \textbf{C}onsistent \textbf{a}uto-\textbf{r}egressive \textbf{Planner} that uses RL to generate multi-modal trajectories. The auto-regressive structure enables efficient large-scale RL training, while the incorporation of consistency ensures stable policy learning by maintaining coherent temporal consistency across time steps. Moreover, CarPlanner employs a generation-selection framework with an expert-guided reward function and an invariant-view module, simplifying RL training and enhancing policy performance. Extensive analysis demonstrates that our proposed RL framework effectively addresses the challenges of training efficiency and performance enhancement, positioning CarPlanner as a promising solution for trajectory planning in autonomous driving. To the best of our knowledge, we are the first to demonstrate that the RL-based planner can surpass both IL- and rule-based state-of-the-arts (SOTAs) on the challenging large-scale real-world dataset nuPlan. Our proposed CarPlanner surpasses RL-, IL-, and rule-based SOTA approaches within this demanding dataset.
中文摘要:CarPlanner是一种基于强化学习的轨迹规划新方法,通过自回归结构和一致性机制解决了训练效率问题,并在真实自动驾驶数据集中超越了现有最优方法。
English Summary: CarPlanner is a novel RL-based trajectory planning method that uses an auto-regressive structure with consistency mechanisms to overcome training inefficiencies and outperform existing approaches on real-world autonomous driving datasets.

Authors:Yancheng He, Shilong Li, Jiaheng Liu, Weixun Wang, Xingyuan Bu, Ge Zhang, Zhongyuan Peng, Zhaoxiang Zhang, Zhicheng Zheng, Wenbo Su, Bo Zheng
Title: Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Abstract:
Recently, o1-like models have drawn significant attention, where these models produce the long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the qualities of these long CoTs and measure the critique abilities of existing LLMs on these long CoTs, we introduce the DeltaBench, including the generated long CoTs from different o1-like models (e.g., QwQ, DeepSeek-R1) for different reasoning tasks (e.g., Math, Code, General Reasoning), to measure the ability to detect errors in long CoT reasoning. Based on DeltaBench, we first perform fine-grained analysis of the generated long CoTs to discover the effectiveness and efficiency of different o1-like models. Then, we conduct extensive evaluations of existing process reward models (PRMs) and critic models to detect the errors of each annotated process, which aims to investigate the boundaries and limitations of existing PRMs and critic models. Finally, we hope that DeltaBench could guide developers to better understand the long CoT reasoning abilities of their models.
中文: 本文提出DeltaBench基准,用于评估o1类模型的长思维链推理能力,并测试现有大语言模型在长推理链中检测错误的能力,通过细粒度分析和评估为模型开发提供指导。
English: This paper introduces DeltaBench, a benchmark for evaluating o1-like models' long Chain-of-Thought reasoning and assessing existing LLMs' ability to detect errors in such reasoning, aiming to guide model development through detailed analysis and evaluation.

Authors:Borui Liao, Yulong Xu, Jiao Ou, Kaiyuan Yang, Weihua Jian, Pengfei Wan, Di Zhang
Title: FlexDuo: A Pluggable System for Enabling Full-Duplex Capabilities in Speech Dialogue Systems
Abstract:
Full-Duplex Speech Dialogue Systems (Full-Duplex SDS) have significantly enhanced the naturalness of human-machine interaction by enabling real-time bidirectional communication. However, existing approaches face challenges such as difficulties in independent module optimization and contextual noise interference due to highly coupled architectural designs and oversimplified binary state modeling. This paper proposes FlexDuo, a flexible full-duplex control module that decouples duplex control from spoken dialogue systems through a plug-and-play architectural design. Furthermore, inspired by human information-filtering mechanisms in conversations, we introduce an explicit Idle state. On one hand, the Idle state filters redundant noise and irrelevant audio to enhance dialogue quality. On the other hand, it establishes a semantic integrity-based buffering mechanism, reducing the risk of mutual interruptions while ensuring accurate response transitions. Experimental results on the Fisher corpus demonstrate that FlexDuo reduces the false interruption rate by 24.9% and improves response accuracy by 7.6% compared to integrated full-duplex dialogue system baselines. It also outperforms voice activity detection (VAD) controlled baseline systems in both Chinese and English dialogue quality. The proposed modular architecture and state-based dialogue model provide a novel technical pathway for building flexible and efficient duplex dialogue systems.
中文摘要:FlexDuo通过即插即用架构和显式空闲状态设计,在过滤噪音的同时建立语义缓冲机制,将误中断率降低24.9%,响应准确率提升7.6%,为全双工对话系统提供了灵活高效的解决方案。
English Summary: FlexDuo introduces a plug-and-play full-duplex control module with an explicit Idle state that reduces false interruptions by 24.9% and improves response accuracy by 7.6% by filtering noise and establishing semantic buffering.

Authors:Qinghe Wang, Yawen Luo, Xiaoyu Shi, Xu Jia, Huchuan Lu, Tianfan Xue, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai
Title: CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation
Abstract:
In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with comparable controllability as professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware conditional signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals--comprising rendered depth maps, camera trajectories and object class labels--serve as the guidance for a text-to-video diffusion model, ensuring to generate the user-intended video content. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and implements prominent 3D-aware text-to-video generation. Project page: https://cinemaster-dev.github.io/.
中文: CineMaster是一个创新的两阶段框架,通过交互式3D信号构建和引导式扩散模型,实现具有三维感知的文本到视频生成,支持精确的对象布局与摄像机运动控制。
English: CineMaster is a two-stage framework for 3D-aware text-to-video generation that enables precise object placement, camera manipulation, and layout control through interactive 3D signal construction and guided diffusion modeling.

Authors:Lu Chen, Lipeng Chen, Xiangchi Chen, Haojian Lu, Yu Zheng, Jun Wu, Yue Wang, Zhengyou Zhang, Rong Xiong
Title: Compliance while resisting: a shear-thickening fluid controller for physical human-robot interaction
Abstract:
Physical human-robot interaction (pHRI) is widely needed in many fields, such as industrial manipulation, home services, and medical rehabilitation, and puts higher demands on the safety of robots. Due to the uncertainty of the working environment, the pHRI may receive unexpected impact interference, which affects the safety and smoothness of the task execution. The commonly used linear admittance control (L-AC) can cope well with high-frequency small-amplitude noise, but for medium-frequency high-intensity impact, the effect is not as good. Inspired by the solid-liquid phase change nature of shear-thickening fluid, we propose a Shear-thickening Fluid Control (SFC) that can achieve both an easy human-robot collaboration and resistance to impact interference. The SFC's stability, passivity, and phase trajectory are analyzed in detail, the frequency and time domain properties are quantified, and parameter constraints in discrete control and coupled stability conditions are provided. We conducted simulations to compare the frequency and time domain characteristics of L-AC, nonlinear admittance controller (N-AC), and SFC, and validated their dynamic properties. In real-world experiments, we compared the performance of L-AC, N-AC, and SFC in both fixed and mobile manipulators. L-AC exhibits weak resistance to impact. N-AC can resist moderate impacts but not high-intensity ones, and may exhibit self-excited oscillations. In contrast, SFC demonstrated superior impact resistance and maintained stable collaboration, enhancing comfort in cooperative water delivery tasks. Additionally, a case study was conducted in a factory setting, further affirming the SFC's capability in facilitating human-robot collaborative manipulation and underscoring its potential in industrial applications.
中文: 受剪切增稠流体启发,所提出的剪切增稠流体控制(SFC)相比线性和非线性导纳控制器,在物理人机交互中展现出更优异的抗冲击性和稳定协作能力,有效提升了任务安全性与操作舒适度。
English: Inspired by shear-thickening fluids, the proposed Shear-thickening Fluid Control (SFC) demonstrates superior impact resistance and stable collaboration compared to linear and nonlinear admittance controllers, enhancing safety and comfort in human-robot interaction tasks.

Authors:Yihong Dong, Ge Li, Xue Jiang, Yongding Tao, Kechi Zhang, Hao Zhu, Huanyu Liu, Jiazheng Ding, Jia Li, Jinliang Deng, Hong Mei
Title: FANformer: Improving Large Language Models Through Effective Periodicity Modeling
Abstract:
Periodicity, as one of the most important basic characteristics, lays the foundation for facilitating structured knowledge acquisition and systematic cognitive processes within human learning paradigms. However, the potential flaws of periodicity modeling in Transformer affect the learning efficiency and establishment of underlying principles from data for large language models (LLMs) built upon it. In this paper, we demonstrate that integrating effective periodicity modeling can improve the learning efficiency and performance of LLMs. We introduce FANformer, which adapts Fourier Analysis Network (FAN) into attention mechanism to achieve efficient periodicity modeling, by modifying the feature projection process of attention mechanism. Extensive experimental results on language modeling show that FANformer consistently outperforms Transformer when scaling up model size and training tokens, underscoring its superior learning efficiency. Our pretrained FANformer-1B exhibits marked improvements on downstream tasks compared to open-source LLMs with similar model parameters or training tokens. Moreover, we reveal that FANformer exhibits superior ability to learn and apply rules for reasoning compared to Transformer. The results position FANformer as an effective and promising architecture for advancing LLMs.
中文: 通过在Transformer架构中引入有效的周期性建模,FANformer提升了大型语言模型的学习效率、性能及推理能力,在模型扩展和下游任务中均优于传统方法。
English: Integrating effective periodicity modeling into the Transformer architecture through FANformer enhances learning efficiency, performance, and reasoning capabilities in large language models, outperforming traditional approaches in scaling and downstream tasks.

Authors:Yihong Dong, Ge Li, Xue Jiang, Yongding Tao, Kechi Zhang, Hao Zhu, Huanyu Liu, Jiazheng Ding, Jia Li, Jinliang Deng, Hong Mei
Title: FANformer: Improving Large Language Models Through Effective Periodicity Modeling
Abstract:
Periodicity, as one of the most important basic characteristics, lays the foundation for facilitating structured knowledge acquisition and systematic cognitive processes within human learning paradigms. However, the potential flaws of periodicity modeling in Transformer affect the learning efficiency and establishment of underlying principles from data for large language models (LLMs) built upon it. In this paper, we demonstrate that integrating effective periodicity modeling can improve the learning efficiency and performance of LLMs. We introduce FANformer, which adapts Fourier Analysis Network (FAN) into attention mechanism to achieve efficient periodicity modeling, by modifying the feature projection process of attention mechanism. Extensive experimental results on language modeling show that FANformer consistently outperforms Transformer when scaling up model size and training tokens, underscoring its superior learning efficiency. Our pretrained FANformer-1B exhibits marked improvements on downstream tasks compared to open-source LLMs with similar model parameters or training tokens. Moreover, we reveal that FANformer exhibits superior ability to learn and apply rules for reasoning compared to Transformer. The results position FANformer as an effective and promising architecture for advancing LLMs.
中文: 通过在Transformer架构中引入有效的周期性建模,FANformer提升了大型语言模型的学习效率、性能及推理能力,在模型扩展和下游任务中均优于传统方法。
English: Integrating effective periodicity modeling into the Transformer architecture through FANformer enhances learning efficiency, performance, and reasoning capabilities in large language models, outperforming traditional approaches in scaling and downstream tasks.

Authors:Qianxi He, Qianyu He, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
Title: Order Doesn't Matter, But Reasoning Does: Training LLMs with Order-Centric Augmentation
Abstract:
Logical reasoning is essential for large language models (LLMs) to ensure accurate and coherent inference. However, LLMs struggle with reasoning order variations and fail to generalize across logically equivalent transformations. LLMs often rely on fixed sequential patterns rather than true logical understanding. To address this issue, we introduce an order-centric data augmentation framework based on commutativity in logical reasoning. We first randomly shuffle independent premises to introduce condition order augmentation. For reasoning steps, we construct a directed acyclic graph (DAG) to model dependencies between steps, which allows us to identify valid reorderings of steps while preserving logical correctness. By leveraging order-centric augmentations, models can develop a more flexible and generalized reasoning process. Finally, we conduct extensive experiments across multiple logical reasoning benchmarks, demonstrating that our method significantly enhances LLMs' reasoning performance and adaptability to diverse logical structures. We release our codes and augmented data in https://anonymous.4open.science/r/Order-Centric-Data-Augmentation-822C/.
中文摘要:本文提出了一种以顺序为中心的数据增强框架,通过打乱前提条件和重排推理步骤来增强大型语言模型的逻辑推理能力,显著提升了其在多种基准测试中的性能和适应性。
English Summary: This paper introduces an order-centric data augmentation framework that enhances large language models' logical reasoning by shuffling premises and reordering reasoning steps, significantly improving their performance and adaptability across various benchmarks.

Authors:Jiawei Kong, Hao Fang, Sihang Guo, Chenxi Qing, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Ke Xu
Title: Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in CLIP
Abstract:
While pre-trained Vision-Language Models (VLMs) such as CLIP exhibit impressive representational capabilities for multimodal data, recent studies have revealed their vulnerability to backdoor attacks. To alleviate the threat, existing defense strategies primarily focus on fine-tuning the entire suspicious model. However, the substantial model parameters increase the difficulty of reaching a stable and consistent optimization direction, limiting their resistance against state-of-the-art attacks and often resulting in a degradation of clean accuracy. To address this challenge, we propose Class-wise Backdoor Prompt Tuning (CBPT), an efficient and effective defense mechanism that operates on text prompts to indirectly purify poisoned CLIP. Specifically, we first employ the advanced contrastive learning via carefully crafted positive and negative samples, to effectively invert the backdoor triggers that are potentially adopted by the attacker. Once the dummy trigger is established, we leverage three well-designed loss functions to optimize these class-wise text prompts, modifying the model's decision boundary and further reclassifying the feature regions affected by backdoor triggers. Extensive experiments demonstrate that CBPT significantly mitigates backdoor threats while preserving model utility, e.g. an average Clean Accuracy (CA) of 58.83% and an Attack Success Rate (ASR) of 0.39% across seven mainstream backdoor attacks. These results underscore the superiority of our prompt purifying design to strengthen CLIP's robustness against backdoor attacks.
中文: 提出的类别级后门提示调优(CBPT)通过优化文本提示来净化被污染的特征,有效防御CLIP模型中的后门攻击,在保持高清洁准确率的同时显著降低了攻击成功率。
English: The proposed Class-wise Backdoor Prompt Tuning (CBPT) effectively defends against backdoor attacks in CLIP models by optimizing text prompts to purify poisoned features, maintaining high clean accuracy and significantly reducing attack success rates.

Authors:Jiaxi Li, Yiwei Wang, Kai Zhang, Yujun Cai, Bryan Hooi, Nanyun Peng, Kai-Wei Chang, Jin Lu
Title: Fact or Guesswork? Evaluating Large Language Models' Medical Knowledge with Structured One-Hop Judgments
Abstract:
Large language models (LLMs) have been widely adopted in various downstream task domains. However, their abilities to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. Given the high-stakes nature of medical applications, where incorrect information can have critical consequences, it is essential to evaluate the factuality of LLMs to retain medical knowledge. To address this challenge, we introduce the Medical Knowledge Judgment Dataset (MKJ), a dataset derived from the Unified Medical Language System (UMLS), a comprehensive repository of standardized biomedical vocabularies and knowledge graphs. Through a binary classification framework, MKJ evaluates LLMs' grasp of fundamental medical facts by having them assess the validity of concise, one-hop statements, enabling direct measurement of their knowledge retention capabilities. Our experiments reveal that LLMs have difficulty accurately recalling medical facts, with performances varying substantially across semantic types and showing notable weakness in uncommon medical conditions. Furthermore, LLMs show poor calibration, often being overconfident in incorrect answers. To mitigate these issues, we explore retrieval-augmented generation, demonstrating its effectiveness in improving factual accuracy and reducing uncertainty in medical decision-making.
中文: 大语言模型在准确回忆医学事实方面存在困难且常对错误答案过度自信,为此引入了专门的数据集评估其知识掌握,并探索了检索增强生成方法来提升准确性。
English: Large language models struggle with accurately recalling medical facts and exhibit overconfidence in incorrect answers, prompting the introduction of a specialized dataset to evaluate their knowledge and the exploration of retrieval-augmented generation to enhance accuracy.

Authors:Zhen Xiong, Yujun Cai, Bryan Hooi, Nanyun Peng, Zhecheng Li, Yiwei Wang
Title: Enhancing LLM Character-Level Manipulation via Divide and Conquer
Abstract:
Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks. However, they exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution. These challenges stem primarily from tokenization constraints, despite the critical role of such operations in data preprocessing and code generation. Through systematic analysis, we derive two key insights: (1) LLMs face significant difficulties in leveraging intrinsic token knowledge for character-level reasoning, and (2) atomized word structures can substantially enhance LLMs' ability to process token-level structural information. Building on these insights, we propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation. Our method decomposes complex operations into explicit character-level subtasks coupled with controlled token reconstruction phases, leading to significant improvements in accuracy. Without additional training, our method significantly improves accuracies on the $\texttt{Deletion}$, $\texttt{Insertion}$, and $\texttt{Substitution}$ tasks. To support further research, we open-source our implementation and benchmarks.
Chinese: 大语言模型在字符级字符串操作方面存在困难,但我们提出的分治字符级操作方法通过分解任务和重构标记,显著提升了其在删除、插入和替换任务上的准确率,且无需额外训练。
English: Large Language Models struggle with character-level string manipulation due to tokenization limitations, but our proposed Character-Level Manipulation via Divide and Conquer method significantly improves their performance on deletion, insertion, and substitution tasks without requiring additional training.

Authors:Wenhao You, Bryan Hooi, Yiwei Wang, Euijin Choo, Ming-Hsuan Yang, Junsong Yuan, Zi Huang, Yujun Cai
Title: Lost in Edits? A $λ$-Compass for AIGC Provenance
Abstract:
Recent advancements in diffusion models have driven the growth of text-guided image editing tools, enabling precise and iterative modifications of synthesized content. However, as these tools become increasingly accessible, they also introduce significant risks of misuse, emphasizing the critical need for robust attribution methods to ensure content authenticity and traceability. Despite the creative potential of such tools, they pose significant challenges for attribution, particularly in adversarial settings where edits can be layered to obscure an image's origins. We propose LambdaTracer, a novel latent-space attribution method that robustly identifies and differentiates authentic outputs from manipulated ones without requiring any modifications to generative or editing pipelines. By adaptively calibrating reconstruction losses, LambdaTracer remains effective across diverse iterative editing processes, whether automated through text-guided editing tools such as InstructPix2Pix and ControlNet or performed manually with editing software such as Adobe Photoshop. Extensive experiments reveal that our method consistently outperforms baseline approaches in distinguishing maliciously edited images, providing a practical solution to safeguard ownership, creativity, and credibility in the open, fast-evolving AI ecosystems.
中文摘要:LambdaTracer是一种新型潜在空间溯源方法,无需修改生成流程即可有效识别被篡改图像,在各类编辑过程中检测恶意修改的性能均优于基线方法。
English Summary: LambdaTracer is a novel latent-space attribution method that effectively identifies manipulated images without altering generative pipelines, outperforming baselines in detecting malicious edits across diverse editing processes.

Authors:Amitava Das, Yaswanth Narsupalli, Gurpreet Singh, Vinija Jain, Vasu Sharma, Suranjana Trivedy, Aman Chadha, Amit Sheth
Title: YINYANG-ALIGN: Benchmarking Contradictory Objectives and Proposing Multi-Objective Optimization based DPO for Text-to-Image Alignment
Abstract:
Precise alignment in Text-to-Image (T2I) systems is crucial to ensure that generated visuals not only accurately encapsulate user intents but also conform to stringent ethical and aesthetic benchmarks. Incidents like the Google Gemini fiasco, where misaligned outputs triggered significant public backlash, underscore the critical need for robust alignment mechanisms. In contrast, Large Language Models (LLMs) have achieved notable success in alignment. Building on these advancements, researchers are eager to apply similar alignment techniques, such as Direct Preference Optimization (DPO), to T2I systems to enhance image generation fidelity and reliability. We present YinYangAlign, an advanced benchmarking framework that systematically quantifies the alignment fidelity of T2I systems, addressing six fundamental and inherently contradictory design objectives. Each pair represents fundamental tensions in image generation, such as balancing adherence to user prompts with creative modifications or maintaining diversity alongside visual coherence. YinYangAlign includes detailed axiom datasets featuring human prompts, aligned (chosen) responses, misaligned (rejected) AI-generated outputs, and explanations of the underlying contradictions.
中文: YinYangAlign是一个先进的基准框架,通过解决文本到图像系统中六个基本矛盾的设计目标(如用户提示遵循与创意修改的平衡),系统化评估其对齐保真度,并利用包含人类提示和对比输出的数据集来提升生成可靠性。
English: YinYangAlign is a benchmarking framework that systematically evaluates the alignment fidelity of Text-to-Image systems by addressing six fundamental contradictory design objectives, such as balancing user prompt adherence with creative modifications, using detailed datasets to enhance generation reliability.

Authors:Samyak Rawlekar, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja
Title: Efficiently Disentangling CLIP for Multi-Object Perception
Abstract:
Vision-language models like CLIP excel at recognizing the single, prominent object in a scene. However, they struggle in complex scenes containing multiple objects. We identify a fundamental reason for this limitation: VLM feature space exhibits excessive mutual feature information (MFI), where the features of one class contain substantial information about other, unrelated classes. This high MFI becomes evident during class-specific queries, as unrelated objects are activated alongside the queried class. To address this limitation, we propose DCLIP, an efficient framework that learns an optimal level of mutual information while adding only minimal learnable parameters to a frozen VLM. DCLIP uses two complementary losses: a novel MFI Loss that regulates class feature similarity to prevent excessive overlap while preserving necessary shared information, and the Asymmetric Loss (ASL) that aligns image features with the disentangled text features. Through this disentanglement, DCLIP reduces excessive inter-class similarity by 30%. On multi-label recognition, DCLIP performs favorably over SOTA approaches on VOC2007 and COCO-14 while using 75% fewer training parameters. For zero-shot semantic segmentation, it shows improved performance across six benchmark datasets. These results highlight the importance of feature disentanglement for multi-object perception in VLMs.
中文摘要:视觉语言模型因类别间存在过度相互特征信息而难以处理复杂场景,DCLIP框架通过互补损失函数学习最优互信息水平,有效提升多目标识别性能。
English Summary: Vision-language models struggle with complex scenes due to excessive mutual feature information between unrelated classes, which DCLIP addresses by learning optimal mutual information levels through complementary losses to improve multi-object recognition.

Authors:Arpita Vats, Rahul Raja, Mrinal Mathur, Vinija Jain, Aman Chadha
Title: Multilingual State Space Models for Structured Question Answering in Indic Languages
Abstract:
The diversity and complexity of Indic languages present unique challenges for natural language processing (NLP) tasks, particularly in the domain of question answering (QA).To address these challenges, this paper explores the application of State Space Models (SSMs),to build efficient and contextually aware QA systems tailored for Indic languages. SSMs are particularly suited for this task due to their ability to model long-term and short-term dependencies in sequential data, making them well-equipped to handle the rich morphology, complex syntax, and contextual intricacies characteristic of Indian languages. We evaluated multiple SSM architectures across diverse datasets representing various Indic languages and conducted a comparative analysis of their performance. Our results demonstrate that these models effectively capture linguistic subtleties, leading to significant improvements in question interpretation, context alignment, and answer generation. This work represents the first application of SSMs to question answering tasks in Indic languages, establishing a foundational benchmark for future research in this domain. We propose enhancements to existing SSM frameworks, optimizing their applicability to low-resource settings and multilingual scenarios prevalent in Indic languages.
中文: 本研究开创性地应用状态空间模型为印度语言构建高效问答系统,验证了其处理语言复杂性的卓越能力,并为该领域未来研究奠定了基准基础。
English: This paper pioneers the application of State Space Models (SSMs) to develop efficient question answering systems for Indic languages, demonstrating their superior ability to handle linguistic complexities and establishing foundational benchmarks for future research.

Authors:Yunpeng Gao, Chenhui Li, Zhongrui You, Junli Liu, Zhen Li, Pengan Chen, Qizhi Chen, Zhonghan Tang, Liansheng Wang, Penghui Yang, Yiwen Tang, Yuhang Tang, Shuai Liang, Songyi Zhu, Ziqin Xiong, Yifei Su, Xinyi Ye, Jianan Li, Yan Ding, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li
Title: OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation
Abstract:
Vision-Language Navigation (VLN) aims to guide agents by leveraging language instructions and visual cues, playing a pivotal role in embodied AI. Indoor VLN has been extensively studied, whereas outdoor aerial VLN remains underexplored. The potential reason is that outdoor aerial view encompasses vast areas, making data collection more challenging, which results in a lack of benchmarks. To address this problem, we propose OpenFly, a platform comprising various rendering engines, a versatile toolchain, and a large-scale benchmark for aerial VLN. Firstly, we integrate diverse rendering engines and advanced techniques for environment simulation, including Unreal Engine, GTA V, Google Earth, and 3D Gaussian Splatting (3D GS). Particularly, 3D GS supports real-to-sim rendering, further enhancing the realism of our environments. Secondly, we develop a highly automated toolchain for aerial VLN data collection, streamlining point cloud acquisition, scene semantic segmentation, flight trajectory creation, and instruction generation. Thirdly, based on the toolchain, we construct a large-scale aerial VLN dataset with 100k trajectories, covering diverse heights and lengths across 18 scenes. Moreover, we propose OpenFly-Agent, a keyframe-aware VLN model emphasizing key observations during flight. For benchmarking, extensive experiments and analyses are conducted, evaluating several recent VLN methods and showcasing the superiority of our OpenFly platform and agent. The toolchain, dataset, and codes will be open-sourced.
中文: 本文提出OpenFly平台,整合多种渲染引擎和自动化工具链,构建大规模户外空中视觉语言导航基准数据集,并设计关键帧感知导航模型,通过实验验证了其优越性。
English: The paper introduces OpenFly, a comprehensive platform with rendering engines, a toolchain, and a large-scale benchmark to address the scarcity in outdoor aerial Vision-Language Navigation, alongside a keyframe-aware agent that demonstrates superior performance in experiments.

Authors:Chenyang Li, Yingyu Liang, Zhenmei Shi, Zhao Song
Title: When Can We Solve the Weighted Low Rank Approximation Problem in Truly Subquadratic Time?
Abstract:
The weighted low-rank approximation problem is a fundamental numerical linear algebra problem and has many applications in machine learning. Given a $n \times n$ weight matrix $W$ and a $n \times n$ matrix $A$, the goal is to find two low-rank matrices $U, V \in \mathbb{R}^{n \times k}$ such that the cost of $\| W \circ (U V^\top - A) \|_F^2$ is minimized. Previous work has to pay $Ω(n^2)$ time when matrices $A$ and $W$ are dense, e.g., having $Ω(n^2)$ non-zero entries. In this work, we show that there is a certain regime, even if $A$ and $W$ are dense, we can still hope to solve the weighted low-rank approximation problem in almost linear $n^{1+o(1)}$ time.
Chinese: 本研究提出了一种高效的近似线性时间算法,用于解决稠密加权低秩逼近问题,相比之前需要二次时间的方法实现了显著改进。
English: This work introduces an efficient almost linear time algorithm for solving the dense weighted low-rank approximation problem, significantly improving upon previous methods that required quadratic time.

Authors:Chengyue Gong, Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
Title: On Computational Limits of FlowAR Models: Expressivity and Efficiency
Abstract:
The expressive power and computational complexity of deep visual generative models, such as flow-based and autoregressive (AR) models, have gained considerable interest for their wide-ranging applications in generative tasks. However, the theoretical characterization of their expressiveness through the lens of circuit complexity remains underexplored, particularly for the state-of-the-art architecture like FlowAR proposed by [Ren et al., 2024], which integrates flow-based and autoregressive mechanisms. This gap limits our understanding of their inherent computational limits and practical efficiency. In this study, we address this gap by analyzing the circuit complexity of the FlowAR architecture. We demonstrate that when the largest feature map produced by the FlowAR model has dimensions $n \times n \times c$, the FlowAR model is simulable by a family of threshold circuits $\mathsf{TC}^0$, which have constant depth $O(1)$ and polynomial width $\mathrm{poly}(n)$. This is the first study to rigorously highlight the limitations in the expressive power of FlowAR models. Furthermore, we identify the conditions under which the FlowAR model computations can achieve almost quadratic time. To validate our theoretical findings, we present efficient model variant constructions based on low-rank approximations that align with the derived criteria. Our work provides a foundation for future comparisons with other generative paradigms and guides the development of more efficient and expressive implementations.
中文: 本研究分析了FlowAR架构的电路复杂性,证明其可由恒定深度阈值电路模拟,并确定了实现近二次计算效率的条件,通过低秩近似验证了理论结果。
English: This study analyzes the circuit complexity of the FlowAR architecture, showing it is simulable by constant-depth threshold circuits and identifying conditions for near-quadratic computational efficiency, with theoretical validation through low-rank approximations.

Authors:Yang Cao, Bo Chen, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Mingda Wan
Title: Force Matching with Relativistic Constraints: A Physics-Inspired Approach to Stable and Efficient Generative Modeling
Abstract:
This paper introduces Force Matching (ForM), a novel framework for generative modeling that represents an initial exploration into leveraging special relativistic mechanics to enhance the stability of the sampling process. By incorporating the Lorentz factor, ForM imposes a velocity constraint, ensuring that sample velocities remain bounded within a constant limit. This constraint serves as a fundamental mechanism for stabilizing the generative dynamics, leading to a more robust and controlled sampling process. We provide a rigorous theoretical analysis demonstrating that the velocity constraint is preserved throughout the sampling procedure within the ForM framework. To validate the effectiveness of our approach, we conduct extensive empirical evaluations. On the \textit{half-moons} dataset, ForM significantly outperforms baseline methods, achieving the lowest Euclidean distance loss of \textbf{0.714}, in contrast to vanilla first-order flow matching (5.853) and first- and second-order flow matching (5.793). Additionally, we perform an ablation study to further investigate the impact of our velocity constraint, reaffirming the superiority of ForM in stabilizing the generative process. The theoretical guarantees and empirical results underscore the potential of integrating special relativity principles into generative modeling. Our findings suggest that ForM provides a promising pathway toward achieving stable, efficient, and flexible generative processes. This work lays the foundation for future advancements in high-dimensional generative modeling, opening new avenues for the application of physical principles in machine learning.
中文: 本文提出力匹配框架,通过引入狭义相对论中的洛伦兹因子施加速度约束来稳定生成采样过程,在半环形数据集上表现优异,并通过理论分析和实验验证了其有效性。
English: This paper presents Force Matching (ForM), a generative modeling framework that uses special relativity's Lorentz factor to impose velocity constraints for stabilizing sampling dynamics, achieving superior performance on the half-moons dataset with strong theoretical and empirical validation.

Authors:Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song
Title: Universal Approximation of Visual Autoregressive Transformers
Abstract:
We investigate the fundamental limits of transformer-based foundation models, extending our analysis to include Visual Autoregressive (VAR) transformers. VAR represents a big step toward generating images using a novel, scalable, coarse-to-fine ``next-scale prediction'' framework. These models set a new quality bar, outperforming all previous methods, including Diffusion Transformers, while having state-of-the-art performance for image synthesis tasks. Our primary contributions establish that, for single-head VAR transformers with a single self-attention layer and single interpolation layer, the VAR Transformer is universal. From the statistical perspective, we prove that such simple VAR transformers are universal approximators for any image-to-image Lipschitz functions. Furthermore, we demonstrate that flow-based autoregressive transformers inherit similar approximation capabilities. Our results provide important design principles for effective and computationally efficient VAR Transformer strategies that can be used to extend their utility to more sophisticated VAR models in image generation and other related areas.
Chinese: 本研究证明,即使采用极简架构的视觉自回归变换器(VAR)也具备通用逼近能力,其创新的可扩展由粗到细图像生成框架在图像合成任务中创造了新的性能标杆。
English: This study demonstrates that Visual Autoregressive (VAR) transformers, which introduce a scalable coarse-to-fine framework for image generation, achieve universal approximation capabilities even with minimal architecture and set new performance benchmarks in image synthesis tasks.

Authors:Yekun Ke, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
Title: DPBloomfilter: Securing Bloom Filters with Differential Privacy
Abstract:
The Bloom filter is a simple yet space-efficient probabilistic data structure that supports membership queries for dramatically large datasets. It is widely utilized and implemented across various industrial scenarios, often handling massive datasets that include sensitive user information necessitating privacy preservation. To address the challenge of maintaining privacy within the Bloom filter, we have developed the DPBloomfilter. This innovation integrates the classical differential privacy mechanism, specifically the Random Response technique, into the Bloom filter, offering robust privacy guarantees under the same running complexity as the standard Bloom filter. Through rigorous simulation experiments, we have demonstrated that our DPBloomfilter algorithm maintains high utility while ensuring privacy protections. To the best of our knowledge, this is the first work to provide differential privacy guarantees for the Bloom filter for membership query problems.
Chinese: DPBloomfilter 将差分隐私机制融入布隆过滤器,在保持高效运行的同时为成员查询提供隐私保护,并通过实验验证了其有效性和实用性。
English: The DPBloomfilter integrates differential privacy with the Bloom filter to protect sensitive data in membership queries while maintaining efficiency and utility, as demonstrated through simulations.

Authors:Yuefan Cao, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Jiahao Zhang
Title: Dissecting Submission Limit in Desk-Rejections: A Mathematical Analysis of Fairness in AI Conference Policies
Abstract:
As AI research surges in both impact and volume, conferences have imposed submission limits to maintain paper quality and alleviate organizational pressure. In this work, we examine the fairness of desk-rejection systems under submission limits and reveal that existing practices can result in substantial inequities. Specifically, we formally define the paper submission limit problem and identify a critical dilemma: when the number of authors exceeds three, it becomes impossible to reject papers solely based on excessive submissions without negatively impacting innocent authors. Thus, this issue may unfairly affect early-career researchers, as their submissions may be penalized due to co-authors with significantly higher submission counts, while senior researchers with numerous papers face minimal consequences. To address this, we propose an optimization-based fairness-aware desk-rejection mechanism and formally define two fairness metrics: individual fairness and group fairness. We prove that optimizing individual fairness is NP-hard, whereas group fairness can be efficiently optimized via linear programming. Through case studies, we demonstrate that our proposed system ensures greater equity than existing methods, including those used in CVPR 2025, offering a more socially just approach to managing excessive submissions in AI conferences.
中文摘要:本研究揭示了当前投稿限制下的桌面拒稿系统存在显著不公,尤其对早期研究者不利,并提出了一种基于优化的公平机制,通过群体公平性指标有效提升决策公正性。
English Summary: This study highlights the inequities in current desk-rejection systems under submission limits, showing they disproportionately penalize early-career researchers, and proposes a fairness-aware mechanism that ensures greater equity through optimized group fairness metrics.

Authors:Bo Chen, Chengyue Gong, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Mingda Wan
Title: High-Order Matching for One-Step Shortcut Diffusion Models
Abstract:
One-step shortcut diffusion models [Frans, Hafner, Levine and Abbeel, ICLR 2025] have shown potential in vision generation, but their reliance on first-order trajectory supervision is fundamentally limited. The Shortcut model's simplistic velocity-only approach fails to capture intrinsic manifold geometry, leading to erratic trajectories, poor geometric alignment, and instability-especially in high-curvature regions. These shortcomings stem from its inability to model mid-horizon dependencies or complex distributional features, leaving it ill-equipped for robust generative modeling. In this work, we introduce HOMO (High-Order Matching for One-Step Shortcut Diffusion), a game-changing framework that leverages high-order supervision to revolutionize distribution transportation. By incorporating acceleration, jerk, and beyond, HOMO not only fixes the flaws of the Shortcut model but also achieves unprecedented smoothness, stability, and geometric precision. Theoretically, we prove that HOMO's high-order supervision ensures superior approximation accuracy, outperforming first-order methods. Empirically, HOMO dominates in complex settings, particularly in high-curvature regions where the Shortcut model struggles. Our experiments show that HOMO delivers smoother trajectories and better distributional alignment, setting a new standard for one-step generative models.
中文摘要:HOMO框架通过引入高阶监督机制,从根本上解决了单步捷径扩散模型在几何对齐和稳定性方面的缺陷,为生成模型设立了新标准。
English Summary: The HOMO framework introduces high-order supervision to overcome the limitations of first-order shortcut diffusion models, achieving superior smoothness, stability, and geometric precision in one-step generative modeling.

Authors:Ruiqi Yan, Xiquan Li, Wenxi Chen, Zhikang Niu, Chen Yang, Ziyang Ma, Kai Yu, Xie Chen
Title: URO-Bench: Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models
Abstract:
Recent advances in large language models (LLMs) have driven significant progress in end-to-end spoken dialogue models (SDMs). In contrast to text-based LLMs, the evaluation framework for SDMs should encompass both cognitive dimensions (e.g., logical reasoning, knowledge) and speech-related aspects (e.g., paralinguistic cues, audio quality). However, there is still a lack of comprehensive evaluations for SDMs in speech-to-speech (S2S) scenarios. To address this gap, we propose URO-Bench, an extensive benchmark for SDMs. Notably, URO-Bench is the first S2S benchmark that covers evaluations about multilingualism, multi-round dialogues, and paralinguistics. Our benchmark is divided into two difficulty levels: basic track and pro track, each comprising 20 test sets, evaluating the spoken dialogue model's abilities in Understanding, Reasoning, and Oral conversation. Evaluations on our proposed benchmark reveal that current open-source SDMs perform rather well in daily QA tasks, but lag behind their backbone LLMs in terms of instruction-following ability and also suffer from catastrophic forgetting. Their performance in advanced evaluations of paralinguistic information and audio understanding remains subpar, highlighting the need for further research in this direction. We hope that URO-Bench can facilitate the development of spoken dialogue models by providing a multifaceted evaluation of existing models and helping to track progress in this area.
Chinese: 大型语言模型的最新进展推动了端到端口语对话模型的发展,但缺乏全面的评估框架,为此我们提出了首个涵盖多语言、多轮对话和副语言信息的语音转语音基准URO-Bench,用于评估模型的理解、推理和口语对话能力。
English: Recent advances in large language models have spurred progress in spoken dialogue models, yet a comprehensive evaluation framework is lacking, which URO-Bench addresses by offering the first speech-to-speech benchmark covering multilingualism, multi-round dialogues, and paralinguistics to assess understanding, reasoning, and oral conversation abilities.

Authors:Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You
Title: Enhance-A-Video: Better Generated Video for Free
Abstract:
DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos, named Enhance-A-Video. The core idea is enhancing the cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its simple design, our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning. Across various DiT-based video generation models, our approach demonstrates promising improvements in both temporal consistency and visual quality. We hope this research can inspire future explorations in video generation enhancement.
中文: 本文提出Enhance-A-Video这一免训练方法,通过非对角时序注意力分布增强帧间关联性,无需重新训练即可提升DiT视频生成模型的时间一致性与视觉质量。
English: This paper introduces Enhance-A-Video, a training-free method that improves DiT-based video generation by strengthening cross-frame correlations through non-diagonal temporal attention distributions, enhancing both temporal consistency and visual quality without requiring model retraining.

Authors:Jiaqi Bai, Hongcheng Guo, Zhongyuan Peng, Jian Yang, Zhoujun Li, Mohan Li, Zhihong Tian
Title: Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow
Abstract:
Large vision-language models show tremendous potential in understanding visual information through human languages. However, they are prone to suffer from object hallucination, i.e., the generated image descriptions contain objects that do not exist in the image. In this paper, we reveal that object hallucination can be attributed to overconfidence in irrelevant visual features when soft visual tokens map to the LLM's word embedding space. Specifically, by figuring out the semantic similarity between visual tokens and LLM's word embedding, we observe that the smoothness of similarity distribution strongly correlates with the emergence of object hallucinations. To mitigate hallucinations, we propose using the Variational Information Bottleneck (VIB) to alleviate overconfidence by introducing stochastic noise, facilitating the constraining of irrelevant information. Furthermore, we propose an entropy-based noise-controlling strategy to enable the injected noise to be adaptively constrained regarding the smoothness of the similarity distribution. We adapt the proposed AdaVIB across distinct model architectures. Experimental results demonstrate that the proposed AdaVIB mitigates object hallucinations by effectively alleviating the overconfidence in irrelevant visual features, with consistent improvements on two object hallucination benchmarks.
中文: 大型视觉语言模型常因对无关视觉特征的过度自信而产生物体幻觉,但提出的AdaVIB方法通过变分信息瓶颈自适应约束信息,有效缓解了这一问题,并在两个基准测试中取得稳定提升。
English: Large vision-language models often generate object hallucinations due to overconfidence in irrelevant visual features, but the proposed AdaVIB method mitigates this by adaptively constraining information with variational noise, showing consistent improvements on benchmarks.

Authors:Chaoran Chen, Bingsheng Yao, Ruishi Zou, Wenyue Hua, Weimin Lyu, Yanfang Ye, Toby Jia-Jun Li, Dakuo Wang
Title: Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents
Abstract:
Role-Playing Agent (RPA) is an increasingly popular type of LLM Agent that simulates human-like behaviors in a variety of tasks. However, evaluating RPAs is challenging due to diverse task requirements and agent designs. This paper proposes an evidence-based, actionable, and generalizable evaluation design guideline for LLM-based RPA by systematically reviewing 1,676 papers published between Jan. 2021 and Dec. 2024. Our analysis identifies six agent attributes, seven task attributes, and seven evaluation metrics from existing literature. Based on these findings, we present an RPA evaluation design guideline to help researchers develop more systematic and consistent evaluation methods.
Chinese: 本文通过分析1,676篇文献,提出了基于证据、可操作且可推广的大型语言模型角色扮演代理评估设计指南,识别出关键代理属性和任务属性及评估指标,以促进系统化和一致性的评估方法发展。
English: This paper introduces an evidence-based, actionable, and generalizable evaluation design guideline for LLM-based Role-Playing Agents (RPAs) by analyzing 1,676 papers, identifying key agent and task attributes along with evaluation metrics to promote systematic and consistent assessment methods.

Authors:Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Jessie Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, Dakuo Wang
Title: UXAgent: An LLM Agent-Based Usability Testing Framework for Web Design
Abstract:
Usability testing is a fundamental yet challenging (e.g., inflexible to iterate the study design flaws and hard to recruit study participants) research method for user experience (UX) researchers to evaluate a web design. Recent advances in Large Language Model-simulated Agent (LLM-Agent) research inspired us to design UXAgent to support UX researchers in evaluating and reiterating their usability testing study design before they conduct the real human subject study. Our system features an LLM-Agent module and a universal browser connector module so that UX researchers can automatically generate thousands of simulated users to test the target website. The results are shown in qualitative (e.g., interviewing how an agent thinks ), quantitative (e.g., # of actions), and video recording formats for UX researchers to analyze. Through a heuristic user evaluation with five UX researchers, participants praised the innovation of our system but also expressed concerns about the future of LLM Agent-assisted UX study.
Chinese: UXAgent是一个创新系统,利用大语言模型智能体模拟数千用户进行可用性测试,使UX研究人员能在开展真人研究前通过定性、定量和视频反馈迭代评估网页设计。
English: UXAgent is an innovative system utilizing LLM-Agents to simulate thousands of users for usability testing, enabling UX researchers to iteratively evaluate web designs through qualitative, quantitative, and video feedback before conducting human studies.

Authors:Ying Lei, Yancheng Cao, Will Wang, Yuanzhe Dong, Changchang Yin, Weidan Cao, Ping Zhang, Jingzhen Yang, Bingsheng Yao, Yifan Peng, Chunhua Weng, Randy Auerbach, Lena Mamykina, Dakuo Wang, Yuntao Wang, Xuhai Xu
Title: WatchGuardian: Enabling User-Defined Personalized Just-in-Time Intervention on Smartwatch
Abstract:
While just-in-time interventions (JITIs) have effectively targeted common health behaviors, individuals often have unique needs to intervene in personal undesirable actions that can negatively affect physical, mental, and social well-being. We present WatchGuardian, a smartwatch-based JITI system that empowers users to define custom interventions for these personal actions with a small number of samples. For the model to detect new actions based on limited new data samples, we developed a few-shot learning pipeline that finetuned a pre-trained inertial measurement unit (IMU) model on public hand-gesture datasets. We then designed a data augmentation and synthesis process to train additional classification layers for customization. Our offline evaluation with 26 participants showed that with three, five, and ten examples, our approach achieved an average accuracy of 76.8%, 84.7%, and 87.7%, and an F1 score of 74.8%, 84.2%, and 87.2% We then conducted a four-hour intervention study to compare WatchGuardian against a rule-based intervention. Our results demonstrated that our system led to a significant reduction by 64.0 +- 22.6% in undesirable actions, substantially outperforming the baseline by 29.0%. Our findings underscore the effectiveness of a customizable, AI-driven JITI system for individuals in need of behavioral intervention in personal undesirable actions. We envision that our work can inspire broader applications of user-defined personalized intervention with advanced AI solutions.
中文摘要:WatchGuardian是一款基于智能手表的即时干预系统,通过小样本学习让用户自定义针对不良行为的个性化干预,在研究中实现了64.0%的行为显著减少。
English Summary: WatchGuardian is a smartwatch-based just-in-time intervention system that allows users to create personalized interventions for undesirable behaviors using few-shot learning, achieving significant action reduction of 64.0% in studies.

Authors:Ziqi Yang, Yuxuan Lu, Jennifer Bagdasarian, Vedant Das Swain, Ritu Agarwal, Collin Campbell, Waddah Al-Refaire, Jehan El-Bayoumi, Guodong Gao, Dakuo Wang, Bingsheng Yao, Nawar Shara
Title: RECOVER: Designing a Large Language Model-based Remote Patient Monitoring System for Postoperative Gastrointestinal Cancer Care
Abstract:
Cancer surgery is a key treatment for gastrointestinal (GI) cancers, a group of cancers that account for more than 35% of cancer-related deaths worldwide, but postoperative complications are unpredictable and can be life-threatening. In this paper, we investigate how recent advancements in large language models (LLMs) can benefit remote patient monitoring (RPM) systems through clinical integration by designing RECOVER, an LLM-powered RPM system for postoperative GI cancer care. To closely engage stakeholders in the design process, we first conducted seven participatory design sessions with five clinical staff and interviewed five cancer patients to derive six major design strategies for integrating clinical guidelines and information needs into LLM-based RPM systems. We then designed and implemented RECOVER, which features an LLM-powered conversational agent for cancer patients and an interactive dashboard for clinical staff to enable efficient postoperative RPM. Finally, we used RECOVER as a pilot system to assess the implementation of our design strategies with four clinical staff and five patients, providing design implications by identifying crucial design elements, offering insights on responsible AI, and outlining opportunities for future LLM-powered RPM systems.
中文摘要:本研究开发了基于大语言模型的RECOVER远程患者监护系统,通过利益相关者参与设计并评估其在胃肠癌术后护理中的应用,为未来AI医疗系统提供了关键设计要素与实施策略。
English Summary: This study introduces RECOVER, an LLM-powered remote patient monitoring system designed for postoperative gastrointestinal cancer care, developed through stakeholder engagement and evaluated to provide design insights for future AI-enhanced healthcare systems.

Authors:Shihan Fu, Bingsheng Yao, Smit Desai, Yuqi Hu, Yuling Sun, Samantha Stonbraker, Yanjun Gao, Elizabeth M. Goldberg, Dakuo Wang
Title: "It Felt Like I Was Left in the Dark": Exploring Information Needs and Design Opportunities for Family Caregivers of Older Adult Patients in Critical Care Settings
Abstract:
Older adult patients constitute a rapidly growing subgroup of Intensive Care Unit (ICU) patients. In these situations, their family caregivers are expected to represent the unconscious patients to access and interpret patients' medical information. However, caregivers currently have to rely on overloaded clinicians for information updates and typically lack the health literacy to understand complex medical information. Our project aims to explore the information needs of caregivers of ICU older adult patients, from which we can propose design opportunities to guide future AI systems. The project begins with formative interviews with 11 caregivers to identify their challenges in accessing and interpreting medical information; From these findings, we then synthesize design requirements and propose an AI system prototype to cope with caregivers' challenges. The system prototype has two key features: a timeline visualization to show the AI extracted and summarized older adult patients' key medical events; and an LLM-based chatbot to provide context-aware informational support. We conclude our paper by reporting on the follow-up user evaluation of the system and discussing future AI-based systems for ICU caregivers of older adults.
中文摘要:本项目针对老年ICU患者家属照护者的信息获取困境,开发了一个具备医疗事件时间轴和基于大语言模型的聊天机器人功能的AI系统原型,以提供有效的信息支持。
English Summary: This project identifies the information challenges faced by family caregivers of elderly ICU patients and develops an AI system prototype featuring a medical event timeline and an LLM-powered chatbot to address their needs.

Authors:Jun Xu, Mengshu Sun, Zhiqiang Zhang, Jun Zhou
Title: MAQInstruct: Instruction-based Unified Event Relation Extraction
Abstract:
Extracting event relations that deviate from known schemas has proven challenging for previous methods based on multi-class classification, MASK prediction, or prototype matching. Recent advancements in large language models have shown impressive performance through instruction tuning. Nevertheless, in the task of event relation extraction, instruction-based methods face several challenges: there are a vast number of inference samples, and the relations between events are non-sequential. To tackle these challenges, we present an improved instruction-based event relation extraction framework named MAQInstruct. Firstly, we transform the task from extracting event relations using given event-event instructions to selecting events using given event-relation instructions, which reduces the number of samples required for inference. Then, by incorporating a bipartite matching loss, we reduce the dependency of the instruction-based method on the generation sequence. Our experimental results demonstrate that MAQInstruct significantly improves the performance of event relation extraction across multiple LLMs.
中文:MAQInstruct通过将事件关系抽取任务转化为基于事件-关系指令的事件选择,减少推理样本数量,并引入二分图匹配损失降低对生成序列的依赖,从而显著提升了多种大语言模型在事件关系抽取上的性能。
English: MAQInstruct enhances event relation extraction by converting the task to event selection with event-relation instructions, reducing inference samples, and using bipartite matching loss to lessen sequence dependency, achieving superior performance across various large language models.

Authors:Lin Yuan, Jun Xu, Honghao Gui, Mengshu Sun, Zhiqiang Zhang, Lei Liang, Jun Zhou
Title: Improving Natural Language Understanding for LLMs via Large-Scale Instruction Synthesis
Abstract:
High-quality, large-scale instructions are crucial for aligning large language models (LLMs), however, there is a severe shortage of instruction in the field of natural language understanding (NLU). Previous works on constructing NLU instructions mainly focus on information extraction (IE), neglecting tasks such as machine reading comprehension, question answering, and text classification. Furthermore, the lack of diversity in the data has led to a decreased generalization ability of trained LLMs in other NLU tasks and a noticeable decline in the fundamental model's general capabilities. To address this issue, we propose Hum, a large-scale, high-quality synthetic instruction corpus for NLU tasks, designed to enhance the NLU capabilities of LLMs. Specifically, Hum includes IE (either close IE or open IE), machine reading comprehension, text classification, and instruction generalist tasks, thereby enriching task diversity. Additionally, we introduce a human-LLMs collaborative mechanism to synthesize instructions, which enriches instruction diversity by incorporating guidelines, preference rules, and format variants. We conduct extensive experiments on 5 NLU tasks and 28 general capability evaluation datasets for LLMs. Experimental results show that Hum enhances the NLU capabilities of six LLMs by an average of 3.1\%, with no significant decline observed in other general capabilities.
Chinese: 为解决自然语言理解任务中高质量、多样化指令的匮乏问题,我们提出了Hum——一个通过人机协作构建的大规模合成指令语料库,它在提升大型语言模型NLU能力3.1%的同时,保持了模型其他通用能力的稳定。
English: To address the shortage of diverse and high-quality instructions for natural language understanding (NLU) tasks, we introduce Hum, a synthetic instruction corpus developed through human-LLM collaboration that enhances LLMs' NLU capabilities by 3.1% on average without compromising their general performance.

Authors:Bingsheng Yao, Menglin Zhao, Yuling Sun, Weidan Cao, Changchang Yin, Stephen Intille, Xuhai Xu, Ping Zhang, Jingzhen Yang, Dakuo Wang
Title: More Modality, More AI: Exploring Design Opportunities of AI-Based Multi-modal Remote Monitoring Technologies for Early Detection of Mental Health Sequelae in Youth Concussion Patients
Abstract:
Anxiety, depression, and suicidality are common mental health sequelae following concussion in youth patients, often exacerbating concussion symptoms and prolonging recovery. Despite the critical need for early detection of these mental health symptoms, clinicians often face challenges in accurately collecting patients' mental health data and making clinical decision-making in a timely manner. Today's remote patient monitoring (RPM) technologies offer opportunities to objectively monitor patients' activities, but they were not specifically designed for youth concussion patients; moreover, the large amount of data collected by RPM technologies may also impose significant workloads on clinicians to keep up with and use the data. To address these gaps, we employed a three-stage study consisting of a formative study, interface design, and design evaluation. We first conducted a formative study through semi-structured interviews with six highly professional concussion clinicians and identified clinicians' key challenges in remotely collecting patient information and accessing patient treatment compliance. Subsequently, we proposed preliminary clinician-facing interface designs with the integration of AI-based RPM technologies (AI-RPM), followed by design evaluation sessions with highly professional concussion clinicians. Clinicians underscored the value of integrating multi-modal AI-RPM technologies to support their decision-making while emphasizing the importance of customizable interfaces through collaborative design and multiple responsible design considerations.
中文摘要:本研究通过临床医生访谈和迭代设计开发了AI增强的远程患者监测界面,旨在解决青少年脑震荡康复中精神健康跟踪的难题,并强调可定制化与负责任的设计实施。
English Summary: This study developed AI-enhanced remote patient monitoring interfaces through clinician interviews and iterative design to address mental health tracking challenges in youth concussion recovery, emphasizing customizable and responsible implementation.

Authors:Yiyang Zhu, Jiayi Zhang, Enyu Shi, Ziheng Liu, Chau Yuen, Bo Ai
Title: Joint Power Allocation and Phase Shift Design for Stacked Intelligent Metasurfaces-aided Cell-Free Massive MIMO Systems with MARL
Abstract:
Cell-free (CF) massive multiple-input multiple-output (mMIMO) systems offer high spectral efficiency (SE) through multiple distributed access points (APs). However, the large number of antennas increases power consumption. We propose incorporating stacked intelligent metasurfaces (SIM) into CF mMIMO systems as a cost-effective, energy-efficient solution. This paper focuses on optimizing the joint power allocation of APs and the phase shift of SIMs to maximize the sum SE. To address this complex problem, we introduce a fully distributed multi-agent reinforcement learning (MARL) algorithm. Our novel algorithm, the noisy value method with a recurrent policy in multi-agent policy optimization (NVR-MAPPO), enhances performance by encouraging diverse exploration under centralized training and decentralized execution. Simulations demonstrate that NVR-MAPPO significantly improves sum SE and robustness across various scenarios.
中文摘要:本文提出一种完全分布式多智能体强化学习算法NVR-MAPPO,通过优化无蜂窝大规模MIMO系统中接入点功率分配与智能超表面相位偏移,在多种场景下显著提升了系统和频谱效率与鲁棒性。
English Summary: This paper introduces a fully distributed multi-agent reinforcement learning algorithm, NVR-MAPPO, to optimize joint power allocation and phase shifts in cell-free massive MIMO systems integrated with stacked intelligent metasurfaces, significantly enhancing sum spectral efficiency and system robustness.

Authors:Maria Tsampazi, Michele Polese, Falko Dressler, Tommaso Melodia
Title: O-RIS-ing: Evaluating RIS-Assisted NextG Open RAN
Abstract:
Reconfigurable Intelligent Surfaces (RISs) pose as a transformative technology to revolutionize the cellular architecture of Next Generation (NextG) Radio Access Networks (RANs). Previous studies have demonstrated the capabilities of RISs in optimizing wireless propagation, achieving high spectral efficiency, and improving resource utilization. At the same time, the transition to softwarized, disaggregated, and virtualized architectures, such as those being standardized by the O-RAN ALLIANCE, enables the vision of a reconfigurable Open RAN. In this work, we aim to integrate these technologies by studying how different resource allocation policies enhance the performance of RIS-assisted Open RANs. We perform a comparative analysis among various network configurations and show how proper network optimization can enhance the performance across the Enhanced Mobile Broadband (eMBB) and Ultra Reliable and Low Latency Communications (URLLC) network slices, achieving up to ~34% throughput improvement. Furthermore, leveraging the capabilities of OpenRAN Gym, we deploy an xApp on Colosseum, the world's largest wireless system emulator with hardware-in-the-loop, to control the Base Station (BS)'s scheduling policy. Experimental results demonstrate that RIS-assisted topologies achieve high resource efficiency and low latency, regardless of the BS's scheduling policy.
Chinese: 可重构智能表面(RIS)通过与开放无线接入网架构融合,显著提升了下一代蜂窝网络的性能,在多种网络切片中实现了高达约34%的吞吐量增长及高效低延迟通信。
English: Reconfigurable Intelligent Surfaces (RISs) enhance NextG cellular networks by integrating with Open RAN architectures, improving throughput by up to 34% and achieving high efficiency and low latency across diverse network slices.

Authors:Xuxu Liu, Siyuan Liang, Mengya Han, Yong Luo, Aishan Liu, Xiantao Cai, Zheng He, Dacheng Tao
Title: ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models
Abstract:
Generative large language models are crucial in natural language processing, but they are vulnerable to backdoor attacks, where subtle triggers compromise their behavior. Although backdoor attacks against LLMs are constantly emerging, existing benchmarks remain limited in terms of sufficient coverage of attack, metric system integrity, backdoor attack alignment. And existing pre-trained backdoor attacks are idealized in practice due to resource access constraints. Therefore we establish $\textit{ELBA-Bench}$, a comprehensive and unified framework that allows attackers to inject backdoor through parameter efficient fine-tuning ($\textit{e.g.,}$ LoRA) or without fine-tuning techniques ($\textit{e.g.,}$ In-context-learning). $\textit{ELBA-Bench}$ provides over 1300 experiments encompassing the implementations of 12 attack methods, 18 datasets, and 12 LLMs. Extensive experiments provide new invaluable findings into the strengths and limitations of various attack strategies. For instance, PEFT attack consistently outperform without fine-tuning approaches in classification tasks while showing strong cross-dataset generalization with optimized triggers boosting robustness; Task-relevant backdoor optimization techniques or attack prompts along with clean and adversarial demonstrations can enhance backdoor attack success while preserving model performance on clean samples. Additionally, we introduce a universal toolbox designed for standardized backdoor attack research, with the goal of propelling further progress in this vital area.
中文摘要:本研究提出了ELBA-Bench这一评估大语言模型后门攻击的综合框架,通过大量实验揭示了不同攻击策略的有效性及模型脆弱性的关键发现。
English Summary: The study introduces ELBA-Bench, a comprehensive framework for evaluating backdoor attacks on large language models, featuring extensive experiments that reveal key insights into attack effectiveness and model vulnerabilities.

Authors:Yu Zhou, Bingxuan Li, Mohan Tang, Xiaomeng Jin, Te-Lin Wu, Kuan-Hao Huang, Heng Ji, Kai-Wei Chang, Nanyun Peng
Title: Contrastive Visual Data Augmentation
Abstract:
Large multimodal models (LMMs) often struggle to recognize novel concepts, as they rely on pre-trained knowledge and have limited ability to capture subtle visual details. Domain-specific knowledge gaps in training also make them prone to confusing visually similar, commonly misrepresented, or low-resource concepts. To help LMMs better align nuanced visual features with language, improving their ability to recognize and reason about novel or rare concepts, we propose a Contrastive visual Data Augmentation (CoDA) strategy. CoDA extracts key contrastive textual and visual features of target concepts against the known concepts they are misrecognized as, and then uses multimodal generative models to produce targeted synthetic data. Automatic filtering of extracted features and augmented images is implemented to guarantee their quality, as verified by human annotators. We show the effectiveness and efficiency of CoDA on low-resource concept and diverse scene recognition datasets including INaturalist and SUN. We additionally collect NovelSpecies, a benchmark dataset consisting of newly discovered animal species that are guaranteed to be unseen by LMMs. LLaVA-1.6 1-shot updating results on these three datasets show CoDA significantly improves SOTA visual data augmentation strategies by 12.3% (NovelSpecies), 5.1% (SUN), and 6.0% (iNat) absolute gains in accuracy.
Chinese: 提出的对比视觉数据增强(CoDA)策略通过生成针对性的合成数据,使大型多模态模型能更好地将细微视觉特征与语言对齐,从而显著提升了对多个数据集中新颖概念的识别准确率。
English: The proposed Contrastive visual Data Augmentation (CoDA) strategy enhances large multimodal models' recognition of novel concepts by generating targeted synthetic data that aligns nuanced visual features with language, achieving significant accuracy improvements across multiple datasets.

Authors:Bingxuan Li, Yiwei Wang, Jiuxiang Gu, Kai-Wei Chang, Nanyun Peng
Title: METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling
Abstract:
Chart generation aims to generate code to produce charts satisfying the desired visual properties, e.g., texts, layout, color, and type. It has great potential to empower the automatic professional report generation in financial analysis, research presentation, education, and healthcare. In this work, we build a vision-language model (VLM) based multi-agent framework for effective automatic chart generation. Generating high-quality charts requires both strong visual design skills and precise coding capabilities that embed the desired visual properties into code. Such a complex multi-modal reasoning process is difficult for direct prompting of VLMs. To resolve these challenges, we propose METAL, a multi-agent framework that decomposes the task of chart generation into the iterative collaboration among specialized agents. METAL achieves 5.2% improvement over the current best result in the chart generation task. The METAL framework exhibits the phenomenon of test-time scaling: its performance increases monotonically as the logarithmic computational budget grows from 512 to 8192 tokens. In addition, we find that separating different modalities during the critique process of METAL boosts the self-correction capability of VLMs in the multimodal context.
中文摘要:本文提出METAL多智能体框架,通过将图表生成任务分解为专业智能体间的迭代协作,实现了5.2%的性能提升,并展现出计算资源增加时的测试时扩展特性。
English Summary: This paper introduces METAL, a multi-agent framework that enhances chart generation by decomposing the task into specialized agents' iterative collaboration, achieving a 5.2% performance improvement and demonstrating test-time scaling with increased computational budgets.

Authors:Zetian Sun, Dongfang Li, Baotian Hu, Jun Yu, Min Zhang
Title: Improving Value-based Process Verifier via Structural Prior Injection
Abstract:
In the Large Language Model(LLM) reasoning scenario, people often estimate state value via Monte Carlo sampling. Though Monte Carlo estimation is an elegant method with less inductive bias, noise and errors are inevitably introduced due to the limited sampling. To handle the problem, we inject the structural prior into the value representation and transfer the scalar value into the expectation of a pre-defined categorical distribution, representing the noise and errors from a distribution perspective. Specifically, by treating the result of Monte Carlo sampling as a single sample from the prior ground-truth Binomial distribution, we quantify the sampling error as the mismatch between posterior estimated distribution and ground-truth distribution, which is thus optimized via distribution selection optimization. We test the performance of value-based process verifiers on Best-of-N task and Beam search task. Compared with the scalar value representation, we show that reasonable structural prior injection induced by different objective functions or optimization methods can improve the performance of value-based process verifiers for about 1$\sim$2 points at little-to-no cost. We also show that under different structural prior, the verifiers' performances vary greatly despite having the same optimal solution, indicating the importance of reasonable structural prior injection.
中文摘要: 本研究通过将结构先验注入价值表示,将标量值转化为预定义分类分布的期望值,有效减少了蒙特卡洛采样误差,以极低成本提升了基于价值的流程验证器性能约1-2个百分点。
English Summary: This study enhances LLM reasoning by incorporating structural priors into value representation, transforming scalar values into categorical distributions to mitigate Monte Carlo sampling errors and improve verification performance at minimal cost.

Authors:Tanmay Parekh, Yuxuan Dong, Lucas Bandarkar, Artin Kim, I-Hung Hsu, Kai-Wei Chang, Nanyun Peng
Title: SNaRe: Domain-aware Data Generation for Low-Resource Event Detection
Abstract:
Event Detection (ED) -- the task of identifying event mentions from natural language text -- is critical for enabling reasoning in highly specialized domains such as biomedicine, law, and epidemiology. Data generation has proven to be effective in broadening its utility to wider applications without requiring expensive expert annotations. However, when existing generation approaches are applied to specialized domains, they struggle with label noise, where annotations are incorrect, and domain drift, characterized by a distributional mismatch between generated sentences and the target domain. To address these issues, we introduce SNaRe, a domain-aware synthetic data generation framework composed of three components: Scout, Narrator, and Refiner. Scout extracts triggers from unlabeled target domain data and curates a high-quality domain-specific trigger list using corpus-level statistics to mitigate domain drift. Narrator, conditioned on these triggers, generates high-quality domain-aligned sentences, and Refiner identifies additional event mentions, ensuring high annotation quality. Experimentation on three diverse domain ED datasets reveals how SNaRe outperforms the best baseline, achieving average F1 gains of 3-7% in the zero-shot/few-shot settings and 4-20% F1 improvement for multilingual generation. Analyzing the generated trigger hit rate and human evaluation substantiates SNaRe's stronger annotation quality and reduced domain drift.
中文: SNaRe框架通过Scout提取领域特定触发词、Narrator生成对齐句子、Refiner优化标注质量,有效解决了专业领域事件检测中的标签噪声和领域偏移问题,在零样本/少样本及多语言生成任务中实现了显著的F1分数提升。
English: The SNaRe framework addresses label noise and domain drift in specialized event detection by using Scout to curate domain-specific triggers, Narrator to generate aligned sentences, and Refiner to ensure annotation quality, achieving significant F1 score improvements in zero-shot, few-shot, and multilingual settings.

Authors:Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue
Title: Audio-FLAN: A Preliminary Release
Abstract:
Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub and will be continuously updated.
中文: Audio-FLAN 是一个大规模指令调优数据集,旨在统一音频理解与生成任务,为零样本跨领域音频语言模型的开发奠定了基础。
English: Audio-FLAN is a large-scale dataset introduced to unify audio understanding and generation tasks, enabling the development of zero-shot capable audio-language models across diverse domains.

Authors:Ziwei Shan, Yaoyu He, Chengfeng Zhao, Jiashen Du, Jingyan Zhang, Qixuan Zhang, Jingyi Yu, Lan Xu
Title: Mojito: LLM-Aided Motion Instructor with Jitter-Reduced Inertial Tokens
Abstract:
Human bodily movements convey critical insights into action intentions and cognitive processes, yet existing multimodal systems primarily focused on understanding human motion via language, vision, and audio, which struggle to capture the dynamic forces and torques inherent in 3D motion. Inertial measurement units (IMUs) present a promising alternative, offering lightweight, wearable, and privacy-conscious motion sensing. However, processing of streaming IMU data faces challenges such as wireless transmission instability, sensor noise, and drift, limiting their utility for long-term real-time motion capture (MoCap), and more importantly, online motion analysis. To address these challenges, we introduce Mojito, an intelligent motion agent that integrates inertial sensing with large language models (LLMs) for interactive motion capture and behavioral analysis.
中文摘要:该摘要提出Mojito智能运动代理,通过融合惯性传感与大语言模型,解决了传统方法难以捕捉三维运动动态力并实现实时交互式运动分析的局限性。
English Summary: The abstract introduces Mojito, an intelligent motion agent that combines inertial sensing with large language models to overcome limitations in capturing dynamic 3D motion forces and enable interactive motion analysis.

Authors:Eduardo Baena, Paolo Testolina, Michele Polese, Dimitrios Koutsonikolas, Josep Jornet, Tommaso Melodia
Title: Space-O-RAN: Enabling Intelligent, Open, and Interoperable Non Terrestrial Networks in 6G
Abstract:
Satellite networks are rapidly evolving, yet most \glspl{ntn} remain isolated from terrestrial orchestration frameworks. Their control architectures are typically monolithic and static, limiting their adaptability to dynamic traffic, topology changes, and mission requirements. These constraints lead to inefficient spectrum use and underutilized network capacity. Although \gls{ai} promises automation, its deployment in orbit is limited by computing, energy, and connectivity limitations. This paper introduces Space-O-RAN, a distributed control architecture that extends Open RAN principles into satellite constellations through hierarchical, closed-loop control. Lightweight \glspl{dapp} operate onboard satellites, enabling real-time functions like scheduling and beam steering without relying on persistent ground access. Cluster-level coordination is managed via \glspl{spaceric}, which leverage low-latency \glspl{isl} for autonomous decisions in orbit. Strategic tasks, including AI training and policy updates, are transferred to terrestrial platforms \glspl{smo} using digital twins and feeder links. A key enabler is the dynamic mapping of the O-RAN interfaces to satellite links, supporting adaptive signaling under varying conditions. Simulations using the Starlink topology validate the latency bounds that inform this architectural split, demonstrating both feasibility and scalability for autonomous satellite RAN operations.
中文摘要:卫星网络因孤立且静态的控制架构面临适应性和效率限制,而Space-O-RAN通过星载轻量应用和轨道协同的分布式系统实现自主运行,同时将复杂任务卸载至地面平台。
English Summary: Satellite networks face limitations in adaptability and efficiency due to isolated, static control architectures, but Space-O-RAN introduces a distributed system using lightweight onboard applications and orbital coordination to enable autonomous operations while offloading complex tasks to ground platforms.

Authors:Kaixin Yao, Longwen Zhang, Xinhao Yan, Yan Zeng, Qixuan Zhang, Wei Yang, Lan Xu, Jiayuan Gu, Jingyi Yu
Title: CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image
Abstract:
Recovering high-quality 3D scenes from a single RGB image is a challenging task in computer graphics. Current methods often struggle with domain-specific limitations or low-quality object generation. To address these, we propose CAST (Component-Aligned 3D Scene Reconstruction from a Single RGB Image), a novel method for 3D scene reconstruction and recovery. CAST starts by extracting object-level 2D segmentation and relative depth information from the input image, followed by using a GPT-based model to analyze inter-object spatial relationships. This enables the understanding of how objects relate to each other within the scene, ensuring more coherent reconstruction. CAST then employs an occlusion-aware large-scale 3D generation model to independently generate each object's full geometry, using MAE and point cloud conditioning to mitigate the effects of occlusions and partial object information, ensuring accurate alignment with the source image's geometry and texture. To align each object with the scene, the alignment generation model computes the necessary transformations, allowing the generated meshes to be accurately placed and integrated into the scene's point cloud. Finally, CAST incorporates a physics-aware correction step that leverages a fine-grained relation graph to generate a constraint graph. This graph guides the optimization of object poses, ensuring physical consistency and spatial coherence. By utilizing Signed Distance Fields (SDF), the model effectively addresses issues such as occlusions, object penetration, and floating objects, ensuring that the generated scene accurately reflects real-world physical interactions. CAST can be leveraged in robotics, enabling efficient real-to-simulation workflows and providing realistic, scalable simulation environments for robotic systems.
中文: CAST方法通过对象分割、空间关系分析和物理感知优化,从单张RGB图像重建高质量3D场景,确保几何精度和物理一致性,可应用于机器人仿真等领域。
English: The proposed CAST method reconstructs high-quality 3D scenes from single RGB images by leveraging object segmentation, spatial relationship analysis, and physics-aware optimization to ensure geometric accuracy and physical consistency.

Authors:Long-Tung Vuong, Vy Vo, Hien Dang, Van-Anh Nguyen, Thanh-Toan Do, Mehrtash Harandi, Trung Le, Dinh Phung
Title: Why Domain Generalization Fail? A View of Necessity and Sufficiency
Abstract:
Despite a strong theoretical foundation, empirical experiments reveal that existing domain generalization (DG) algorithms often fail to consistently outperform the ERM baseline. We argue that this issue arises because most DG studies focus on establishing theoretical guarantees for generalization under unrealistic assumptions, such as the availability of sufficient, diverse (or even infinite) domains or access to target domain knowledge. As a result, the extent to which domain generalization is achievable in scenarios with limited domains remains largely unexplored. This paper seeks to address this gap by examining generalization through the lens of the conditions necessary for its existence and learnability. Specifically, we systematically establish a set of necessary and sufficient conditions for generalization. Our analysis highlights that existing DG methods primarily act as regularization mechanisms focused on satisfying sufficient conditions, while often neglecting necessary ones. However, sufficient conditions cannot be verified in settings with limited training domains. In such cases, regularization targeting sufficient conditions aims to maximize the likelihood of generalization, whereas regularization targeting necessary conditions ensures its existence. Using this analysis, we reveal the shortcomings of existing DG algorithms by showing that, while they promote sufficient conditions, they inadvertently violate necessary conditions. To validate our theoretical insights, we propose a practical method that promotes the sufficient condition while maintaining the necessary conditions through a novel subspace representation alignment strategy. This approach highlights the advantages of preserving the necessary conditions on well-established DG benchmarks.
中文: 现有领域泛化方法因依赖不切实际的假设并侧重充分条件而忽视必要条件,导致性能不佳,本研究通过提出一种新的子空间表示对齐策略,在保持必要条件的同时促进充分条件,从而在基准测试中展现优势。
English: Current domain generalization methods often fail to outperform empirical risk minimization due to their focus on unrealistic assumptions and sufficient conditions while neglecting necessary ones, prompting this study to propose a novel alignment strategy that maintains both conditions for improved performance.

Authors:Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li
Title: MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
Abstract:
Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess the reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: 1) Models with reflection mechanism demonstrate a superior CoT quality, with Kimi k1.5 outperforming GPT-4o and demonstrating the highest quality results; 2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; and 3) Although the CoT quality is high, LMMs with reflection exhibit significant inefficiency in both normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/
中文摘要:MME-CoT基准首次系统评估了大型多模态模型的思维链推理能力,发现具备反思机制的模型虽能提高推理质量,但在感知密集型任务中表现下降且存在效率低下的问题。
English Summary: The MME-CoT benchmark systematically evaluates Chain-of-Thought reasoning in Large Multimodal Models, revealing that reflective models achieve higher reasoning quality but suffer from efficiency issues and performance degradation on perception-heavy tasks.

Authors:Zitao Li, Fei Wei, Yuexiang Xie, Dawei Gao, Weirui Kuang, Zhijian Ma, Bingchen Qian, Yaliang Li, Bolin Ding
Title: KIMAs: A Configurable Knowledge Integrated Multi-Agent System
Abstract:
Knowledge-intensive conversations supported by large language models (LLMs) have become one of the most popular and helpful applications that can assist people in different aspects. Many current knowledge-intensive applications are centered on retrieval-augmented generation (RAG) techniques. While many open-source RAG frameworks facilitate the development of RAG-based applications, they often fall short in handling practical scenarios complicated by heterogeneous data in topics and formats, conversational context management, and the requirement of low-latency response times. This technical report presents a configurable knowledge integrated multi-agent system, KIMAs, to address these challenges. KIMAs features a flexible and configurable system for integrating diverse knowledge sources with 1) context management and query rewrite mechanisms to improve retrieval accuracy and multi-turn conversational coherency, 2) efficient knowledge routing and retrieval, 3) simple but effective filter and reference generation mechanisms, and 4) optimized parallelizable multi-agent pipeline execution. Our work provides a scalable framework for advancing the deployment of LLMs in real-world settings. To show how KIMAs can help developers build knowledge-intensive applications with different scales and emphases, we demonstrate how we configure the system to three applications already running in practice with reliable performance.
中文摘要:KIMAs是一个可配置的多智能体系统,通过增强上下文管理、高效知识检索和优化流水线执行,有效解决了现有RAG框架在处理异构数据、多轮对话和低延迟需求方面的不足,为实际应用提供了可扩展的解决方案。
English Summary: KIMAs is a configurable multi-agent system designed to overcome the limitations of current RAG frameworks by integrating diverse knowledge sources with enhanced context management, efficient retrieval, and optimized pipeline execution for scalable real-world applications.

Authors:Yu Hong, Yize Wu, Zhehao Shen, Chengcheng Guo, Yuheng Jiang, Yingliang Zhang, Jingyi Yu, Lan Xu
Title: BEAM: Bridging Physically-based Rendering and Gaussian Modeling for Relightable Volumetric Video
Abstract:
Volumetric video enables immersive experiences by capturing dynamic 3D scenes, enabling diverse applications for virtual reality, education, and telepresence. However, traditional methods struggle with fixed lighting conditions, while neural approaches face trade-offs in efficiency, quality, or adaptability for relightable scenarios. To address these limitations, we present BEAM, a novel pipeline that bridges 4D Gaussian representations with physically-based rendering (PBR) to produce high-quality, relightable volumetric videos from multi-view RGB footage. BEAM recovers detailed geometry and PBR properties via a series of available Gaussian-based techniques. It first combines Gaussian-based human performance tracking with geometry-aware rasterization in a coarse-to-fine optimization framework to recover spatially and temporally consistent geometries. We further enhance Gaussian attributes by incorporating PBR properties step by step. We generate roughness via a multi-view-conditioned diffusion model, and then derive AO and base color using a 2D-to-3D strategy, incorporating a tailored Gaussian-based ray tracer for efficient visibility computation. Once recovered, these dynamic, relightable assets integrate seamlessly into traditional CG pipelines, supporting real-time rendering with deferred shading and offline rendering with ray tracing. By offering realistic, lifelike visualizations under diverse lighting conditions, BEAM opens new possibilities for interactive entertainment, storytelling, and creative visualization.
中文:BEAM是一种创新流程,将4D高斯表示与基于物理的渲染相结合,从多视角素材生成高质量、可重照明的立体视频,为虚拟现实和互动娱乐等应用提供多样化光照下的逼真可视化效果。
English: BEAM is a novel pipeline that integrates 4D Gaussian representations with physically-based rendering to create high-quality, relightable volumetric videos from multi-view footage, enabling realistic visualizations under various lighting conditions for applications like virtual reality and interactive entertainment.

Authors:Dongyang Liu, Shicheng Li, Yutong Liu, Zhen Li, Kai Wang, Xinyue Li, Qi Qin, Yufei Liu, Yi Xin, Zhongyu Li, Bin Fu, Chenyang Si, Yuewen Cao, Conghui He, Ziwei Liu, Yu Qiao, Qibin Hou, Hongsheng Li, Peng Gao
Title: Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT
Abstract:
Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in the generation of photorealistic images with Next-DiT. However, its potential for video generation remains largely untapped, with significant challenges in modeling the spatiotemporal complexity inherent to video data. To address this, we introduce Lumina-Video, a framework that leverages the strengths of Next-DiT while introducing tailored solutions for video synthesis. Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications to enhance both efficiency and flexibility. By incorporating the motion score as an explicit condition, Lumina-Video also enables direct control of generated videos' dynamic degree. Combined with a progressive training scheme with increasingly higher resolution and FPS, and a multi-source training scheme with mixed natural and synthetic data, Lumina-Video achieves remarkable aesthetic quality and motion smoothness at high training and inference efficiency. We additionally propose Lumina-V2A, a video-to-audio model based on Next-DiT, to create synchronized sounds for generated videos. Codes are released at https://www.github.com/Alpha-VLLM/Lumina-Video.
中文: Lumina-Video 是一个基于 Next-DiT 的新型视频生成框架,通过多尺度架构、运动条件控制和渐进式训练,实现了高质量且动态可控的视频生成,并配合 Lumina-V2A 模型生成同步音频。
English: Lumina-Video is a novel framework that adapts the Next-DiT architecture for efficient video generation, incorporating multi-scale patch learning, motion conditioning, and progressive training to achieve high-quality, dynamically controllable videos with synchronized audio via Lumina-V2A.

Authors:Pengyu Long, Zijun Zhao, Min Ouyang, Qingcheng Zhao, Qixuan Zhang, Wei Yang, Lan Xu, Jingyi Yu
Title: TANGLED: Generating 3D Hair Strands from Images with Arbitrary Styles and Viewpoints
Abstract:
Hairstyles are intricate and culturally significant with various geometries, textures, and structures. Existing text or image-guided generation methods fail to handle the richness and complexity of diverse styles. We present TANGLED, a novel approach for 3D hair strand generation that accommodates diverse image inputs across styles, viewpoints, and quantities of input views. TANGLED employs a three-step pipeline. First, our MultiHair Dataset provides 457 diverse hairstyles annotated with 74 attributes, emphasizing complex and culturally significant styles to improve model generalization. Second, we propose a diffusion framework conditioned on multi-view linearts that can capture topological cues (e.g., strand density and parting lines) while filtering out noise. By leveraging a latent diffusion model with cross-attention on lineart features, our method achieves flexible and robust 3D hair generation across diverse input conditions. Third, a parametric post-processing module enforces braid-specific constraints to maintain coherence in complex structures. This framework not only advances hairstyle realism and diversity but also enables culturally inclusive digital avatars and novel applications like sketch-based 3D strand editing for animation and augmented reality.
中文摘要:TANGLED提出了一种三步流程,利用扩散框架和多视角线稿生成多样化三维发型,提升了真实感并实现了文化包容性数字形象。
English Summary: TANGLED introduces a three-step pipeline using a diffusion framework and multi-view linearts to generate diverse 3D hairstyles, advancing realism and enabling culturally inclusive digital avatars.

Authors:Lingbing Guo, Yichi Zhang, Zhongpu Bo, Zhuo Chen, Mengshu Sun, Zhiqiang Zhang, Wen Zhang, Huajun Chen
Title: K-ON: Stacking Knowledge On the Head Layer of Large Language Model
Abstract:
Recent advancements in large language models (LLMs) have significantly improved various natural language processing (NLP) tasks. Typically, LLMs are trained to predict the next token, aligning well with many NLP tasks. However, in knowledge graph (KG) scenarios, entities are the fundamental units and identifying an entity requires at least several tokens. This leads to a granularity mismatch between KGs and natural languages. To address this issue, we propose K-ON, which integrates KG knowledge into the LLM by employing multiple head layers for next k-step prediction. K-ON can not only generate entity-level results in one step, but also enables contrastive loss against entities, which is the most powerful tool in KG representation learning. Experimental results show that K-ON outperforms state-of-the-art methods that incorporate text and even the other modalities.
中文: K-ON方法通过多头部层进行k步预测,将知识图谱知识融入大语言模型,解决了其与自然语言的粒度不匹配问题,并在实验中超越了现有最佳多模态方法。
English: The proposed K-ON method addresses the granularity mismatch between knowledge graphs and natural language by integrating KG knowledge into large language models through multi-head layers for next k-step prediction, achieving superior performance over state-of-the-art multimodal methods.

Authors:Xinyu Liu, Ailing Zeng, Wei Xue, Harry Yang, Wenhan Luo, Qifeng Liu, Yike Guo
Title: VFX Creator: Animated Visual Effect Generation with Controllable Diffusion Transformer
Abstract:
Crafting magic and illusions is one of the most thrilling aspects of filmmaking, with visual effects (VFX) serving as the powerhouse behind unforgettable cinematic experiences. While recent advances in generative artificial intelligence have driven progress in generic image and video synthesis, the domain of controllable VFX generation remains relatively underexplored. In this work, we propose a novel paradigm for animated VFX generation as image animation, where dynamic effects are generated from user-friendly textual descriptions and static reference images. Our work makes two primary contributions: (i) Open-VFX, the first high-quality VFX video dataset spanning 15 diverse effect categories, annotated with textual descriptions, instance segmentation masks for spatial conditioning, and start-end timestamps for temporal control. (ii) VFX Creator, a simple yet effective controllable VFX generation framework based on a Video Diffusion Transformer. The model incorporates a spatial and temporal controllable LoRA adapter, requiring minimal training videos. Specifically, a plug-and-play mask control module enables instance-level spatial manipulation, while tokenized start-end motion timestamps embedded in the diffusion process, alongside the text encoder, allow precise temporal control over effect timing and pace. Extensive experiments on the Open-VFX test set demonstrate the superiority of the proposed system in generating realistic and dynamic effects, achieving state-of-the-art performance and generalization ability in both spatial and temporal controllability. Furthermore, we introduce a specialized metric to evaluate the precision of temporal control. By bridging traditional VFX techniques with generative approaches, VFX Creator unlocks new possibilities for efficient and high-quality video effect generation, making advanced VFX accessible to a broader audience.
中文: 本研究提出了一种基于图像动画的视觉特效生成新方法,结合文本描述和静态参考图像,并推出了首个高质量数据集Open-VFX及基于视频扩散变换器的VFX Creator框架,实现了精确的时空控制,显著提升了特效生成的效率与质量。
English: This study introduces a novel approach for generating animated visual effects through image animation, utilizing text descriptions and static references, and presents Open-VFX, a comprehensive dataset, and VFX Creator, an efficient framework based on a Video Diffusion Transformer for precise spatial and temporal control.

Authors:Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue
Title: Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
Abstract:
Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework Llasa for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we released the checkpoint and training code for our TTS model (1B, 3B, 8B) and codec model publicly available.
中文: 近期基于文本的大语言模型显示,在训练和推理阶段扩展计算资源是有效的,但当前利用大语言模型的TTS系统多为多阶段且复杂;本研究提出Llasa框架,采用单层VQ编解码器和Transformer架构,扩展训练时计算可提升语音自然度和韵律,而推理时计算通过验证器增强情感表现力和准确性。
English: Recent text-based LLMs show that scaling compute during training and inference is effective, but current TTS systems using LLMs are multi-stage and complex; this work introduces Llasa, a simple framework with a single-layer VQ codec and Transformer, where scaling train-time compute improves speech naturalness and prosody, and inference-time compute enhances emotional expressiveness and accuracy through verifiers.

Authors:Junkun Jiang, Jie Chen, Ho Yin Au, Mingyuan Chen, Wei Xue, Yike Guo
Title: Every Angle Is Worth A Second Glance: Mining Kinematic Skeletal Structures from Multi-view Joint Cloud
Abstract:
Multi-person motion capture over sparse angular observations is a challenging problem under interference from both self- and mutual-occlusions. Existing works produce accurate 2D joint detection, however, when these are triangulated and lifted into 3D, available solutions all struggle in selecting the most accurate candidates and associating them to the correct joint type and target identity. As such, in order to fully utilize all accurate 2D joint location information, we propose to independently triangulate between all same-typed 2D joints from all camera views regardless of their target ID, forming the Joint Cloud. Joint Cloud consist of both valid joints lifted from the same joint type and target ID, as well as falsely constructed ones that are from different 2D sources. These redundant and inaccurate candidates are processed over the proposed Joint Cloud Selection and Aggregation Transformer (JCSAT) involving three cascaded encoders which deeply explore the trajectile, skeletal structural, and view-dependent correlations among all 3D point candidates in the cross-embedding space. An Optimal Token Attention Path (OTAP) module is proposed which subsequently selects and aggregates informative features from these redundant observations for the final prediction of human motion. To demonstrate the effectiveness of JCSAT, we build and publish a new multi-person motion capture dataset BUMocap-X with complex interactions and severe occlusions. Comprehensive experiments over the newly presented as well as benchmark datasets validate the effectiveness of the proposed framework, which outperforms all existing state-of-the-art methods, especially under challenging occlusion scenarios.
中文: 提出的JCSAT框架通过将所有同类2D关节点三角化为关节云,并利用级联编码器和OTAP模块处理冗余观测,有效解决了多人运动捕捉中的遮挡难题,在复杂遮挡场景下显著优于现有方法。
English: The proposed JCSAT framework addresses multi-person motion capture challenges by triangulating all same-typed 2D joints into a Joint Cloud and processing them through cascaded encoders with an OTAP module, outperforming existing methods especially in occlusion scenarios.

Authors:Jindong Li, Tenglong Li, Guobin Shen, Dongcheng Zhao, Qian Zhang, Yi Zeng
Title: Pushing up to the Limit of Memory Bandwidth and Capacity Utilization for Efficient LLM Decoding on Embedded FPGA
Abstract:
The extremely high computational and storage demands of large language models have excluded most edge devices, which were widely used for efficient machine learning, from being viable options. A typical edge device usually only has 4GB of memory capacity and a bandwidth of less than 20GB/s, while a large language model quantized to 4-bit precision with 7B parameters already requires 3.5GB of capacity, and its decoding process is purely bandwidth-bound. In this paper, we aim to explore these limits by proposing a hardware accelerator for large language model (LLM) inference on the Zynq-based KV260 platform, equipped with 4GB of 64-bit 2400Mbps DDR4 memory. We successfully deploy a LLaMA2-7B model, achieving a decoding speed of around 5 token/s, utilizing 93.3% of the memory capacity and reaching 85% decoding speed of the theoretical memory bandwidth limit. To fully reserve the memory capacity for model weights and key-value cache, we develop the system in a bare-metal environment without an operating system. To fully reserve the bandwidth for model weight transfers, we implement a customized dataflow with an operator fusion pipeline and propose a data arrangement format that can maximize the data transaction efficiency. This research marks the first attempt to deploy a 7B level LLM on a standalone embedded field programmable gate array (FPGA) device. It provides key insights into efficient LLM inference on embedded FPGA devices and provides guidelines for future architecture design.
中文摘要:本研究首次在独立嵌入式FPGA设备上部署了70亿参数大语言模型,通过定制数据流和裸机系统实现了接近理论极限的解码性能,为嵌入式设备的高效推理提供了关键方案。
English Summary: This paper presents a hardware accelerator that successfully deploys a 7B-parameter LLM on a memory-constrained edge device, achieving near-optimal performance through customized dataflow and bare-metal implementation.

Authors:Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, Xinda Xue, Qinghang Su, Huaihai Lyu, Xiaolong Zheng, Jiaming Liu, Zhongyuan Wang, Shanghang Zhang
Title: RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
Abstract:
Recent advancements in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various multimodal contexts. However, their application in robotic scenarios, particularly for long-horizon manipulation tasks, reveals significant limitations. These limitations arise from the current MLLMs lacking three essential robotic brain capabilities: Planning Capability, which involves decomposing complex manipulation instructions into manageable sub-tasks; Affordance Perception, the ability to recognize and interpret the affordances of interactive objects; and Trajectory Prediction, the foresight to anticipate the complete manipulation trajectory necessary for successful execution. To enhance the robotic brain's core capabilities from abstract to concrete, we introduce ShareRobot, a high-quality heterogeneous dataset that labels multi-dimensional information such as task planning, object affordance, and end-effector trajectory. ShareRobot's diversity and accuracy have been meticulously refined by three human annotators. Building on this dataset, we developed RoboBrain, an MLLM-based model that combines robotic and general multi-modal data, utilizes a multi-stage training strategy, and incorporates long videos and high-resolution images to improve its robotic manipulation capabilities. Extensive experiments demonstrate that RoboBrain achieves state-of-the-art performance across various robotic tasks, highlighting its potential to advance robotic brain capabilities.
中文: 当前多模态大语言模型在机器人操作任务中存在规划、功能感知和轨迹预测能力不足的局限,但通过ShareRobot数据集和RoboBrain模型的开发,显著提升了这些核心机器人能力并实现了最先进的性能表现。
English: Recent MLLMs show limitations in robotic manipulation tasks due to lacking planning, affordance perception, and trajectory prediction capabilities, but the introduction of ShareRobot dataset and RoboBrain model achieves state-of-the-art performance by enhancing these core robotic functions.

Authors:Yifan Duan, Yihong Tang, Xuefeng Bai, Kehai Chen, Juntao Li, Min Zhang
Title: The Power of Personality: A Human Simulation Perspective to Investigate Large Language Model Agents
Abstract:
Large language models (LLMs) excel in both closed tasks (including problem-solving, and code generation) and open tasks (including creative writing), yet existing explanations for their capabilities lack connections to real-world human intelligence. To fill this gap, this paper systematically investigates LLM intelligence through the lens of ``human simulation'', addressing three core questions: (1) \textit{How do personality traits affect problem-solving in closed tasks?} (2) \textit{How do traits shape creativity in open tasks?} (3) \textit{How does single-agent performance influence multi-agent collaboration?} By assigning Big Five personality traits to LLM agents and evaluating their performance in single- and multi-agent settings, we reveal that specific traits significantly influence reasoning accuracy (closed tasks) and creative output (open tasks). Furthermore, multi-agent systems exhibit collective intelligence distinct from individual capabilities, driven by distinguishing combinations of personalities.
中文: 本文通过人类模拟系统研究大语言模型智能,发现特定人格特质显著影响封闭任务中的推理准确性和开放任务中的创造力,同时多智能体系统因个性组合展现出区别于个体的集体智能。
English: This paper systematically explores large language model intelligence through human simulation, revealing how specific personality traits influence reasoning accuracy in closed tasks and creativity in open tasks, while multi-agent systems demonstrate unique collective intelligence shaped by personality combinations.

Authors:Yihong Tang, Kehai Chen, Xuefeng Bai, Zhengyu Niu, Bo Wang, Jie Liu, Min Zhang
Title: The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents
Abstract:
Large Language Models (LLMs) have made remarkable advances in role-playing dialogue agents, demonstrating their utility in character simulations. However, it remains challenging for these agents to balance character portrayal utility with content safety because this essential character simulation often comes with the risk of generating unsafe content. To address this issue, we first conduct a systematic exploration of the safety-utility trade-off across multiple LLMs. Our analysis reveals that risk scenarios created by villain characters and user queries (referred to as risk coupling) contribute to this trade-off. Building on this, we propose a novel Adaptive Dynamic Multi-Preference (ADMP) method, which dynamically adjusts safety-utility preferences based on the degree of risk coupling and guides the model to generate responses biased toward utility or safety. We further introduce Coupling Margin Sampling (CMS) into coupling detection to enhance the model's ability to handle high-risk scenarios. Experimental results demonstrate that our approach improves safety metrics while maintaining utility.
中文:本研究提出了一种自适应动态多偏好方法,通过基于风险场景动态调整偏好,帮助角色扮演对话代理在保持角色真实性的同时确保内容安全,从而在不牺牲实用性的前提下提升安全性。
English: The study introduces an Adaptive Dynamic Multi-Preference method to help role-playing dialogue agents balance character authenticity with content safety by dynamically adjusting preferences based on risk scenarios, improving safety without compromising utility.

Authors:Yuan Li, Cheng Lin, Yuan Liu, Xiaoxiao Long, Chenxu Zhang, Ningna Wang, Xin Li, Wenping Wang, Xiaohu Guo
Title: CADDreamer: CAD Object Generation from Single-view Images
Abstract:
Diffusion-based 3D generation has made remarkable progress in recent years. However, existing 3D generative models often produce overly dense and unstructured meshes, which stand in stark contrast to the compact, structured, and sharply-edged Computer-Aided Design (CAD) models crafted by human designers. To address this gap, we introduce CADDreamer, a novel approach for generating boundary representations (B-rep) of CAD objects from a single image. CADDreamer employs a primitive-aware multi-view diffusion model that captures both local geometric details and high-level structural semantics during the generation process. By encoding primitive semantics into the color domain, the method leverages the strong priors of pre-trained diffusion models to align with well-defined primitives. This enables the inference of multi-view normal maps and semantic maps from a single image, facilitating the reconstruction of a mesh with primitive labels. Furthermore, we introduce geometric optimization techniques and topology-preserving extraction methods to mitigate noise and distortion in the generated primitives. These enhancements result in a complete and seamless B-rep of the CAD model. Experimental results demonstrate that our method effectively recovers high-quality CAD objects from single-view images. Compared to existing 3D generation techniques, the B-rep models produced by CADDreamer are compact in representation, clear in structure, sharp in edges, and watertight in topology.
中文:CADDreamer是一种创新方法,通过融合多视角扩散与基元语义从单张图像生成结构化、紧凑的边界表示CAD模型,其输出的模型具有拓扑密封和锐利边缘特性,显著优于现有三维生成技术。
English: CADDreamer is a novel method that generates structured and compact boundary representation CAD models from single images by leveraging multi-view diffusion with primitive semantics, producing watertight and sharply-edged results superior to existing 3D generation techniques.

Authors:Ke Niu, Haiyang Yu, Mengyang Zhao, Teng Fu, Siyang Yi, Wei Lu, Bin Li, Xuelin Qian, Xiangyang Xue
Title: ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models
Abstract:
Person re-identification (Re-ID) is a crucial task in computer vision, aiming to recognize individuals across non-overlapping camera views. While recent advanced vision-language models (VLMs) excel in logical reasoning and multi-task generalization, their applications in Re-ID tasks remain limited. They either struggle to perform accurate matching based on identity-relevant features or assist image-dominated branches as auxiliary semantics. In this paper, we propose a novel framework ChatReID, that shifts the focus towards a text-side-dominated retrieval paradigm, enabling flexible and interactive re-identification. To integrate the reasoning abilities of language models into Re-ID pipelines, We first present a large-scale instruction dataset, which contains more than 8 million prompts to promote the model fine-tuning. Next. we introduce a hierarchical progressive tuning strategy, which endows Re-ID ability through three stages of tuning, i.e., from person attribute understanding to fine-grained image retrieval and to multi-modal task reasoning. Extensive experiments across ten popular benchmarks demonstrate that ChatReID outperforms existing methods, achieving state-of-the-art performance in all Re-ID tasks. More experiments demonstrate that ChatReID not only has the ability to recognize fine-grained details but also to integrate them into a coherent reasoning process.
中文: 本文提出ChatReID框架,通过采用文本主导的检索范式及分层渐进式调优策略,将语言模型的推理能力融入行人重识别任务,在多个基准测试中实现了最优性能。
English: This paper introduces ChatReID, a novel framework that leverages a text-side-dominated retrieval paradigm and hierarchical progressive tuning to enhance person re-identification, achieving state-of-the-art results across multiple benchmarks by integrating reasoning capabilities from vision-language models.

Authors:Zihan Wang, Ziqi Zhao, Yougang Lyu, Zhumin Chen, Maarten de Rijke, Zhaochun Ren
Title: A Cooperative Multi-Agent Framework for Zero-Shot Named Entity Recognition
Abstract:
Zero-shot named entity recognition (NER) aims to develop entity recognition systems from unannotated text corpora. This task presents substantial challenges due to minimal human intervention. Recent work has adapted large language models (LLMs) for zero-shot NER by crafting specialized prompt templates. It advances model self-learning abilities by incorporating self-annotated demonstrations. However, two important challenges persist: (i) Correlations between contexts surrounding entities are overlooked, leading to wrong type predictions or entity omissions. (ii) The indiscriminate use of task demonstrations, retrieved through shallow similarity-based strategies, severely misleads LLMs during inference. In this paper, we introduce the cooperative multi-agent system (CMAS), a novel framework for zero-shot NER that uses the collective intelligence of multiple agents to address the challenges outlined above. CMAS has four main agents: (i) a self-annotator, (ii) a type-related feature (TRF) extractor, (iii) a demonstration discriminator, and (iv) an overall predictor. To explicitly capture correlations between contexts surrounding entities, CMAS reformulates NER into two subtasks: recognizing named entities and identifying entity type-related features within the target sentence. To enable controllable utilization of demonstrations, a demonstration discriminator is established to incorporate the self-reflection mechanism, automatically evaluating helpfulness scores for the target sentence. Experimental results show that CMAS significantly improves zero-shot NER performance across six benchmarks, including both domain-specific and general-domain scenarios. Furthermore, CMAS demonstrates its effectiveness in few-shot settings and with various LLM backbones.
中文: 协作多智能体系统(CMAS)通过多个专业代理解决上下文关联问题并优化示例选择,显著提升了零样本命名实体识别的性能,在多种基准测试中表现优异。
English: The cooperative multi-agent system (CMAS) enhances zero-shot named entity recognition by addressing context correlation issues and improving demonstration selection through multiple specialized agents, achieving superior performance across diverse benchmarks.

Authors:Zhiyu Yin, Kehai Chen, Xuefeng Bai, Ruili Jiang, Juntao Li, Hongdong Li, Jin Liu, Yang Xiang, Jun Yu, Min Zhang
Title: ASurvey: Spatiotemporal Consistency in Video Generation
Abstract:
Video generation, by leveraging a dynamic visual generation method, pushes the boundaries of Artificial Intelligence Generated Content (AIGC). Video generation presents unique challenges beyond static image generation, requiring both high-quality individual frames and temporal coherence to maintain consistency across the spatiotemporal sequence. Recent works have aimed at addressing the spatiotemporal consistency issue in video generation, while few literature review has been organized from this perspective. This gap hinders a deeper understanding of the underlying mechanisms for high-quality video generation. In this survey, we systematically review the recent advances in video generation, covering five key aspects: foundation models, information representations, generation schemes, post-processing techniques, and evaluation metrics. We particularly focus on their contributions to maintaining spatiotemporal consistency. Finally, we discuss the future directions and challenges in this field, hoping to inspire further efforts to advance the development of video generation.
中文摘要:本综述系统回顾了视频生成领域的最新进展,特别聚焦于不同方法如何通过五大技术维度解决保持时空一致性的核心挑战。
English Summary: This survey systematically reviews recent advances in video generation, focusing on how different approaches address the critical challenge of maintaining spatiotemporal consistency across five key technical dimensions.

Authors:Jiaxi Li, Xingxing Zhang, Xun Wang, Xiaolong Huang, Li Dong, Liang Wang, Si-Qing Chen, Wei Lu, Furu Wei
Title: WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale
Abstract:
Large language models (LLMs) with extended context windows enable tasks requiring extensive information integration but are limited by the scarcity of high-quality, diverse datasets for long-context instruction tuning. Existing data synthesis methods focus narrowly on objectives like fact retrieval and summarization, restricting their generalizability to complex, real-world tasks. WildLong extracts meta-information from real user queries, models co-occurrence relationships via graph-based methods, and employs adaptive generation to produce scalable data. It extends beyond single-document tasks to support multi-document reasoning, such as cross-document comparison and aggregation. Our models, finetuned on 150K instruction-response pairs synthesized using WildLong, surpasses existing open-source long-context-optimized models across benchmarks while maintaining strong performance on short-context tasks without incorporating supplementary short-context data. By generating a more diverse and realistic long-context instruction dataset, WildLong enhances LLMs' ability to generalize to complex, real-world reasoning over long contexts, establishing a new paradigm for long-context data synthesis.
Chinese: WildLong通过提取真实用户查询的元信息、采用图建模方法并自适应生成数据,创新性地合成了多样化的长上下文指令数据集,使大语言模型在复杂多文档推理任务中表现卓越,同时保持短上下文任务的强劲性能。
English: WildLong introduces a novel method for synthesizing diverse and realistic long-context instruction data by leveraging real user queries and graph-based modeling, enabling LLMs to excel in complex, multi-document reasoning tasks without compromising short-context performance.

Authors:Hongbin Zhang, Kehai Chen, Xuefeng Bai, Xiucheng Li, Yang Xiang, Min Zhang
Title: Exploring Translation Mechanism of Large Language Models
Abstract:
Large language models (LLMs) have succeeded remarkably in multilingual translation tasks. However, the inherent translation mechanisms of LLMs remain poorly understood, largely due to sophisticated architectures and vast parameter scales. In response to this issue, this study explores the translation mechanism of LLM from the perspective of computational components (e.g., attention heads and MLPs). Path patching is utilized to explore causal relationships between components, detecting those crucial for translation tasks and subsequently analyzing their behavioral patterns in human-interpretable terms. Comprehensive analysis reveals that translation is predominantly facilitated by a sparse subset of specialized attention heads (less than 5\%), which extract source language, indicator, and positional features. MLPs subsequently integrate and process these features by transiting towards English-centric latent representations. Notably, building on the above findings, targeted fine-tuning of only 64 heads achieves translation improvement comparable to full-parameter tuning while preserving general capabilities.
中文: 大语言模型通过少量专用注意力头提取源语言特征,结合多层感知机将其转化为以英语为核心的潜在表征,仅针对性微调64个注意力头即可达到全参数优化的翻译效果并保持通用能力。
English: Large language models achieve multilingual translation primarily through a small subset of specialized attention heads that extract key linguistic features, with MLPs integrating them into English-centric representations, enabling targeted fine-tuning of just 64 heads to match full-parameter tuning effectiveness.

Authors:Yankai Fu, Qiuxuan Feng, Ning Chen, Zichen Zhou, Mengzhen Liu, Mingdong Wu, Tianxing Chen, Shanyu Rong, Jiaming Liu, Hao Dong, Shanghang Zhang
Title: CordViP: Correspondence-based Visuomotor Policy for Dexterous Manipulation in Real-World
Abstract:
Achieving human-level dexterity in robots is a key objective in the field of robotic manipulation. Recent advancements in 3D-based imitation learning have shown promising results, providing an effective pathway to achieve this goal. However, obtaining high-quality 3D representations presents two key problems: (1) the quality of point clouds captured by a single-view camera is significantly affected by factors such as camera resolution, positioning, and occlusions caused by the dexterous hand; (2) the global point clouds lack crucial contact information and spatial correspondences, which are necessary for fine-grained dexterous manipulation tasks. To eliminate these limitations, we propose CordViP, a novel framework that constructs and learns correspondences by leveraging the robust 6D pose estimation of objects and robot proprioception. Specifically, we first introduce the interaction-aware point clouds, which establish correspondences between the object and the hand. These point clouds are then used for our pre-training policy, where we also incorporate object-centric contact maps and hand-arm coordination information, effectively capturing both spatial and temporal dynamics. Our method demonstrates exceptional dexterous manipulation capabilities, achieving state-of-the-art performance in six real-world tasks, surpassing other baselines by a large margin. Experimental results also highlight the superior generalization and robustness of CordViP to different objects, viewpoints, and scenarios. Code and videos are available on https://aureleopku.github.io/CordViP.
中文摘要:CordViP框架通过6D姿态估计和本体感知建立物体与手的对应关系,克服了三维表示质量对灵巧操作的制约,在多种现实任务中实现了最先进的机器人灵巧操控性能。
English Summary: The CordViP framework overcomes limitations in 3D representation quality for robotic manipulation by establishing object-hand correspondences through 6D pose estimation and proprioception, achieving state-of-the-art dexterous manipulation performance across diverse real-world tasks.

Authors:Zilong Wang, Zhiyang Dou, Yuan Liu, Cheng Lin, Xiao Dong, Yunhui Guo, Chenxu Zhang, Xin Li, Wenping Wang, Xiaohu Guo
Title: WonderHuman: Hallucinating Unseen Parts in Dynamic 3D Human Reconstruction
Abstract:
In this paper, we present WonderHuman to reconstruct dynamic human avatars from a monocular video for high-fidelity novel view synthesis. Previous dynamic human avatar reconstruction methods typically require the input video to have full coverage of the observed human body. However, in daily practice, one typically has access to limited viewpoints, such as monocular front-view videos, making it a cumbersome task for previous methods to reconstruct the unseen parts of the human avatar. To tackle the issue, we present WonderHuman, which leverages 2D generative diffusion model priors to achieve high-quality, photorealistic reconstructions of dynamic human avatars from monocular videos, including accurate rendering of unseen body parts. Our approach introduces a Dual-Space Optimization technique, applying Score Distillation Sampling (SDS) in both canonical and observation spaces to ensure visual consistency and enhance realism in dynamic human reconstruction. Additionally, we present a View Selection strategy and Pose Feature Injection to enforce the consistency between SDS predictions and observed data, ensuring pose-dependent effects and higher fidelity in the reconstructed avatar. In the experiments, our method achieves SOTA performance in producing photorealistic renderings from the given monocular video, particularly for those challenging unseen parts. The project page and source code can be found at https://wyiguanw.github.io/WonderHuman/.
中文: WonderHuman通过利用2D生成扩散模型先验和双空间优化技术,从单目视频中重建动态人体化身,精确渲染未观察到的身体部位,实现了最先进的逼真效果。
English: WonderHuman reconstructs dynamic human avatars from monocular videos using 2D generative diffusion priors and dual-space optimization to accurately render unseen body parts, achieving state-of-the-art photorealistic results.

Authors:Zilong Wang, Zhiyang Dou, Yuan Liu, Cheng Lin, Xiao Dong, Yunhui Guo, Chenxu Zhang, Xin Li, Wenping Wang, Xiaohu Guo
Title: WonderHuman: Hallucinating Unseen Parts in Dynamic 3D Human Reconstruction
Abstract:
In this paper, we present WonderHuman to reconstruct dynamic human avatars from a monocular video for high-fidelity novel view synthesis. Previous dynamic human avatar reconstruction methods typically require the input video to have full coverage of the observed human body. However, in daily practice, one typically has access to limited viewpoints, such as monocular front-view videos, making it a cumbersome task for previous methods to reconstruct the unseen parts of the human avatar. To tackle the issue, we present WonderHuman, which leverages 2D generative diffusion model priors to achieve high-quality, photorealistic reconstructions of dynamic human avatars from monocular videos, including accurate rendering of unseen body parts. Our approach introduces a Dual-Space Optimization technique, applying Score Distillation Sampling (SDS) in both canonical and observation spaces to ensure visual consistency and enhance realism in dynamic human reconstruction. Additionally, we present a View Selection strategy and Pose Feature Injection to enforce the consistency between SDS predictions and observed data, ensuring pose-dependent effects and higher fidelity in the reconstructed avatar. In the experiments, our method achieves SOTA performance in producing photorealistic renderings from the given monocular video, particularly for those challenging unseen parts. The project page and source code can be found at https://wyiguanw.github.io/WonderHuman/.
中文: WonderHuman通过利用2D生成扩散模型先验和双空间优化技术,从单目视频中重建动态人体化身,精确渲染未观察到的身体部位,实现了最先进的逼真效果。
English: WonderHuman reconstructs dynamic human avatars from monocular videos using 2D generative diffusion priors and dual-space optimization to accurately render unseen body parts, achieving state-of-the-art photorealistic results.

Authors:Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E. Gonzalez, Ion Stoica, Song Han, Yao Lu
Title: WorldModelBench: Judging Video Generation Models As World Models
Abstract:
Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality, ignoring important factors to world models such as physics adherence. To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains. WorldModelBench offers two key advantages: (1) Against to nuanced world modeling violations: By incorporating instruction-following and physics-adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the mass conservation law - issues overlooked by prior benchmarks. (2) Aligned with large-scale human preferences: We crowd-source 67K human labels to accurately measure 14 frontier models. Using our high-quality human labels, we further fine-tune an accurate judger to automate the evaluation procedure, achieving 8.6% higher average accuracy in predicting world modeling violations than GPT-4o with 2B parameters. In addition, we demonstrate that training to align human annotations by maximizing the rewards from the judger noticeably improve the world modeling capability. The website is available at https://worldmodelbench-team.github.io.
中文摘要:WorldModelBench作为新型基准测试,通过人类标注数据和超越GPT-4o的自动评判器,严格评估视频生成模型的世界建模能力,重点关注物理规律遵循与指令执行维度。
English Summary: WorldModelBench is introduced as a new benchmark to rigorously evaluate video generation models' world modeling capabilities, focusing on physics adherence and instruction-following through human-annotated data and an automated judger that outperforms GPT-4o.

Authors:Xidong Mu, Zhaolin Wang, Yuanwei Liu
Title: Simultaneously Transmitting And Reflecting Surfaces (STARS) for Multi-Functional 6G
Abstract:
Simultaneously transmitting and reflecting surface (STARS) empowered multi-functional 6G wireless networks are investigated. Starting with the communication functionality, various types of STARS are introduced in terms of power amplification capabilities, reciprocity features, and spatial density of elements. Then, three STARS-empowered wireless sensing architectures are proposed, namely STARS-aided monostatic sensing, STARS-enabled bistatic sensing, and sensing with target-mounted STARS, where the representative benefits and application challenges are identified. Furthermore, promising applications of STARS for computing and caching functionalities are explored to improve the computation efficiency and reduce the content delivery latency. Finally, recent standardization progress for reconfigurable intelligent surfaces is presented for motivating the employment of STARS in multi-functional 6G.
Chinese: 本文摘要探讨了同时传输与反射表面(STARS)在6G网络中的应用,阐述了其在通信、感知、计算和缓存方面的功能,旨在提升网络性能与效率。
English: This abstract explores the use of Simultaneously Transmitting and Reflecting Surfaces (STARS) in 6G networks, detailing their communication, sensing, computing, and caching functionalities to enhance performance and efficiency.

Authors:Xidong Mu, Guangyu Zhu, Yuanwei Liu
Title: Pinching-Antenna System (PASS)-enabled Multicast Communications
Abstract:
Pinching-antenna system (PASS) is a novel flexible-antenna technology, which employs long-spread waveguides to convey signals with negligible path loss and pinching antennas (PAs) with adjustable positions to radiate signals from the waveguide into the free space. Therefore, short-distance and strong line-of-sight transmission can be established. In this paper, a novel PASS-enabled multicast communication framework is proposed, where multiple PAs on a single waveguide radiate the broadcast signals to multiple users. The multicast performance maximization problem is formulated to optimize the positions of all PAs. To address this non-convex problem, a particle swarm optimization-based algorithm is developed. Numerical results show that PASS can significantly outperform the conventional multiple-antenna transmission.
中文摘要:本文提出了一种新型柔性天线系统PASS,通过优化波导上的夹持天线位置实现高效多播通信,采用粒子群优化算法解决非凸问题,数值结果表明其性能显著优于传统多天线传输。
English Summary: The paper introduces a flexible-antenna system called PASS that uses adjustable pinching antennas on waveguides to enable efficient multicast communication, with optimized antenna positioning via particle swarm algorithms showing superior performance over traditional multi-antenna systems.

Authors:Leila Arras, Bruno Puri, Patrick Kahardipraja, Sebastian Lapuschkin, Wojciech Samek
Title: A Close Look at Decomposition-based XAI-Methods for Transformer Language Models
Abstract:
Various XAI attribution methods have been recently proposed for the transformer architecture, allowing for insights into the decision-making process of large language models by assigning importance scores to input tokens and intermediate representations. One class of methods that seems very promising in this direction includes decomposition-based approaches, i.e., XAI-methods that redistribute the model's prediction logit through the network, as this value is directly related to the prediction. In the previous literature we note though that two prominent methods of this category, namely ALTI-Logit and LRP, have not yet been analyzed in juxtaposition and hence we propose to close this gap by conducting a careful quantitative evaluation w.r.t. ground truth annotations on a subject-verb agreement task, as well as various qualitative inspections, using BERT, GPT-2 and LLaMA-3 as a testbed. Along the way we compare and extend the ALTI-Logit and LRP methods, including the recently proposed AttnLRP variant, from an algorithmic and implementation perspective. We further incorporate in our benchmark two widely-used gradient-based attribution techniques. Finally, we make our carefullly constructed benchmark dataset for evaluating attributions on language models, as well as our code, publicly available in order to foster evaluation of XAI-methods on a well-defined common ground.
中文: 本研究在主语-动词一致性任务上对基于分解的XAI方法(特别是ALTI-Logit和LRP)进行了BERT、GPT-2和LLaMA-3模型的对比分析,同时扩展了算法实现并公开基准数据集以推动标准化评估。
English: This study conducts a comparative analysis of decomposition-based XAI methods, particularly ALTI-Logit and LRP, using BERT, GPT-2, and LLaMA-3 models on a subject-verb agreement task, while also extending algorithmic implementations and releasing benchmark datasets to standardize evaluation.

Authors:Pinzheng Wang, Zecheng Tang, Keyan Zhou, Juntao Li, Qiaoming Zhu, Min Zhang
Title: Revealing and Mitigating Over-Attention in Knowledge Editing
Abstract:
Large Language Models have demonstrated superior performance across a wide range of tasks, but they still exhibit undesirable errors due to incorrect knowledge learned from the training data. To avoid this, knowledge editing methods emerged to precisely edit the specific model knowledge via efficiently modifying a very small percentage of parameters. % However, those methods can lead to the problem of Specificity Failure: when the content related to the edited knowledge occurs in the context, it can inadvertently corrupt other pre-existing knowledge. However, those methods can lead to the problem of Specificity Failure, where the existing knowledge and capabilities are severely degraded due to editing. Our preliminary indicates that Specificity Failure primarily stems from the model's attention heads assigning excessive attention scores to entities related to the edited knowledge, thereby unduly focusing on specific snippets within the context, which we denote as the Attention Drift phenomenon. To mitigate such Attention Drift issue, we introduce a simple yet effective method Selective Attention Drift Restriction}(SADR), which introduces an additional regularization term during the knowledge editing process to restrict changes in the attention weight distribution, thereby preventing undue focus on the edited entity. Experiments on five frequently used strong LLMs demonstrate the effectiveness of our method, where SADR can significantly mitigate Specificity Failure in the predominant knowledge editing tasks.
中文: 大语言模型在知识编辑中常出现特异性失效问题,即注意力漂移导致现有知识受损,而提出的SADR方法通过限制注意力权重变化有效缓解了这一问题。
English: Large Language Models often suffer from Specificity Failure during knowledge editing, where attention drift causes unintended degradation of existing knowledge, but the proposed SADR method effectively mitigates this issue by restricting attention weight changes.

Authors:Michael Luo, Xiaoxiang Shi, Colin Cai, Tianjun Zhang, Justin Wong, Yichuan Wang, Chi Wang, Yanping Huang, Zhifeng Chen, Joseph E. Gonzalez, Ion Stoica
Title: Autellix: An Efficient Serving Engine for LLM Agents as General Programs
Abstract:
Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs, which scale LLM calls and output tokens to help AI agents reason, explore, and solve complex tasks. However, existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. Our analysis reveals that programs submitted to LLM serving engines experience long cumulative wait times, primarily due to head-of-line blocking at both the individual LLM request and the program. To address this, we introduce Autellix, an LLM serving system that treats programs as first-class citizens to minimize their end-to-end latencies. Autellix intercepts LLM calls submitted by programs, enriching schedulers with program-level context. We propose two scheduling algorithms-for single-threaded and distributed programs-that preempt and prioritize LLM calls based on their programs' previously completed calls. Our evaluation demonstrates that across diverse LLMs and agentic workloads, Autellix improves throughput of programs by 4-15x at the same latency compared to state-of-the-art systems, such as vLLM.
中文:大语言模型应用正从简单聊天机器人发展为动态智能代理程序,但现有服务系统忽视程序间依赖关系导致效率低下,Autellix通过将程序作为调度核心单元,实现了吞吐量的大幅提升与延迟降低。
English: Large language model applications are advancing from basic chatbots to dynamic agentic programs, but current serving systems overlook program dependencies, leading to inefficiencies, which Autellix addresses by prioritizing program-level scheduling to significantly boost throughput and reduce latency.

Authors:Wenxiang Guo, Yu Zhang, Changhao Pan, Rongjie Huang, Li Tang, Ruiqi Li, Zhiqing Hong, Yongqi Wang, Zhou Zhao
Title: TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching
Abstract:
Singing voice synthesis has made remarkable progress in generating natural and high-quality voices. However, existing methods rarely provide precise control over vocal techniques such as intensity, mixed voice, falsetto, bubble, and breathy tones, thus limiting the expressive potential of synthetic voices. We introduce TechSinger, an advanced system for controllable singing voice synthesis that supports five languages and seven vocal techniques. TechSinger leverages a flow-matching-based generative model to produce singing voices with enhanced expressive control over various techniques. To enhance the diversity of training data, we develop a technique detection model that automatically annotates datasets with phoneme-level technique labels. Additionally, our prompt-based technique prediction model enables users to specify desired vocal attributes through natural language, offering fine-grained control over the synthesized singing. Experimental results demonstrate that TechSinger significantly enhances the expressiveness and realism of synthetic singing voices, outperforming existing methods in terms of audio quality and technique-specific control. Audio samples can be found at https://gwx314.github.io/tech-singer/.
中文: TechSinger是一种先进的歌声合成系统,通过基于流匹配的生成模型和自然语言提示实现对多种演唱技巧的精确控制,在音频质量和技术控制方面显著优于现有方法。
English: TechSinger is an advanced singing voice synthesis system that enables precise control over vocal techniques through a flow-matching generative model and natural language prompts, significantly enhancing expressiveness and audio quality compared to existing methods.

Authors:Congkai Xie, Shuo Cai, Wenjun Wang, Pengxiang Li, Zhijie Sang, Kejing Yang, Yiming Zhang, Zhen Li, Guanghao Zhu, Zeyu Liu, Yang Yu, Yuhang Liu, Su Lu, Baoyi He, Qi Zhou, Xiaotian Han, Jianbo Yuan, Shengyu Zhang, Fei Wu, Hongxia Yang
Title: InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
Abstract:
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have made significant advancements in reasoning capabilities. However, they still face challenges such as high computational demands and privacy concerns. This paper focuses on developing efficient Small Language Models (SLMs) and Multimodal Small Language Models (MSLMs) that retain competitive reasoning abilities. We introduce a novel training pipeline that enhances reasoning capabilities and facilitates deployment on edge devices, achieving state-of-the-art performance while minimizing development costs. \InfR~ aims to advance AI systems by improving reasoning, reducing adoption barriers, and addressing privacy concerns through smaller model sizes. Resources are available at https://github. com/Reallm-Labs/InfiR.
中文摘要:本文提出高效的小型语言模型与多模态小型语言模型,通过创新训练流程在保持强大推理能力的同时降低计算成本与隐私风险。
English Summary: This paper introduces efficient small language models (SLMs) and multimodal SLMs that maintain strong reasoning capabilities while reducing computational costs and privacy risks through a novel training pipeline.

Authors:Yixian Wang, Geng Sun, Zemin Sun, Long He, Jiacheng Wang, Shiwen Mao
Title: IRS-assisted Edge Computing for Vehicular Networks: A Generative Diffusion Model-based Stackelberg Game Approach
Abstract:
Recent advancements in intelligent reflecting surfaces (IRS) and mobile edge computing (MEC) offer new opportunities to enhance the performance of vehicular networks. However, meeting the computation-intensive and latency-sensitive demands of vehicles remains challenging due to the energy constraints and dynamic environments. To address this issue, we study an IRS-assisted MEC architecture for vehicular networks. We formulate a multi-objective optimization problem aimed at minimizing the total task completion delay and total energy consumption by jointly optimizing task offloading, IRS phase shift vector, and computation resource allocation. Given the mixed-integer nonlinear programming (MINLP) and NP-hard nature of the problem, we propose a generative diffusion model (GDM)-based Stackelberg game (GDMSG) approach. Specifically, the problem is reformulated within a Stackelberg game framework, where generative GDM is integrated to capture complex dynamics to efficiently derive optimal solutions. Simulation results indicate that the proposed GDMSG achieves outstanding performance compared to the benchmark approaches.
中文摘要:本研究提出了一种基于生成扩散模型的斯塔克伯格博弈方法,用于优化智能反射面辅助车联网移动边缘计算中的任务卸载和资源分配,有效降低了任务延迟和能耗。
English Summary: This study proposes a generative diffusion model-based Stackelberg game approach to optimize task offloading and resource allocation in IRS-assisted mobile edge computing for vehicular networks, effectively reducing delay and energy consumption.

Authors:Zifan Lang, Guixia Liu, Geng Sun, Jiahui Li, Zemin Sun, Jiacheng Wang, Victor C. M. Leung
Title: AoI-Sensitive Data Forwarding with Distributed Beamforming in UAV-Assisted IoT
Abstract:
This paper proposes a UAV-assisted forwarding system based on distributed beamforming to enhance age of information (AoI) in Internet of Things (IoT). Specifically, UAVs collect and relay data between sensor nodes (SNs) and the remote base station (BS). However, flight delays increase the AoI and degrade the network performance. To mitigate this, we adopt distributed beamforming to extend the communication range, reduce the flight frequency and ensure the continuous data relay and efficient energy utilization. Then, we formulate an optimization problem to minimize AoI and UAV energy consumption, by jointly optimizing the UAV trajectories and communication schedules. The problem is non-convex and with high dynamic, and thus we propose a deep reinforcement learning (DRL)-based algorithm to solve the problem, thereby enhancing the stability and accelerate convergence speed. Simulation results show that the proposed algorithm effectively addresses the problem and outperforms other benchmark algorithms.
中文摘要:本文提出了一种基于分布式波束成形的无人机辅助转发系统,通过深度强化学习联合优化无人机轨迹和通信调度,有效降低了信息年龄与能耗,仿真结果表明其性能优于现有基准算法。
English Summary: This paper introduces a UAV-assisted system using distributed beamforming and deep reinforcement learning to optimize Age of Information and energy consumption by coordinating UAV trajectories and communication schedules, demonstrating superior performance over existing methods.

Authors:Xiaoxia Xu, Xidong Mu, Yuanwei Liu, Arumugam Nallanathan
Title: Joint Transmit and Pinching Beamforming for Pinching Antenna Systems (PASS): Optimization-Based or Learning-Based?
Abstract:
A novel pinching antenna system (PASS)-enabled downlink multi-user multiple-input single-output (MISO) framework is proposed. PASS consists of multiple waveguides spanning over thousands of wavelength, which equip numerous low-cost dielectric particles, named pinching antennas (PAs), to radiate signals into free space. The positions of PAs can be reconfigured to change both the large-scale path losses and phases of signals, thus facilitating the novel pinching beamforming design. A sum rate maximization problem is formulated, which jointly optimizes the transmit and pinching beamforming to adaptively achieve constructive signal enhancement and destructive interference mitigation. To solve this highly coupled and nonconvex problem, both optimization-based and learning-based methods are proposed. 1) For the optimization-based method, a majorization-minimization and penalty dual decomposition (MM-PDD) algorithm is developed, which handles the nonconvex complex exponential component using a Lipschitz surrogate function and then invokes PDD for problem decoupling. 2) For the learning-based method, a novel Karush-Kuhn-Tucker (KKT)-guided dual learning (KDL) approach is proposed, which enables KKT solutions to be reconstructed in a data-driven manner by learning dual variables. Following this idea, a KDL-Tranformer algorithm is developed, which captures both inter-PA/inter-user dependencies and channel-state-information (CSI)-beamforming dependencies by attention mechanisms. Simulation results demonstrate that: i) The proposed PASS framework significantly outperforms conventional massive multiple input multiple output (MIMO) system even with a few PAs. ii) The proposed KDL-Transformer can improve over 30% system performance than MM-PDD algorithm, while achieving a millisecond-level response on modern GPUs.
中文摘要:本文提出了一种新型的夹持天线系统(PASS),通过可重构的介质粒子实现动态波束成形,结合优化的MM-PDD算法和基于学习的KDL-Transformer方法,在显著超越传统大规模MIMO系统的同时,实现了毫秒级响应和超过30%的性能提升。
English Summary: A novel pinching antenna system (PASS) is proposed for downlink multi-user MISO communication, featuring reconfigurable dielectric particles that enable dynamic beamforming through both optimization-based MM-PDD and learning-based KDL-Transformer algorithms, significantly outperforming conventional massive MIMO systems.

Authors:Yaoxin Yang, Peng Ye, Weihao Lin, Kangcong Li, Yan Wen, Jia Hao, Tao Chen
Title: Multi-Level Decoupled Relational Distillation for Heterogeneous Architectures
Abstract:
Heterogeneous distillation is an effective way to transfer knowledge from cross-architecture teacher models to student models. However, existing heterogeneous distillation methods do not take full advantage of the dark knowledge hidden in the teacher's output, limiting their performance.To this end, we propose a novel framework named Multi-Level Decoupled Relational Knowledge Distillation (MLDR-KD) to unleash the potential of relational distillation in heterogeneous distillation. Concretely, we first introduce Decoupled Finegrained Relation Alignment (DFRA) in both logit and feature levels to balance the trade-off between distilled dark knowledge and the confidence in the correct category of the heterogeneous teacher model. Then, Multi-Scale Dynamic Fusion (MSDF) module is applied to dynamically fuse the projected logits of multiscale features at different stages in student model, further improving performance of our method in feature level. We verify our method on four architectures (CNNs, Transformers, MLPs and Mambas), two datasets (CIFAR-100 and Tiny-ImageNet). Compared with the best available method, our MLDR-KD improves student model performance with gains of up to 4.86% on CIFAR-100 and 2.78% on Tiny-ImageNet datasets respectively, showing robustness and generality in heterogeneous distillation. Code will be released soon.
中文摘要:本文提出MLDR-KD新框架,通过多级解耦关系对齐和动态特征融合提升异构知识蒸馏效果,在基准数据集上最高实现4.86%的性能提升。
English Summary: This paper introduces MLDR-KD, a novel framework that enhances heterogeneous knowledge distillation through multi-level relational alignment and dynamic feature fusion, achieving performance gains of up to 4.86% on benchmark datasets.

Authors:Zhaolin Wang, Chongjun Ouyang, Xidong Mu, Yuanwei Liu, Zhiguo Ding
Title: Modeling and Beamforming Optimization for Pinching-Antenna Systems
Abstract:
The Pinching-Antenna SyStem (PASS) is a revolutionary flexible antenna technology designed to enhance wireless communication by establishing strong line-of-sight (LoS) links, reducing free-space path loss and enabling antenna array reconfigurability. PASS uses dielectric waveguides with low propagation loss for signal transmission, radiating via a passive pinching antenna, which is a small dielectric element applied to the waveguide. This paper first proposes a physics-based hardware model for PASS, where the pinching antenna is modeled as an open-ended directional coupler, and the electromagnetic field behavior is analyzed using coupled-mode theory. A simplified signal model characterizes the coupling effect between multiple antennas on the same waveguide. Based on this, two power models are proposed: equal power and proportional power models. Additionally, a transmit power minimization problem is formulated/studied for the joint optimization of transmit and pinching beamforming under both continuous and discrete pinching antenna activations. Two algorithms are proposed to solve this multimodal optimization problem: the penalty-based alternating optimization algorithm and a low-complexity zero-forcing (ZF)-based algorithm. Numerical results show that 1) the ZF-based low-complexity algorithm performs similarly to the penalty-based algorithm, 2) PASS reduces transmit power by over 95% compared to conventional and massive MIMO, 3) discrete activation causes minimal performance loss but requires a dense antenna set to match continuous activation, and 4) the proportional power model yields performance comparable to the equal power model.
Chinese: Pinching-Antenna系统(PASS)是一种革命性的柔性天线技术,通过优化视距链路和波束赋形,将发射功率较传统方法降低95%以上,其提出的两种算法和功率模型在连续与离散天线激活下均展现出优异性能。
English: The Pinching-Antenna System (PASS) is a flexible antenna technology that enhances wireless communication by optimizing line-of-sight links and minimizing transmit power, achieving over 95% reduction compared to conventional methods through innovative beamforming algorithms and power models.

Authors:Tongtong Feng, Xin Wang, Zekai Zhou, Ren Wang, Yuwei Zhan, Guangyao Li, Qing Li, Wenwu Zhu
Title: EvoAgent: Agent Autonomous Evolution with Continual World Model for Long-Horizon Tasks
Abstract:
Completing Long-Horizon (LH) tasks in open-ended worlds is an important yet difficult problem for embodied agents. Existing approaches suffer from two key challenges: (1) they heavily rely on experiences obtained from human-created data or curricula, lacking the ability to continuously update multimodal experiences, and (2) they may encounter catastrophic forgetting issues when faced with new tasks, lacking the ability to continuously update world knowledge. To solve these challenges, this paper presents EvoAgent, an autonomous-evolving agent with a continual World Model (WM), which can autonomously complete various LH tasks across environments through self-planning, self-control, and self-reflection, without human intervention. Our proposed EvoAgent contains three modules, i.e., i) the memory-driven planner which uses an LLM along with the WM and interaction memory, to convert LH tasks into executable sub-tasks; ii) the WM-guided action controller which leverages WM to generate low-level actions and incorporates a self-verification mechanism to update multimodal experiences; iii) the experience-inspired reflector which implements a two-stage curriculum learning algorithm to select experiences for task-adaptive WM updates. Moreover, we develop a continual World Model for EvoAgent, which can continuously update the multimodal experience pool and world knowledge through closed-loop dynamics. We conducted extensive experiments on Minecraft, compared with existing methods, EvoAgent can achieve an average success rate improvement of 105% and reduce ineffective actions by more than 6x.
中文: EvoAgent是一种具备持续世界模型的自进化具身智能体,通过自我规划、控制和反思自主完成长周期任务,在无需人工干预的情况下大幅提升了任务成功率与行动效率。
English: EvoAgent is a self-evolving embodied agent with a continual World Model that autonomously completes long-horizon tasks through self-planning, control, and reflection, achieving significant improvements in success rates and efficiency without human intervention.

Authors:Tongtong Feng, Xin Wang, Zekai Zhou, Ren Wang, Yuwei Zhan, Guangyao Li, Qing Li, Wenwu Zhu
Title: EvoAgent: Self-evolving Agent with Continual World Model for Long-Horizon Tasks
Abstract:
Completing Long-Horizon (LH) tasks in open-ended worlds is an important yet difficult problem for embodied agents. Existing approaches suffer from two key challenges: (1) they heavily rely on experiences obtained from human-created data or curricula, failing to autonomously update and select multimodal experiences, and (2) they may encounter catastrophic forgetting issues when faced with new tasks, failing to autonomously update world knowledge. To solve these challenges, this paper presents {\it EvoAgent}, a self-evolving agent with a continual World Model (WM), which can autonomously complete various LH tasks across environments through self-planning, self-control, and self-reflection, without human intervention. Our proposed EvoAgent contains three modules, i.e., i) the memory-driven planner which uses an LLM along with the WM and interaction memory, to convert LH tasks into executable sub-tasks; ii) the WM-guided action controller which leverages WM to generate low-level actions and incorporates a self-verification mechanism to update multimodal experiences; iii) the experience-inspired reflector which implements a two-stage curriculum learning algorithm to select experiences for task-adaptive WM updates. Moreover, we develop a continual World Model for EvoAgent, which can autonomously update the multimodal experience pool and world knowledge through closed-loop dynamics. We conducted extensive experiments on Minecraft and Atair, compared with existing methods, EvoAgent can achieve an average success rate improvement of 105% and reduce ineffective actions by more than 6x.
中文: EvoAgent是一种具备持续世界模型的自进化具身智能体,通过自我规划、控制和反思自主完成长周期任务,在无需人工干预的情况下大幅提升了任务成功率与行动效率。
English: EvoAgent is a self-evolving embodied agent with a continual World Model that autonomously completes long-horizon tasks through self-planning, control, and reflection, achieving significant improvements in success rates and efficiency without human intervention.

Authors:Luis Gaspar Schroeder, Aditya Desai, Alejandro Cuadron, Kyle Chu, Shu Liu, Mark Zhao, Stephan Krusche, Alfons Kemper, Ion Stoica, Matei Zaharia, Joseph E. Gonzalez
Title: vCache: Verified Semantic Prompt Caching
Abstract:
Semantic caches return cached responses for semantically similar prompts to reduce LLM inference latency and cost. They embed cached prompts and store them alongside their response in a vector database. Embedding similarity metrics assign a numerical score to quantify the similarity between a request and its nearest neighbor prompt from the cache. Existing systems use the same static similarity threshold across all requests to determine whether two prompts can share similar responses. However, we observe that static thresholds do not give formal correctness guarantees, can result in unexpected error rates, and lead to suboptimal cache hit rates. This paper proposes vCache, the first verified semantic cache with user-defined error rate guarantees. It employs an online learning algorithm to estimate an optimal threshold for each cached prompt, enabling reliable cache responses without additional training. Our experiments show that vCache consistently meets the specified error bounds while outperforming state-of-the-art static-threshold and fine-tuned embedding baselines. We release the vCache implementation and three benchmarks to support future research.
Chinese Summary: vCache作为首个具备用户定义错误率保证的验证语义缓存,通过在线学习算法为每个缓存提示动态优化阈值,在无需额外训练的情况下超越静态阈值基准并保持稳定性能。
English Summary: vCache introduces a verified semantic cache with user-defined error guarantees, dynamically adjusting thresholds per prompt through online learning to outperform static-threshold systems while ensuring reliability without retraining.

Authors:Alan Zhu, Parth Asawa, Jared Quincy Davis, Lingjiao Chen, Boris Hanin, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia
Title: BARE: Leveraging Base Language Models for Few-Shot Synthetic Data Generation
Abstract:
As the demand for high-quality data in model training grows, researchers and developers are increasingly generating synthetic data to tune and train LLMs. However, current data generation methods rely on seed sets containing tens of thousands of examples to prompt instruction-tuned models. This reliance can be especially problematic when the curation of high-quality examples is expensive or difficult. In this paper we explore the novel few-shot synthetic data generation setting -- generating a high-quality dataset from a few examples. We show that when working with only a few seed examples, instruction-tuned models used in current synthetic data methods produce insufficient diversity for downstream tasks. In contrast, we show that base models without post-training, largely untapped for synthetic data generation, offer substantially greater output diversity, albeit with lower instruction following abilities. Leveraging this insight, we propose Base-Refine (BARE), a novel two-stage method that combines the diversity of base models with the quality assurance of instruction-tuned models. BARE excels in few-shot synthetic data generation: using only 3 seed examples it generates diverse, high-quality datasets that significantly improve downstream task performance. We show that fine-tuning Llama 3.1 8B with 1,000 BARE-generated samples achieves performance comparable to state-of-the-art similarly sized models on LiveCodeBench tasks. Furthermore, data generated with BARE enables a 101% improvement for a fine-tuned Llama 3.2 1B on GSM8K over data generated by only instruction-models, and an 18.4% improvement for a fine-tuned Llama 3.1 8B over the state-of-the-art RAFT method for RAG data generation.
中文: 本文提出Base-Refine (BARE)方法,通过结合基础模型的多样性和指令调优模型的质量控制,仅需少量示例即可生成高质量合成数据集,显著提升下游任务性能。
English: This paper introduces Base-Refine (BARE), a novel two-stage method that leverages base models' diversity and instruction-tuned models' quality control to generate high-quality synthetic datasets from just a few examples, significantly enhancing downstream task performance.

Authors:Yidi Jiang, Qian Chen, Shengpeng Ji, Yu Xi, Wen Wang, Chong Zhang, Xianghu Yue, ShiLiang Zhang, Haizhou Li
Title: UniCodec: Unified Audio Codec with Single Domain-Adaptive Codebook
Abstract:
The emergence of audio language models is empowered by neural audio codecs, which establish critical mappings between continuous waveforms and discrete tokens compatible with language model paradigms. The evolutionary trends from multi-layer residual vector quantizer to single-layer quantizer are beneficial for language-autoregressive decoding. However, the capability to handle multi-domain audio signals through a single codebook remains constrained by inter-domain distribution discrepancies. In this work, we introduce UniCodec, a unified audio codec with a single codebook to support multi-domain audio data, including speech, music, and sound. To achieve this, we propose a partitioned domain-adaptive codebook method and domain Mixture-of-Experts strategy to capture the distinct characteristics of each audio domain. Furthermore, to enrich the semantic density of the codec without auxiliary modules, we propose a self-supervised mask prediction modeling approach. Comprehensive objective and subjective evaluations demonstrate that UniCodec achieves excellent audio reconstruction performance across the three audio domains, outperforming existing unified neural codecs with a single codebook, and even surpasses state-of-the-art domain-specific codecs on both acoustic and semantic representation capabilities.
中文: UniCodec是一种统一音频编解码器,采用分区域自适应码本和域专家混合策略,通过单一码本高效处理多领域音频,在语音、音乐和声音上实现了卓越的重建效果和语义表达能力。
English: UniCodec is a unified audio codec that uses a single codebook with domain-adaptive partitioning and Mixture-of-Experts strategy to effectively handle multi-domain audio, achieving superior reconstruction and semantic representation across speech, music, and sound.

Authors:Md Mehrab Tanjim, Ryan A. Rossi, Mike Rimer, Xiang Chen, Sungchul Kim, Vaishnavi Muppala, Tong Yu, Zhengmian Hu, Ritwik Sinha, Wei Zhang, Iftikhar Ahamath Burhanuddin, Franck Dernoncourt
Title: Exploring Rewriting Approaches for Different Conversational Tasks
Abstract:
Conversational assistants often require a question rewriting algorithm that leverages a subset of past interactions to provide a more meaningful (accurate) answer to the user's question or request. However, the exact rewriting approach may often depend on the use case and application-specific tasks supported by the conversational assistant, among other constraints. In this paper, we systematically investigate two different approaches, denoted as rewriting and fusion, on two fundamentally different generation tasks, including a text-to-text generation task and a multimodal generative task that takes as input text and generates a visualization or data table that answers the user's question. Our results indicate that the specific rewriting or fusion approach highly depends on the underlying use case and generative task. In particular, we find that for a conversational question-answering assistant, the query rewriting approach performs best, whereas for a data analysis assistant that generates visualizations and data tables based on the user's conversation with the assistant, the fusion approach works best. Notably, we explore two datasets for the data analysis assistant use case, for short and long conversations, and we find that query fusion always performs better, whereas for the conversational text-based question-answering, the query rewrite approach performs best.
中文: 本研究比较了对话助手中的查询重写与融合方法,发现重写方法在文本问答中表现更优,而融合方法在数据分析任务中生成可视化图表时效果更佳。
English: This study compares query rewriting and fusion methods for conversational assistants, finding that rewriting excels in text-based question-answering while fusion is superior for generating visualizations and data tables in data analysis tasks.

Authors:Haoyuan Li, Yanpeng Zhou, Tao Tang, Jifei Song, Yihan Zeng, Michael Kampffmeyer, Hang Xu, Xiaodan Liang
Title: UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting
Abstract:
Recent advancements in multi-modal 3D pre-training methods have shown promising efficacy in learning joint representations of text, images, and point clouds. However, adopting point clouds as 3D representation fails to fully capture the intricacies of the 3D world and exhibits a noticeable gap between the discrete points and the dense 2D pixels of images. To tackle this issue, we propose UniGS, integrating 3D Gaussian Splatting (3DGS) into multi-modal pre-training to enhance the 3D representation. We first rely on the 3DGS representation to model the 3D world as a collection of 3D Gaussians with color and opacity, incorporating all the information of the 3D scene while establishing a strong connection with 2D images. Then, to achieve Language-Image-3D pertaining, UniGS starts with a pre-trained vision-language model to establish a shared visual and textual space through extensive real-world image-text pairs. Subsequently, UniGS employs a 3D encoder to align the optimized 3DGS with the Language-Image representations to learn unified multi-modal representations. To facilitate the extraction of global explicit 3D features by the 3D encoder and achieve better cross-modal alignment, we additionally introduce a novel Gaussian-Aware Guidance module that guides the learning of fine-grained representations of the 3D domain. Through extensive experiments across the Objaverse, ABO, MVImgNet and SUN RGBD datasets with zero-shot classification, text-driven retrieval and open-world understanding tasks, we demonstrate the effectiveness of UniGS in learning a more general and stronger aligned multi-modal representation. Specifically, UniGS achieves leading results across different 3D tasks with remarkable improvements over previous SOTA, Uni3D, including on zero-shot classification (+9.36%), text-driven retrieval (+4.3%) and open-world understanding (+7.92%).
中文: 当前基于点云的多模态3D预训练方法在捕捉3D细节和与2D图像对齐方面存在不足,因此我们提出UniGS,通过集成3D高斯泼溅和高斯感知引导模块来学习统一表征,在零样本分类、检索和开放世界理解任务中取得了显著提升。
English: Recent multi-modal 3D pre-training methods using point clouds have limitations in capturing 3D intricacies and aligning with 2D images, so we propose UniGS, which integrates 3D Gaussian Splatting and a Gaussian-Aware Guidance module to learn unified representations, achieving significant improvements in zero-shot classification, retrieval, and open-world understanding tasks.

Authors:Xiuwei Chen, Sihao Lin, Xiao Dong, Zisheng Chen, Meng Cao, Jianhua Han, Hang Xu, Xiaodan Liang
Title: TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba
Abstract:
Transformers have been favored in both uni-modal and multi-modal foundation models for their flexible scalability in attention modules. Consequently, a number of pre-trained Transformer models, e.g., LLaVA, CLIP, and DEIT, are publicly available. Recent research has introduced subquadratic architectures like Mamba, which enables global awareness with linear complexity. Nevertheless, training specialized subquadratic architectures from scratch for certain tasks is both resource-intensive and time-consuming. As a motivator, we explore cross-architecture training to transfer the ready knowledge in existing Transformer models to alternative architecture Mamba, termed TransMamba. Our approach employs a two-stage strategy to expedite training new Mamba models, ensuring effectiveness in across uni-modal and cross-modal tasks. Concerning architecture disparities, we project the intermediate features into an aligned latent space before transferring knowledge. On top of that, a Weight Subcloning and Adaptive Bidirectional distillation method (WSAB) is introduced for knowledge transfer without limitations on varying layer counts. For cross-modal learning, we propose a cross-Mamba module that integrates language awareness into Mamba's visual features, enhancing the cross-modal interaction capabilities of Mamba architecture. Despite using less than 75% of the training data typically required for training from scratch, TransMamba boasts substantially stronger performance across various network architectures and downstream tasks, including image classification, visual question answering, and text-video retrieval. The code will be publicly available.
中文: 基于Transformer的模型训练资源消耗大,因此TransMamba框架通过两阶段方法将预训练Transformer知识迁移到Mamba架构,在减少训练数据和模型规模的情况下,显著提升了多模态任务的性能。
English: Transformer-based models are widely used but resource-intensive to train, so the TransMamba framework enables efficient knowledge transfer from pre-trained Transformers to Mamba architectures, accelerating training and improving performance across various tasks with reduced data and model size.

Authors:Xiuwei Chen, Wentao Hu, Xiao Dong, Sihao Lin, Zisheng Chen, Meng Cao, Yina Zhuang, Jianhua Han, Hang Xu, Xiaodan Liang
Title: TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba
Abstract:
Transformer-based architectures have become the backbone of both uni-modal and multi-modal foundation models, largely due to their scalability via attention mechanisms, resulting in a rich ecosystem of publicly available pre-trained models such as LLaVA, CLIP, and DeiT, etc. In parallel, emerging sub-quadratic architectures like Mamba offer promising efficiency gains by enabling global context modeling with linear complexity. However, training these architectures from scratch remains resource-intensive (e.g., in terms of data and time). Motivated by this challenge, we explore a cross-architecture knowledge transfer paradigm, termed TransMamba, that facilitates the reuse of Transformer pre-trained knowledge. We propose a two-stage framework to accelerate the training of Mamba-based models, ensuring their effectiveness across both uni-modal and multi-modal tasks. The first stage leverages pre-trained Transformer models to initialize critical components of the Mamba architecture. To bridge architectural and dimensional gaps, we develop a selective weight subcloning strategy and a layered initialization scheme that prioritizes the early $n$ layers. Building on this initialization, the second stage introduces an adaptive multi-directional knowledge distillation method. This mechanism employs layer-wise adaptive scaling factors to align Mamba representations with their Transformer counterparts, while accommodating the scanning order variations inherent to multi-modal Mamba architectures. Despite operating with a reduced training dataset and a more compact model architecture, TransMamba consistently outperforms baseline approaches across diverse mamba-based backbones (e.g., PlainMamba, Vmamba, ViM and VideoMamba) and downstream tasks (e.g., image classification, visual question answering, text-video retrieval and multimodal reasoning). All code and implementation details will be released.
中文: 基于Transformer的模型训练资源消耗大,因此TransMamba框架通过两阶段方法将预训练Transformer知识迁移到Mamba架构,在减少训练数据和模型规模的情况下,显著提升了多模态任务的性能。
English: Transformer-based models are widely used but resource-intensive to train, so the TransMamba framework enables efficient knowledge transfer from pre-trained Transformers to Mamba architectures, accelerating training and improving performance across various tasks with reduced data and model size.

Authors:Yong Zhang, Bingyuan Zhang, Zhitao Li, Ming Li, Ning Cheng, Minchuan Chen, Tao Wei, Jun Ma, Shaojun Wang, Jing Xiao
Title: Self-Enhanced Reasoning Training: Activating Latent Reasoning in Small Models for Enhanced Reasoning Distillation
Abstract:
The rapid advancement of large language models (LLMs) has significantly enhanced their reasoning abilities, enabling increasingly complex tasks. However, these capabilities often diminish in smaller, more computationally efficient models like GPT-2. Recent research shows that reasoning distillation can help small models acquire reasoning capabilities, but most existing methods focus primarily on improving teacher-generated reasoning paths. Our observations reveal that small models can generate high-quality reasoning paths during sampling, even without chain-of-thought prompting, though these paths are often latent due to their low probability under standard decoding strategies. To address this, we propose Self-Enhanced Reasoning Training (SERT), which activates and leverages latent reasoning capabilities in small models through self-training on filtered, self-generated reasoning paths under zero-shot conditions. Experiments using OpenAI's GPT-3.5 as the teacher model and GPT-2 models as the student models demonstrate that SERT enhances the reasoning abilities of small models, improving their performance in reasoning distillation.
Chinese: SERT是一种自增强训练方法,通过在零样本条件下筛选并利用小模型自身生成的高质量推理路径,有效激活其潜在推理能力,从而提升推理蒸馏任务的性能表现。
English: SERT is a self-training method that activates latent reasoning capabilities in small models by filtering and utilizing their own zero-shot generated reasoning paths, enhancing performance in reasoning distillation tasks.

Authors:Yu Xia, Subhojyoti Mukherjee, Zhouhang Xie, Junda Wu, Xintong Li, Ryan Aponte, Hanjia Lyu, Joe Barrow, Hongjie Chen, Franck Dernoncourt, Branislav Kveton, Tong Yu, Ruiyi Zhang, Jiuxiang Gu, Nesreen K. Ahmed, Yu Wang, Xiang Chen, Hanieh Deilamsalehy, Sungchul Kim, Zhengmian Hu, Yue Zhao, Nedim Lipka, Seunghyun Yoon, Ting-Hao Kenneth Huang, Zichao Wang, Puneet Mathur, Soumyabrata Pal, Koyel Mukherjee, Zhehao Zhang, Namyong Park, Thien Huu Nguyen, Jiebo Luo, Ryan A. Rossi, Julian McAuley
Title: From Selection to Generation: A Survey of LLM-based Active Learning
Abstract:
Active Learning (AL) has been a powerful paradigm for improving model efficiency and performance by selecting the most informative data points for labeling and training. In recent active learning frameworks, Large Language Models (LLMs) have been employed not only for selection but also for generating entirely new data instances and providing more cost-effective annotations. Motivated by the increasing importance of high-quality data and efficient model training in the era of LLMs, we present a comprehensive survey on LLM-based Active Learning. We introduce an intuitive taxonomy that categorizes these techniques and discuss the transformative roles LLMs can play in the active learning loop. We further examine the impact of AL on LLM learning paradigms and its applications across various domains. Finally, we identify open challenges and propose future research directions. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques and deploy them to new applications.
中文: 本综述提出了大语言模型在主动学习中的分类体系,探讨其变革性作用、对学习范式的影响及未来研究方向,旨在为研究者提供部署这些技术的最新资源。
English: This survey introduces a taxonomy and examines the transformative roles of Large Language Models in active learning, addressing their impact on learning paradigms, applications, and future research directions to serve as a resource for deploying these techniques.

Authors:Changchun Liu, Kai Zhang, Junzhe Jiang, Zixiao Kong, Qi Liu, Enhong Chen
Title: Chinese Spelling Correction: A Comprehensive Survey of Progress, Challenges, and Opportunities
Abstract:
Chinese Spelling Correction (CSC) is a critical task in natural language processing, aimed at detecting and correcting spelling errors in Chinese text. This survey provides a comprehensive overview of CSC, tracing its evolution from pre-trained language models to large language models, and critically analyzing their respective strengths and weaknesses in this domain. Moreover, we further present a detailed examination of existing benchmark datasets, highlighting their inherent challenges and limitations. Finally, we propose promising future research directions, particularly focusing on leveraging the potential of LLMs and their reasoning capabilities for improved CSC performance. To the best of our knowledge, this is the first comprehensive survey dedicated to the field of CSC. We believe this work will serve as a valuable resource for researchers, fostering a deeper understanding of the field and inspiring future advancements.
中文: 该综述系统回顾了中文拼写纠错领域从预训练模型到大型语言模型的发展历程,评估了现有基准数据集,并提出了聚焦大语言模型推理能力的未来研究方向。
English: This comprehensive survey on Chinese Spelling Correction (CSC) reviews the evolution from pre-trained to large language models, evaluates benchmark datasets, and outlines future research directions focusing on LLMs' reasoning capabilities.

Authors:Xinjie Sun, Kai Zhang, Qi Liu, Shuanghong Shen, Fei Wang, Yuxiang Guo, Enhong Chen
Title: DASKT: A Dynamic Affect Simulation Method for Knowledge Tracing
Abstract:
Knowledge Tracing (KT) predicts future performance by modeling students' historical interactions, and understanding students' affective states can enhance the effectiveness of KT, thereby improving the quality of education. Although traditional KT values students' cognition and learning behaviors, efficient evaluation of students' affective states and their application in KT still require further exploration due to the non-affect-oriented nature of the data and budget constraints. To address this issue, we propose a computation-driven approach, Dynamic Affect Simulation Knowledge Tracing (DASKT), to explore the impact of various student affective states (such as frustration, concentration, boredom, and confusion) on their knowledge states. In this model, we first extract affective factors from students' non-affect-oriented behavioral data, then use clustering and spatiotemporal sequence modeling to accurately simulate students' dynamic affect changes when dealing with different problems. Subsequently, {\color{blue}we incorporate affect with time-series analysis to improve the model's ability to infer knowledge states over time and space.} Extensive experimental results on two public real-world educational datasets show that DASKT can achieve more reasonable knowledge states under the effect of students' affective states. Moreover, DASKT outperforms the most advanced KT methods in predicting student performance. Our research highlights a promising avenue for future KT studies, focusing on achieving high interpretability and accuracy.
中文: 知识追踪(KT)通过建模学生历史互动预测学习表现,而提出的动态情感模拟知识追踪(DASKT)方法从非情感导向数据中模拟学生动态情感状态,有效提升了模型的解释性和预测准确性。
English: Knowledge Tracing (KT) models student performance by analyzing historical interactions, and the proposed Dynamic Affect Simulation Knowledge Tracing (DASKT) approach enhances KT by simulating students' dynamic affective states from non-affect-oriented data, improving both interpretability and prediction accuracy.

Authors:Xuemiao Zhang, Feiyu Duan, Liangyu Xu, Yongwei Zhou, Sirui Wang, Rongxiang Weng, Jingang Wang, Xunliang Cai
Title: FRAME: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy
Abstract:
Large language models (LLMs) have significantly advanced human language understanding and generation, with pretraining data quality and organization being crucial to their performance. Multi-stage pretraining is a promising approach, but existing methods often lack quantitative criteria for data partitioning and instead rely on intuitive heuristics. In this paper, we propose the novel Four-quadRAnt Multi-stage prEtraining strategy (FRAME), guided by the established principle of organizing the pretraining process into four stages to achieve significant loss reductions four times. This principle is grounded in two key findings: first, training on high Perplexity (PPL) data followed by low PPL data, and second, training on low PPL difference (PD) data followed by high PD data, both causing the loss to drop significantly twice and performance enhancements. By partitioning data into four quadrants and strategically organizing them, FRAME achieves a remarkable 16.8% average improvement over random across MMLU and CMMLU for the 3B model, effectively boosting LLM performance.
中文摘要:FRAME策略通过基于数据困惑度和困惑度差异的四阶段预训练组织,显著提升了语言模型的性能。
English Summary: The FRAME strategy organizes pretraining into four stages based on data perplexity and perplexity difference, achieving significant performance improvements in language models.

Authors:Wenyi Wang, Maxime Gonthier, Poornima Nookala, Haochen Pan, Ian Foster, Ioan Raicu, Kyle Chard
Title: Optimizing Fine-Grained Parallelism Through Dynamic Load Balancing on Multi-Socket Many-Core Systems
Abstract:
Achieving efficient task parallelism on many-core architectures is an important challenge. The widely used GNU OpenMP implementation of the popular OpenMP parallel programming model incurs high overhead for fine-grained, short-running tasks due to time spent on runtime synchronization. In this work, we introduce and analyze three key advances that collectively achieve significant performance gains. First, we introduce XQueue, a lock-less concurrent queue implementation to replace GNU's priority task queue and remove the global task lock. Second, we develop a scalable, efficient, and hybrid lock-free/lock-less distributed tree barrier to address the high hardware synchronization overhead from GNU's centralized barrier. Third, we develop two lock-less and NUMA-aware load balancing strategies. We evaluate our implementation using Barcelona OpenMP Task Suite (BOTS) benchmarks. We show that the use of XQueue and the distributed tree barrier can improve performance by up to 1522.8$\times$ compared to the original GNU OpenMP. We further show that lock-less load balancing can improve performance by up to 4$\times$ compared to GNU OpenMP using XQueue.
Chinese: 本研究通过引入XQueue无锁并发队列、分布式树屏障及无锁负载均衡三项关键技术,显著降低了GNU OpenMP的同步开销,使细粒度任务性能提升最高达1522.8倍。
English: This work introduces three key optimizations—XQueue, a distributed tree barrier, and lock-less load balancing—that collectively reduce synchronization overhead in GNU OpenMP, achieving performance improvements of up to 1522.8× for fine-grained tasks.

Authors:Liangyu Xu, Xuemiao Zhang, Feiyu Duan, Sirui Wang, Rongxiang Weng, Jingang Wang, Xunliang Cai
Title: FIRE: Flexible Integration of Data Quality Ratings for Effective Pre-Training
Abstract:
Selecting high-quality data can improve the pretraining efficiency of large language models (LLMs). Existing methods generally rely on heuristic techniques or single quality signals, limiting their ability to evaluate data quality comprehensively. In this work, we propose FIRE, a flexible and scalable framework for integrating multiple data quality raters, which allows for a comprehensive assessment of data quality across various dimensions. FIRE aligns multiple quality signals into a unified space, and integrates diverse data quality raters to provide a comprehensive quality signal for each data point. Further, we introduce a progressive data selection scheme based on FIRE that iteratively refines the selection of high-quality data points. Extensive experiments show that FIRE outperforms other data selection methods and significantly boosts pretrained model performance across a wide range of downstream tasks, while requiring less than 37.5\% of the training data needed by the Random baseline to reach the target performance.
中文: FIRE框架通过整合多维度数据质量评估器,能全面筛选高质量数据,显著提升大语言模型预训练效率,在减少数据量的同时优化模型性能。
English: FIRE is a flexible framework that integrates multiple data quality raters to comprehensively assess and progressively select high-quality data, significantly enhancing LLM pretraining efficiency and performance with less data.

Authors:Zhi Cen, Huaijin Pi, Sida Peng, Qing Shuai, Yujun Shen, Hujun Bao, Xiaowei Zhou, Ruizhen Hu
Title: Ready-to-React: Online Reaction Policy for Two-Character Interaction Generation
Abstract:
This paper addresses the task of generating two-character online interactions. Previously, two main settings existed for two-character interaction generation: (1) generating one's motions based on the counterpart's complete motion sequence, and (2) jointly generating two-character motions based on specific conditions. We argue that these settings fail to model the process of real-life two-character interactions, where humans will react to their counterparts in real time and act as independent individuals. In contrast, we propose an online reaction policy, called Ready-to-React, to generate the next character pose based on past observed motions. Each character has its own reaction policy as its "brain", enabling them to interact like real humans in a streaming manner. Our policy is implemented by incorporating a diffusion head into an auto-regressive model, which can dynamically respond to the counterpart's motions while effectively mitigating the error accumulation throughout the generation process. We conduct comprehensive experiments using the challenging boxing task. Experimental results demonstrate that our method outperforms existing baselines and can generate extended motion sequences. Additionally, we show that our approach can be controlled by sparse signals, making it well-suited for VR and other online interactive environments.
本文提出Ready-to-React在线反应策略,通过为每个角色配备带扩散头的自回归模型来实时生成双人交互动作,在动态运动生成方面优于现有方法,并能通过稀疏信号控制生成连续动作序列,适用于VR等交互环境。
This paper introduces Ready-to-React, an online reaction policy that generates real-time two-character interactions by using individual autoregressive models with diffusion heads, outperforming existing methods in dynamic motion generation and enabling extended sequences controllable by sparse signals for VR applications.

Authors:Haibin Chen, Kangtao Lv, Chengwei Hu, Yanshi Li, Yujin Yuan, Yancheng He, Xingyao Zhang, Langming Liu, Shilei Liu, Wenbo Su, Bo Zheng
Title: ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models
Abstract:
With the increasing use of Large Language Models (LLMs) in fields such as e-commerce, domain-specific concept evaluation benchmarks are crucial for assessing their domain capabilities. Existing LLMs may generate factually incorrect information within the complex e-commerce applications. Therefore, it is necessary to build an e-commerce concept benchmark. Existing benchmarks encounter two primary challenges: (1) handle the heterogeneous and diverse nature of tasks, (2) distinguish between generality and specificity within the e-commerce field. To address these problems, we propose \textbf{ChineseEcomQA}, a scalable question-answering benchmark focused on fundamental e-commerce concepts. ChineseEcomQA is built on three core characteristics: \textbf{Focus on Fundamental Concept}, \textbf{E-commerce Generality} and \textbf{E-commerce Expertise}. Fundamental concepts are designed to be applicable across a diverse array of e-commerce tasks, thus addressing the challenge of heterogeneity and diversity. Additionally, by carefully balancing generality and specificity, ChineseEcomQA effectively differentiates between broad e-commerce concepts, allowing for precise validation of domain capabilities. We achieve this through a scalable benchmark construction process that combines LLM validation, Retrieval-Augmented Generation (RAG) validation, and rigorous manual annotation. Based on ChineseEcomQA, we conduct extensive evaluations on mainstream LLMs and provide some valuable insights. We hope that ChineseEcomQA could guide future domain-specific evaluations, and facilitate broader LLM adoption in e-commerce applications.
中文: 该摘要介绍了ChineseEcomQA,这是一个可扩展的问答基准,旨在通过基础概念解决任务异质性和平衡通用性与专业性的挑战,以评估大语言模型在电子商务领域的专业能力。
English: The abstract introduces ChineseEcomQA, a scalable question-answering benchmark designed to evaluate large language models' domain capabilities in e-commerce by addressing challenges of task heterogeneity and balancing generality with specificity through fundamental concepts.

Authors:Dingkun Yan, Xinrui Wang, Zhuoru Li, Suguru Saito, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
Title: Image Referenced Sketch Colorization Based on Animation Creation Workflow
Abstract:
Sketch colorization plays an important role in animation and digital illustration production tasks. However, existing methods still meet problems in that text-guided methods fail to provide accurate color and style reference, hint-guided methods still involve manual operation, and image-referenced methods are prone to cause artifacts. To address these limitations, we propose a diffusion-based framework inspired by real-world animation production workflows. Our approach leverages the sketch as the spatial guidance and an RGB image as the color reference, and separately extracts foreground and background from the reference image with spatial masks. Particularly, we introduce a split cross-attention mechanism with LoRA (Low-Rank Adaptation) modules. They are trained separately with foreground and background regions to control the corresponding embeddings for keys and values in cross-attention. This design allows the diffusion model to integrate information from foreground and background independently, preventing interference and eliminating the spatial artifacts. During inference, we design switchable inference modes for diverse use scenarios by changing modules activated in the framework. Extensive qualitative and quantitative experiments, along with user studies, demonstrate our advantages over existing methods in generating high-qualigy artifact-free results with geometric mismatched references. Ablation studies further confirm the effectiveness of each component. Codes are available at https://github.com/ tellurion-kanata/colorizeDiffusion.
中文摘要:本文提出了一种基于扩散模型的框架,通过草图提供空间引导、参考图像提供色彩,采用分离交叉注意力机制与LoRA模块分别处理前景与背景,有效消除伪影并实现高质量线稿上色。
English Summary: This paper introduces a diffusion-based framework that uses sketches for spatial guidance and reference images for color, employing a split cross-attention mechanism with LoRA modules to independently process foreground and background, effectively eliminating artifacts and enabling high-quality sketch colorization.

Authors:Langming Liu, Shilei Liu, Yujin Yuan, Yizhen Zhang, Bencheng Yan, Zhiyuan Zeng, Zihao Wang, Jiaqi Liu, Di Wang, Wenbo Su, Pengjie Wang, Jian Xu, Bo Zheng
Title: UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering
Abstract:
Large language models (LLMs) achieve remarkable success in natural language processing (NLP). In practical scenarios like recommendations, as users increasingly seek personalized experiences, it becomes crucial to incorporate user interaction history into the context of LLMs to enhance personalization. However, from a practical utility perspective, user interactions' extensive length and noise present challenges when used directly as text prompts. A promising solution is to compress and distill interactions into compact embeddings, serving as soft prompts to assist LLMs in generating personalized responses. Although this approach brings efficiency, a critical concern emerges: Can user embeddings adequately capture valuable information and prompt LLMs? To address this concern, we propose \name, a benchmark designed to evaluate the effectiveness of user embeddings in prompting LLMs for personalization. We establish a fair and standardized evaluation process, encompassing pre-training, fine-tuning, and evaluation stages. To thoroughly evaluate user embeddings, we design three dimensions of tasks: sequence understanding, action prediction, and interest perception. These evaluation tasks cover the industry's demands in traditional recommendation tasks, such as improving prediction accuracy, and its aspirations for LLM-based methods, such as accurately understanding user interests and enhancing the user experience. We conduct extensive experiments on various state-of-the-art methods for modeling user embeddings. Additionally, we reveal the scaling laws of leveraging user embeddings to prompt LLMs. The benchmark is available online.
中文摘要:大语言模型通过将用户交互压缩为嵌入向量来增强个性化,提出的\name基准从三个任务维度评估其有效性,确保这些嵌入能有效捕捉用户信息以优化推荐系统。
English Summary: Large language models enhance personalization by compressing user interactions into embeddings, and the proposed \name benchmark evaluates their effectiveness across three task dimensions to ensure they capture valuable user information for improved recommendations.

Authors:Mingdai Yang, Zhiwei Liu, Liangwei Yang, Xiaolong Liu, Chen Wang, Hao Peng, Philip S. Yu
Title: Training Large Recommendation Models via Graph-Language Token Alignment
Abstract:
Recommender systems (RS) have become essential tools for helping users efficiently navigate the overwhelming amount of information on e-commerce and social platforms. However, traditional RS relying on Collaborative Filtering (CF) struggles to integrate the rich semantic information from textual data. Meanwhile, large language models (LLMs) have shown promising results in natural language processing, but directly using LLMs for recommendation introduces challenges, such as ambiguity in generating item predictions and inefficiencies in scalability. In this paper, we propose a novel framework to train Large Recommendation models via Graph-Language Token Alignment. By aligning item and user nodes from the interaction graph with pretrained LLM tokens, GLTA effectively leverages the reasoning abilities of LLMs. Furthermore, we introduce Graph-Language Logits Matching (GLLM) to optimize token alignment for end-to-end item prediction, eliminating ambiguity in the free-form text as recommendation results. Extensive experiments on three benchmark datasets demonstrate the effectiveness of GLTA, with ablation studies validating each component.
中文摘要:本文提出了一种图语言令牌对齐(GLTA)框架,通过将基于图的用户-物品交互与预训练语言模型令牌对齐,有效利用大语言模型的推理能力,同时通过端到端优化消除推荐结果的模糊性。
English Summary: This paper introduces a Graph-Language Token Alignment (GLTA) framework that enhances recommender systems by aligning graph-based user-item interactions with pretrained language model tokens, effectively leveraging LLMs' reasoning while eliminating prediction ambiguity through end-to-end optimization.

Authors:Ru Wang, Wei Huang, Selena Song, Haoyu Zhang, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo
Title: Beyond In-Distribution Success: Scaling Curves of CoT Granularity for Language Model Generalization
Abstract:
Generalization to novel compound tasks under distribution shift is important for deploying transformer-based language models (LMs). This work investigates Chain-of-Thought (CoT) reasoning as a means to enhance OOD generalization. Through controlled experiments across several compound tasks, we reveal three key insights: (1) While QA-trained models achieve near-perfect in-distribution accuracy, their OOD performance degrades catastrophically, even with 10000k+ training examples; (2) the granularity of CoT data strongly correlates with generalization performance; finer-grained CoT data leads to better generalization; (3) CoT exhibits remarkable sample efficiency, matching QA performance with much less (even 80%) data. Theoretically, we demonstrate that compound tasks inherently permit shortcuts in Q-A data that misalign with true reasoning principles, while CoT forces internalization of valid dependency structures, and thus can achieve better generalization. Further, we show that transformer positional embeddings can amplify generalization by emphasizing subtask condition recurrence in long CoT sequences. Our combined theoretical and empirical analysis provides compelling evidence for CoT reasoning as a crucial training paradigm for enabling LM generalization under real-world distributional shifts for compound tasks.
中文摘要:本研究表明,思维链推理通过强化有效推理结构并展现优于标准问答训练的样本效率,能显著提升Transformer语言模型在复合任务中的分布外泛化能力。
English Summary: This study demonstrates that Chain-of-Thought (CoT) reasoning significantly improves transformer language models' out-of-distribution generalization for compound tasks by enforcing valid reasoning structures and showing superior sample efficiency compared to standard question-answering training.

Authors:Wei Liu, Yancheng He, Hui Huang, Chengwei Hu, Jiaheng Liu, Shilong Li, Wenbo Su, Bo Zheng
Title: AIR: Complex Instruction Generation via Automatic Iterative Refinement
Abstract:
With the development of large language models, their ability to follow simple instructions has significantly improved. However, adhering to complex instructions remains a major challenge. Current approaches to generating complex instructions are often irrelevant to the current instruction requirements or suffer from limited scalability and diversity. Moreover, methods such as back-translation, while effective for simple instruction generation, fail to leverage the rich contents and structures in large web corpora. In this paper, we propose a novel automatic iterative refinement framework to generate complex instructions with constraints, which not only better reflects the requirements of real scenarios but also significantly enhances LLMs' ability to follow complex instructions. The AIR framework consists of two stages: (1)Generate an initial instruction from a document; (2)Iteratively refine instructions with LLM-as-judge guidance by comparing the model's output with the document to incorporate valuable constraints. Finally, we construct the AIR-10K dataset with 10K complex instructions and demonstrate that instructions generated with our approach significantly improve the model's ability to follow complex instructions, outperforming existing methods for instruction generation.
Chinese: 本文提出了一种自动迭代优化(AIR)框架,通过基于文档的约束和以大型语言模型为评判的指导生成复杂指令,显著提升了大型语言模型遵循复杂指令的能力,并优于现有方法。
English: The paper introduces an automatic iterative refinement (AIR) framework that generates complex instructions by leveraging document-based constraints and LLM-as-judge guidance, significantly enhancing large language models' ability to follow intricate directives and outperforming existing methods.

Authors:Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu
Title: Thus Spake Long-Context Large Language Model
Abstract:
Long context is an important topic in Natural Language Processing (NLP), running through the development of NLP architectures, and offers immense opportunities for Large Language Models (LLMs) giving LLMs the lifelong learning potential akin to humans. Unfortunately, the pursuit of a long context is accompanied by numerous obstacles. Nevertheless, long context remains a core competitive advantage for LLMs. In the past two years, the context length of LLMs has achieved a breakthrough extension to millions of tokens. Moreover, the research on long-context LLMs has expanded from length extrapolation to a comprehensive focus on architecture, infrastructure, training, and evaluation technologies. Inspired by the symphonic poem, Thus Spake Zarathustra, we draw an analogy between the journey of extending the context of LLM and the attempts of humans to transcend its mortality. In this survey, We will illustrate how LLM struggles between the tremendous need for a longer context and its equal need to accept the fact that it is ultimately finite. To achieve this, we give a global picture of the lifecycle of long-context LLMs from four perspectives: architecture, infrastructure, training, and evaluation, showcasing the full spectrum of long-context technologies. At the end of this survey, we will present 10 unanswered questions currently faced by long-context LLMs. We hope this survey can serve as a systematic introduction to the research on long-context LLMs.
中文: 长上下文是大语言模型的核心竞争优势,既带来巨大机遇又面临诸多挑战,近年来上下文长度已突破至百万令牌,研究范围也从长度外推扩展到架构、基础设施、训练和评估技术的全面关注。
English: Long context is a crucial competitive advantage for Large Language Models, presenting both immense opportunities and significant challenges, with recent breakthroughs extending context length to millions of tokens and expanding research across architecture, infrastructure, training, and evaluation.

Authors:Chenghao Fan, Zhenyi Lu, Sichen Liu, Chengfeng Gu, Xiaoye Qu, Wei Wei, Yu Cheng
Title: Make LoRA Great Again: Boosting LoRA with Adaptive Singular Values and Mixture-of-Experts Optimization Alignment
Abstract:
While Low-Rank Adaptation (LoRA) enables parameter-efficient fine-tuning for Large Language Models (LLMs), its performance often falls short of Full Fine-Tuning (Full FT). Current methods optimize LoRA by initializing with static singular value decomposition (SVD) subsets, leading to suboptimal leveraging of pre-trained knowledge. Another path for improving LoRA is incorporating a Mixture-of-Experts (MoE) architecture. However, weight misalignment and complex gradient dynamics make it challenging to adopt SVD prior to the LoRA MoE architecture. To mitigate these issues, we propose \underline{G}reat L\underline{o}R\underline{A} Mixture-of-Exper\underline{t} (GOAT), a framework that (1) adaptively integrates relevant priors using an SVD-structured MoE, and (2) aligns optimization with full fine-tuned MoE by deriving a theoretical scaling factor. We demonstrate that proper scaling, without modifying the architecture or training algorithms, boosts LoRA MoE's efficiency and performance. Experiments across 25 datasets, including natural language understanding, commonsense reasoning, image classification, and natural language generation, demonstrate GOAT's state-of-the-art performance, closing the gap with Full FT.
中文: 提出的GOAT框架通过集成SVD结构的专家混合和理论缩放因子,提升了LoRA的性能,在多种任务中达到领先水平,缩小了与全参数微调的差距。
English: The proposed GOAT framework enhances LoRA's performance by integrating an SVD-structured Mixture-of-Experts and a theoretical scaling factor, achieving state-of-the-art results across diverse tasks and narrowing the gap with Full Fine-Tuning.

Authors:Jiancheng An, Zhu Han, Dusit Niyato, Mérouane Debbah, Chau Yuen, Lajos Hanzo
Title: Flexible Intelligent Metasurfaces for Enhancing MIMO Communications
Abstract:
Flexible intelligent metasurfaces (FIMs) show great potential for improving the wireless network capacity in an energy-efficient manner. An FIM is a soft array consisting of several low-cost radiating elements. Each element can independently emit electromagnetic signals, while flexibly adjusting its position even perpendicularly to the overall surface to `morph' its 3D shape. More explicitly, compared to a conventional rigid antenna array, an FIM is capable of finding an optimal 3D surface shape that provides improved signal quality. In this paper, we study point-to-point multiple-input multiple-output (MIMO) communications between a pair of FIMs. In order to characterize the capacity limits of FIM-aided MIMO transmissions over frequency-flat fading channels, we formulate a transmit optimization problem for maximizing the MIMO channel capacity by jointly optimizing the 3D surface shapes of the transmitting and receiving FIMs as well as the MIMO transmit covariance matrix, subject to the total transmit power constraint and to the maximum perpendicular morphing range of the FIM. To solve this problem, we develop an efficient block coordinate descent (BCD) algorithm. The BCD algorithm iteratively updates the 3D surface shapes of the FIMs and the transmit covariance matrix, while keeping the other fixed, to find a locally optimal solution. Numerical results verify that FIMs can achieve higher MIMO capacity than that of the conventional rigid arrays. In particular, the MIMO channel capacity can be doubled by the proposed BCD algorithm under some setups.
Chinese: 柔性智能超表面通过优化三维形态和传输参数,显著提升MIMO通信容量,其高效算法可使信道容量在特定配置下达到传统刚性天线的两倍。
English: Flexible intelligent metasurfaces (FIMs) enhance MIMO communication capacity by optimizing their 3D shapes and transmit parameters, achieving up to double the capacity of rigid arrays with an efficient algorithm.

Authors:Loc X. Nguyen, Avi Deb Raha, Pyae Sone Aung, Dusit Niyato, Zhu Han, Choong Seon Hong
Title: A Contemporary Survey on Semantic Communications:Theory of Mind, Generative AI, and Deep Joint Source-Channel Coding
Abstract:
Semantic Communication is becoming the next pillar in wireless communication technology due to its various capabilities. However, it still encounters various challenging obstacles that need to be solved before real-world deployment. The major challenge is the lack of standardization across different directions, leading to variations in interpretations and objectives. In the survey, we provide detailed explanations of three leading directions in semantic communications, namely Theory of Mind, Generative AI, Deep Joint Source-Channel Coding. These directions have been widely studied, developed, and verified by institutes worldwide, and their effectiveness has increased along with the advancement in technology. We first introduce the concepts and background of these directions. Firstly, we introduce the Theory of Mind, where the communication agents interact with each other, gaining understanding from observations and slowly forming a common language. Secondly, we present generative AI models, which can create new content and offer more freedom to interpret the data beyond the limitation of semantic meaning compression of raw data before transmitting it. The received signal is then decoded by another generative AI model to execute the oriented task. Thirdly, we review deep learning models to jointly optimize the source and channel coding modules. Then, we present a comprehensive survey of existing works in each direction, thereby offering readers an overview of past achievements and potential avenues for further contribution. Moreover, for each direction, we identify and discuss the existing challenges that must be addressed before these approaches can be effectively deployed in real-world scenarios.
中文摘要:语义通信作为下一代无线通信技术的支柱,通过心智理论、生成式人工智能和深度联合信源信道编码三大方向提升智能交互与传输效率,但仍面临标准化、可扩展性等关键挑战。
English Summary: Semantic communication is advancing as a key wireless technology with three main approaches—Theory of Mind-based, Generative AI-driven, and Deep Joint Source-Channel Coding—that improve intelligent interaction and efficiency, yet face challenges like standardization and scalability.

Authors:Loc X. Nguyen, Avi Deb Raha, Pyae Sone Aung, Dusit Niyato, Zhu Han, Choong Seon Hong
Title: A Contemporary Survey on Semantic Communications:Theory of Mind, Generative AI, and Deep Joint Source-Channel Coding
Abstract:
Semantic communication is emerging as the next pillar in wireless communication technology due to its transformative capabilities in reducing communication overhead, enhancing robustness, and enabling intelligent information exchange. The most significant obstacle lies in the lack of standardization across various research directions, leading to inconsistencies in interpretation, objectives, and evaluation. In this survey, we provide an in-depth overview of three leading directions in semantic communication, namely Theory of Mind-based semantic communication, Generative AI-driven semantic communication, and Deep Joint Source-Channel Coding (DJSCC)-based semantic communication. These directions have been extensively studied and developed by research institutes worldwide, and their effectiveness continues to improve alongside advances in communication and computing technologies. The ToM-based semantic communication enables communication agents to interact intelligently, infer each other's intentions, and gradually form a shared understanding. The GAI-based semantic communication leverages generative models to create and interpret content beyond traditional compression, allowing flexible semantic encoding and decoding tailored to specific tasks. The DJSCC-based semantic communication direction integrates DL models to jointly optimize the source and channel coding processes for efficient semantic information transfer. Next, we present a detailed survey of existing works under each direction and open research problems in semantic communication. Furthermore, we identify and analyze critical challenges, such as scalability and adaptability, that currently hinder the deployment of semantic communication systems. Finally, we discuss potential research opportunities and future directions such as quantum computing to further enhance the capabilities of semantic communication.
中文摘要:语义通信作为下一代无线通信技术的支柱,通过心智理论、生成式人工智能和深度联合信源信道编码三大方向提升智能交互与传输效率,但仍面临标准化、可扩展性等关键挑战。
English Summary: Semantic communication is advancing as a key wireless technology with three main approaches—Theory of Mind-based, Generative AI-driven, and Deep Joint Source-Channel Coding—that improve intelligent interaction and efficiency, yet face challenges like standardization and scalability.

Authors:Xiwei Xu, Cesare Pautasso, Sin Kuang Lo, Liming Zhu, Qinghua Lu, Ingo Weber
Title: An Extended Pattern Collection for Blockchain-based Applications
Abstract:
Blockchain is an emerging technology that enables new forms of decentralized software architectures, where distributed components can reach agreements on shared system states without trusting a central integration point. Blockchain provides a shared infrastructure to execute programs, called smart contracts, and to store data. Since blockchain technologies are at an early stage, there is a lack of a systematically organized knowledge providing a holistic view on designing software systems that use blockchain. We view blockchain as a component of a bigger software system, which requires patterns for using blockchain in the design of the software architecture. In this paper, we collect a list of patterns for blockchain-based applications. The pattern collection is categorized into five categories, including interaction with external world patterns, data management patterns, security patterns, structural patterns of contracts, and user interaction patterns. Some patterns are designed considering the nature of blockchain and how blockchains can be specifically introduced within real-world applications. Others are variants of existing design patterns applied in the context of blockchain-based applications and smart contracts.
中文: 区块链作为去中心化基础设施,用于执行智能合约和存储数据,但缺乏系统化的设计模式来有效整合到软件系统中,因此本文归纳了五类适用于区块链应用的模式。
English: Blockchain serves as a decentralized infrastructure for executing smart contracts and storing data, yet there is a need for systematic design patterns to integrate it effectively into software systems, leading to the categorization of five pattern types for blockchain-based applications.

Authors:Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, Xiangnan He
Title: Less is More: Improving LLM Alignment via Preference Data Selection
Abstract:
Direct Preference Optimization (DPO) has emerged as a promising approach for aligning large language models with human preferences. While prior work mainly extends DPO from the aspect of the objective function, we instead improve DPO from the largely overlooked but critical aspect of data selection. Specifically, we address the issue of parameter shrinkage caused by noisy data by proposing a novel margin-maximization principle for dataset curation in DPO training. To further mitigate the noise in different reward models, we propose a Bayesian Aggregation approach that unifies multiple margin sources (external and implicit) into a single preference probability. Extensive experiments in diverse settings demonstrate the consistently high data efficiency of our approach. Remarkably, by using just 10\% of the Ultrafeedback dataset, our approach achieves 3\% to 8\% improvements across various Llama, Mistral, and Qwen models on the AlpacaEval2 benchmark. Furthermore, our approach seamlessly extends to iterative DPO, yielding a roughly 3\% improvement with 25\% online data, revealing the high redundancy in this presumed high-quality data construction manner. These results highlight the potential of data selection strategies for advancing preference optimization.
中文: 本研究通过引入边界最大化数据选择原则和贝叶斯聚合方法,有效降低了噪声影响,在多种模型上仅用少量数据就显著提升了直接偏好优化的性能。
English: This study enhances Direct Preference Optimization by introducing a margin-maximization principle for data selection and a Bayesian Aggregation method to mitigate noise, achieving significant performance improvements with minimal data across multiple models.

Authors:Jihao Gu, Yingyao Wang, Pi Bu, Chen Wang, Ziming Wang, Tengtao Song, Donglai Wei, Jiale Yuan, Yingxiu Zhao, Yancheng He, Shilong Li, Jiaheng Liu, Meng Cao, Jun Song, Yingshui Tan, Xiang Li, Wenbo Su, Zhicheng Zheng, Xiaoyong Zhu, Bo Zheng
Title: "See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models
Abstract:
The evaluation of factual accuracy in large vision language models (LVLMs) has lagged behind their rapid development, making it challenging to fully reflect these models' knowledge capacity and reliability. In this paper, we introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA, aimed at assessing the visual factuality of LVLMs across 8 major topics and 56 subtopics. The key features of this benchmark include a focus on the Chinese language, diverse knowledge types, a multi-hop question construction, high-quality data, static consistency, and easy-to-evaluate through short answers. Moreover, we contribute a rigorous data construction pipeline and decouple the visual factuality into two parts: seeing the world (i.e., object recognition) and discovering knowledge. This decoupling allows us to analyze the capability boundaries and execution mechanisms of LVLMs. Subsequently, we evaluate 34 advanced open-source and closed-source models, revealing critical performance gaps within this field. Our evaluation-friendly code and data have already been open-sourced.
中文: 本文提出了首个中文视觉问答基准ChineseSimpleVQA,通过解构视觉事实性为“观察世界”和“发现知识”两个维度,系统评估了34个大型视觉语言模型在不同主题下的表现,揭示了该领域存在的显著性能差距。
English: This paper introduces ChineseSimpleVQA, the first Chinese visual question-answering benchmark for evaluating the factual accuracy of large vision-language models across diverse topics, revealing significant performance gaps through systematic assessment of 34 models.

Authors:Yingshui Tan, Yilei Jiang, Yanshi Li, Jiaheng Liu, Xingyuan Bu, Wenbo Su, Xiangyu Yue, Xiaoyong Zhu, Bo Zheng
Title: Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models
Abstract:
Fine-tuning large language models (LLMs) based on human preferences, commonly achieved through reinforcement learning from human feedback (RLHF), has been effective in improving their performance. However, maintaining LLM safety throughout the fine-tuning process remains a significant challenge, as resolving conflicts between safety and helpfulness can be non-trivial. Typically, the safety alignment of LLM is trained on data with safety-related categories. However, our experiments find that naively increasing the scale of safety training data usually leads the LLMs to an ``overly safe'' state rather than a ``truly safe'' state, boosting the refusal rate through extensive safety-aligned data without genuinely understanding the requirements for safe responses. Such an approach can inadvertently diminish the models' helpfulness. To understand the phenomenon, we first investigate the role of safety data by categorizing them into three different groups, and observe that each group behaves differently as training data scales up. To boost the balance between safety and helpfulness, we propose an Equilibrate RLHF framework including a Fine-grained Data-centric (FDC) approach that achieves better safety alignment even with fewer training data, and an Adaptive Message-wise Alignment (AMA) approach, which selectively highlight the key segments through a gradient masking strategy. Extensive experimental results demonstrate that our approach significantly enhances the safety alignment of LLMs while balancing safety and helpfulness.
中文摘要:本研究提出一种均衡RLHF框架,通过细粒度数据分类和自适应对齐策略,解决大语言模型在安全对齐过程中因过度训练而导致拒绝率升高、实用性下降的问题,实现了安全性与实用性的更好平衡。
English Summary: The study introduces an Equilibrate RLHF framework that addresses the challenge of LLMs becoming overly safe at the cost of helpfulness by employing fine-grained data categorization and adaptive alignment strategies to better balance safety and performance.

Authors:Peiji Li, Kai Lv, Yunfan Shao, Yichuan Ma, Linyang Li, Xiaoqing Zheng, Xipeng Qiu, Qipeng Guo
Title: FastMCTS: A Simple Sampling Strategy for Data Synthesis
Abstract:
Synthetic high-quality multi-step reasoning data can significantly enhance the performance of large language models on various tasks. However, most existing methods rely on rejection sampling, which generates trajectories independently and suffers from inefficiency and imbalanced sampling across problems of varying difficulty. In this work, we introduce FastMCTS, an innovative data synthesis strategy inspired by Monte Carlo Tree Search. FastMCTS provides a more efficient sampling method for multi-step reasoning data, offering step-level evaluation signals and promoting balanced sampling across problems of different difficulty levels. Experiments on both English and Chinese reasoning datasets demonstrate that FastMCTS generates over 30\% more correct reasoning paths compared to rejection sampling as the number of generated tokens scales up. Furthermore, under comparable synthetic data budgets, models trained on FastMCTS-generated data outperform those trained on rejection sampling data by 3.9\% across multiple benchmarks. As a lightweight sampling strategy, FastMCTS offers a practical and efficient alternative for synthesizing high-quality reasoning data. Our code will be released soon.
中文摘要:FastMCTS作为一种基于蒙特卡洛树搜索的高效数据合成策略,相比拒绝采样能生成更优质的多步推理数据,正确推理路径增加30%以上,并在相同数据预算下使模型性能提升3.9%。
English Summary: FastMCTS is an efficient Monte Carlo Tree Search-based strategy that generates higher quality multi-step reasoning data than rejection sampling, producing 30% more correct paths and improving model performance by 3.9% across benchmarks.

Authors:Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, Kai Chen
Title: UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale pre-training data and (ii) synthesizing instruction data through prompt engineering with powerful models. While pre-training data faces quality consistency issues, instruction-based synthesis suffers from limited instruction diversity and inherent biases of LLMs. To address this gap, we introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to both guide and validate the code generation process. Combined with large-scale package-based retrieval from pre-training corpus, we generate a dataset of 500K+ verifiable programs containing diverse API calls. Evaluations on multiple Python benchmarks (BigCodeBench, HumanEval, MBPP) demonstrate that models fine-tuned on our synthetic data exhibit consistent performance improvements. Notably, Llama3.1-8B and InternLM2.5-7B improve from 31\% and 28\% to 40\% and 39\% success rates on BigCodeBench, respectively. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora, demonstrating the potential for producing diverse and high-quality post-training data at scale. All code and data will be released (https://github.com).
中文摘要:UnitCoder通过模型生成的单元测试来引导和验证代码生成,创建了50多万个可验证程序,经过微调的模型在多个基准测试中性能得到显著提升。
English Summary: UnitCoder introduces a systematic pipeline using model-generated unit tests to guide and validate code generation, creating over 500K verifiable programs that improve model performance on multiple benchmarks when fine-tuned.

Authors:Chengshuai Zhao, Zhen Tan, Chau-Wai Wong, Xinyan Zhao, Tianlong Chen, Huan Liu
Title: SCALE: Towards Collaborative Content Analysis in Social Science with Large Language Model Agents and Human Intervention
Abstract:
Content analysis breaks down complex and unstructured texts into theory-informed numerical categories. Particularly, in social science, this process usually relies on multiple rounds of manual annotation, domain expert discussion, and rule-based refinement. In this paper, we introduce SCALE, a novel multi-agent framework that effectively $\underline{\textbf{S}}$imulates $\underline{\textbf{C}}$ontent $\underline{\textbf{A}}$nalysis via $\underline{\textbf{L}}$arge language model (LLM) ag$\underline{\textbf{E}}$nts. SCALE imitates key phases of content analysis, including text coding, collaborative discussion, and dynamic codebook evolution, capturing the reflective depth and adaptive discussions of human researchers. Furthermore, by integrating diverse modes of human intervention, SCALE is augmented with expert input to further enhance its performance. Extensive evaluations on real-world datasets demonstrate that SCALE achieves human-approximated performance across various complex content analysis tasks, offering an innovative potential for future social science research.
Chinese: 本文提出SCALE多智能体框架,利用大语言模型智能体模拟内容分析的关键环节,包括文本编码、协作讨论和动态编码本演化,通过专家干预增强性能,在真实数据集上实现了接近人类水平的表现。
English: This paper introduces SCALE, a multi-agent framework that simulates content analysis using large language model agents to automate text coding, collaborative discussion, and codebook evolution, achieving human-like performance in social science tasks through expert-augmented interventions.

Authors:Ziye Jia, Yilu Cao, Lijun He, Qihui Wu, Qiuming Zhu, Dusit Niyato, Zhu Han
Title: Service Function Chain Dynamic Scheduling in Space-Air-Ground Integrated Networks
Abstract:
As an important component of the sixth generation communication technologies, the space-air-ground integrated network (SAGIN) attracts increasing attentions in recent years. However, due to the mobility and heterogeneity of the components such as satellites and unmanned aerial vehicles in multi-layer SAGIN, the challenges of inefficient resource allocation and management complexity are aggregated. To this end, the network function virtualization technology is introduced and can be implemented via service function chains (SFCs) deployment. However, urgent unexpected tasks may bring conflicts and resource competition during SFC deployment, and how to schedule the SFCs of multiple tasks in SAGIN is a key issue. In this paper, we address the dynamic and complexity of SAGIN by presenting a reconfigurable time extension graph and further propose the dynamic SFC scheduling model. Then, we formulate the SFC scheduling problem to maximize the number of successful deployed SFCs within limited resources and time horizons. Since the problem is in the form of integer linear programming and intractable to solve, we propose the algorithm by incorporating deep reinforcement learning. Finally, simulation results show that the proposed algorithm has better convergence and performance compared to other benchmark algorithms.
Chinese: 本文针对空天地一体化网络中资源分配低效和管理复杂的问题,提出了动态服务功能链调度模型和基于深度强化学习的算法,以在有限资源和时间内最大化成功部署的服务功能链数量。
English: This paper addresses the challenges of resource allocation and management complexity in the space-air-ground integrated network by proposing a dynamic service function chain scheduling model and a deep reinforcement learning-based algorithm to maximize successful deployments under constrained resources and time.

Authors:Bencheng Yan, Shilei Liu, Zhiyuan Zeng, Zihao Wang, Yizhen Zhang, Yujin Yuan, Langming Liu, Jiaqi Liu, Di Wang, Wenbo Su, Wang Pengjie, Jian Xu, Bo Zheng
Title: Unlocking Scaling Law in Industrial Recommendation Systems with a Three-step Paradigm based Large User Model
Abstract:
Recent advancements in autoregressive Large Language Models (LLMs) have achieved significant milestones, largely attributed to their scalability, often referred to as the "scaling law". Inspired by these achievements, there has been a growing interest in adapting LLMs for Recommendation Systems (RecSys) by reformulating RecSys tasks into generative problems. However, these End-to-End Generative Recommendation (E2E-GR) methods tend to prioritize idealized goals, often at the expense of the practical advantages offered by traditional Deep Learning based Recommendation Models (DLRMs) in terms of in features, architecture, and practices. This disparity between idealized goals and practical needs introduces several challenges and limitations, locking the scaling law in industrial RecSys. In this paper, we introduce a large user model (LUM) that addresses these limitations through a three-step paradigm, designed to meet the stringent requirements of industrial settings while unlocking the potential for scalable recommendations. Our extensive experimental evaluations demonstrate that LUM outperforms both state-of-the-art DLRMs and E2E-GR approaches. Notably, LUM exhibits excellent scalability, with performance improvements observed as the model scales up to 7 billion parameters. Additionally, we have successfully deployed LUM in an industrial application, where it achieved significant gains in an A/B test, further validating its effectiveness and practicality.
自回归大语言模型在推荐系统中潜力巨大,但常忽视传统模型的实用优势,因此新提出的大型用户模型成功融合了可扩展性与工业需求,并在实验中超越了现有方法。
Autoregressive large language models have shown promise in recommendation systems but often overlook practical benefits of traditional models, leading to a new large user model that effectively combines scalability with industrial requirements and outperforms existing methods.

Authors:Zicheng Liu, Siyuan Li, Zhiyuan Chen, Fang Wu, Chang Yu, Qirong Yang, Yucheng Guo, Yujie Yang, Xiaoming Zhang, Stan Z. Li
Title: Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification
Abstract:
The interactions between DNA, RNA, and proteins are fundamental to biological processes, as illustrated by the central dogma of molecular biology. Although modern biological pre-trained models have achieved great success in analyzing these macromolecules individually, their interconnected nature remains underexplored. This paper follows the guidance of the central dogma to redesign both the data and model pipeline and offers a comprehensive framework, Life-Code, that spans different biological functions. As for data flow, we propose a unified pipeline to integrate multi-omics data by reverse-transcribing RNA and reverse-translating amino acids into nucleotide-based sequences. As for the model, we design a codon tokenizer and a hybrid long-sequence architecture to encode the interactions between coding and non-coding regions through masked modeling pre-training. To model the translation and folding process with coding sequences, Life-Code learns protein structures of the corresponding amino acids by knowledge distillation from off-the-shelf protein language models. Such designs enable Life-Code to capture complex interactions within genetic sequences, providing a more comprehensive understanding of multi-omics with the central dogma. Extensive experiments show that Life-Code achieves state-of-the-art results on various tasks across three omics, highlighting its potential for advancing multi-omics analysis and interpretation.
中文: 本文提出Life-Code框架,遵循中心法则重构数据和模型流程,通过整合多组学数据和混合长序列架构捕捉生物大分子间的相互作用,在多组学任务中取得领先性能。
English: This paper introduces Life-Code, a comprehensive framework that redesigns data and model pipelines based on the central dogma to capture interactions between DNA, RNA, and proteins, achieving state-of-the-art results across multi-omics tasks.

Authors:Li Sun, Ziheng Zhang, Zixi Wang, Yujie Wang, Qiqi Wan, Hao Li, Hao Peng, Philip S. Yu
Title: Pioneer: Physics-informed Riemannian Graph ODE for Entropy-increasing Dynamics
Abstract:
Dynamic interacting system modeling is important for understanding and simulating real world systems. The system is typically described as a graph, where multiple objects dynamically interact with each other and evolve over time. In recent years, graph Ordinary Differential Equations (ODE) receive increasing research attentions. While achieving encouraging results, existing solutions prioritize the traditional Euclidean space, and neglect the intrinsic geometry of the system and physics laws, e.g., the principle of entropy increasing. The limitations above motivate us to rethink the system dynamics from a fresh perspective of Riemannian geometry, and pose a more realistic problem of physics-informed dynamic system modeling, considering the underlying geometry and physics law for the first time. In this paper, we present a novel physics-informed Riemannian graph ODE for a wide range of entropy-increasing dynamic systems (termed as Pioneer). In particular, we formulate a differential system on the Riemannian manifold, where a manifold-valued graph ODE is governed by the proposed constrained Ricci flow, and a manifold preserving Gyro-transform aware of system geometry. Theoretically, we report the provable entropy non-decreasing of our formulation, obeying the physics laws. Empirical results show the superiority of Pioneer on real datasets.
中文摘要:本文提出了一种新颖的物理信息黎曼图常微分方程方法Pioneer,通过结合黎曼几何和物理定律来模拟熵增动态系统,在理论上遵循熵增原理并在实际数据集上展现出优越性能。
English Summary: This paper introduces Pioneer, a novel physics-informed Riemannian graph ODE that models entropy-increasing dynamic systems by incorporating Riemannian geometry and physics laws, demonstrating theoretical adherence to entropy principles and empirical superiority on real datasets.

Authors:Wentao Shi, Zichun Yu, Fuli Feng, Xiangnan He, Chenyan Xiong
Title: Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search
Abstract:
Monte Carlo Tree Search (MCTS) based methods provide promising approaches for generating synthetic data to enhance the self-training of Large Language Model (LLM) based multi-agent systems (MAS). These methods leverage Q-values to estimate individual agent contributions. However, relying solely on Q-values to identify informative data may misalign with the data synthesis objective, as the focus should be on selecting data that best enhances model training. To address this discrepancy, we propose Data Influence-oriented Tree Search (DITS), a novel framework that incorporates influence scores to guide both tree search and data selection. By leveraging influence scores, we effectively identify the most impactful data for system improvement, thereby enhancing model performance. Furthermore, we derive influence score estimation methods tailored for non-differentiable metrics, significantly reducing computational overhead by utilizing inference computations. Extensive experiments on eight multi-agent datasets demonstrate the robustness and effectiveness of the proposed methods. Notably, our findings reveal that allocating more inference resources to estimate influence scores, rather than Q-values, during data synthesis can more effectively and efficiently enhance model training.
中文: 提出的数据影响导向树搜索(DITS)框架通过采用影响力评分替代Q值来优化多智能体系统的合成数据生成,使数据选择更契合训练目标,在显著提升模型性能的同时有效降低了计算开销。
English: The proposed Data Influence-oriented Tree Search (DITS) framework improves synthetic data generation for multi-agent systems by using influence scores instead of Q-values to better align data selection with training objectives, significantly boosting model performance while reducing computational costs.

Authors:Bencheng Yan, Si Chen, Shichang Jia, Jianyu Liu, Yueran Liu, Chenghan Fu, Wanxian Guan, Hui Zhao, Xiang Zhang, Kai Zhang, Wenbo Su, Pengjie Wang, Jian Xu, Bo Zheng, Baolin Liu
Title: MIM: Multi-modal Content Interest Modeling Paradigm for User Behavior Modeling
Abstract:
Click-Through Rate (CTR) prediction is a crucial task in recommendation systems, online searches, and advertising platforms, where accurately capturing users' real interests in content is essential for performance. However, existing methods heavily rely on ID embeddings, which fail to reflect users' true preferences for content such as images and titles. This limitation becomes particularly evident in cold-start and long-tail scenarios, where traditional approaches struggle to deliver effective results. To address these challenges, we propose a novel Multi-modal Content Interest Modeling paradigm (MIM), which consists of three key stages: Pre-training, Content-Interest-Aware Supervised Fine-Tuning (C-SFT), and Content-Interest-Aware UBM (CiUBM). The pre-training stage adapts foundational models to domain-specific data, enabling the extraction of high-quality multi-modal embeddings. The C-SFT stage bridges the semantic gap between content and user interests by leveraging user behavior signals to guide the alignment of embeddings with user preferences. Finally, the CiUBM stage integrates multi-modal embeddings and ID-based collaborative filtering signals into a unified framework. Comprehensive offline experiments and online A/B tests conducted on the Taobao, one of the world's largest e-commerce platforms, demonstrated the effectiveness and efficiency of MIM method. The method has been successfully deployed online, achieving a significant increase of +14.14% in CTR and +4.12% in RPM, showcasing its industrial applicability and substantial impact on platform performance. To promote further research, we have publicly released the code and dataset at https://pan.quark.cn/s/8fc8ec3e74f3.
中文: 提出的多模态内容兴趣建模(MIM)范式通过融合多模态嵌入和用户行为信号,解决了基于ID方法的局限性,在淘宝平台上实现了CTR和RPM的显著提升。
English: The proposed Multi-modal Content Interest Modeling (MIM) paradigm addresses limitations of ID-based methods by integrating multi-modal embeddings and user behavior signals, achieving significant improvements in CTR and RPM on Taobao's platform.

Authors:Xiangyu Zhao, Yichao Wang, Bo Chen, Jingtong Gao, Yuhao Wang, Xiaopeng Li, Pengyue Jia, Qidong Liu, Huifeng Guo, Ruiming Tang
Title: Joint Modeling in Recommendations: A Survey
Abstract:
In today's digital landscape, Deep Recommender Systems (DRS) play a crucial role in navigating and customizing online content for individual preferences. However, conventional methods, which mainly depend on single recommendation task, scenario, data modality and user behavior, are increasingly seen as insufficient due to their inability to accurately reflect users' complex and changing preferences. This gap underscores the need for joint modeling approaches, which are central to overcoming these limitations by integrating diverse tasks, scenarios, modalities, and behaviors in the recommendation process, thus promising significant enhancements in recommendation precision, efficiency, and customization. In this paper, we comprehensively survey the joint modeling methods in recommendations. We begin by defining the scope of joint modeling through four distinct dimensions: multi-task, multi-scenario, multi-modal, and multi-behavior modeling. Subsequently, we examine these methods in depth, identifying and summarizing their underlying paradigms based on the latest advancements and potential research trajectories. Ultimately, we highlight several promising avenues for future exploration in joint modeling for recommendations and provide a concise conclusion to our findings.
中文摘要:本文系统综述了深度推荐系统中的联合建模方法,通过整合多任务、多场景、多模态和多行为数据来突破传统单一维度推荐的限制,从而显著提升推荐的精准度与个性化水平。
English Summary: This paper surveys joint modeling approaches in deep recommender systems that integrate multiple tasks, scenarios, modalities, and behaviors to overcome limitations of conventional single-dimensional methods and enhance recommendation accuracy and personalization.

Authors:Tong Li, Shu Yang, Junchao Wu, Jiyao Wei, Lijie Hu, Mengdi Li, Derek F. Wong, Joshua R. Oltmanns, Di Wang
Title: Can Large Language Models Identify Implicit Suicidal Ideation? An Empirical Evaluation
Abstract:
We present a comprehensive evaluation framework for assessing Large Language Models' (LLMs) capabilities in suicide prevention, focusing on two critical aspects: the Identification of Implicit Suicidal ideation (IIS) and the Provision of Appropriate Supportive responses (PAS). We introduce \ourdata, a novel dataset of 1,308 test cases built upon psychological frameworks including D/S-IAT and Negative Automatic Thinking, alongside real-world scenarios. Through extensive experiments with 8 widely used LLMs under different contextual settings, we find that current models struggle significantly with detecting implicit suicidal ideation and providing appropriate support, highlighting crucial limitations in applying LLMs to mental health contexts. Our findings underscore the need for more sophisticated approaches in developing and evaluating LLMs for sensitive psychological applications.
中文: 本研究评估了大型语言模型在自杀预防中的应用,发现它们在识别隐性自杀意念和提供恰当支持方面存在显著困难,凸显了在心理健康应用中开发更精细方法的必要性。
English: This study evaluates large language models for suicide prevention, revealing their significant difficulties in identifying implicit suicidal thoughts and offering proper support, which points to essential limitations in mental health applications.

Authors:Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che
Title: MULTITAT: Benchmarking Multilingual Table-and-Text Question Answering
Abstract:
Question answering on the hybrid context of tables and text (TATQA) is a critical task, with broad applications in data-intensive domains. However, existing TATQA datasets are limited to English, leading to several drawbacks: (i) They overlook the challenges of multilingual TAT-QA and cannot assess model performance in the multilingual setting. (ii) They do not reflect real-world scenarios where tables and texts frequently appear in non-English languages. To address the limitations, we propose the first multilingual TATQA dataset (MULTITAT). Specifically, we sample data from 3 mainstream TATQA datasets and translate it into 10 diverse languages. To align the model TATQA capabilities in English with other languages, we develop a baseline, Ours. Experimental results reveal that the performance on non-English data in MULTITAT drops by an average of 19.4% compared to English, proving the necessity of MULTITAT. We further analyze the reasons for this performance gap. Furthermore, Ours outperforms other baselines by an average of 3.3, demonstrating its effectiveness.
中文摘要:本研究提出首个多语言表格与文本问答数据集MULTITAT,通过将数据翻译为10种语言解决了仅限英语数据集的局限性,发现非英语语境下性能平均下降19.4%,并开发出优于其他基线模型3.3%的基准方法。
English Summary: The study introduces MULTITAT, the first multilingual dataset for table-and-text question answering, addressing the limitations of English-only datasets by translating data into 10 languages and revealing a 19.4% performance drop in non-English contexts, while proposing a baseline model that outperforms others by 3.3%.

Authors:Qin Zhu, Fei Huang, Runyu Peng, Keming Lu, Bowen Yu, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang, Junyang Lin
Title: AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models
Abstract:
While logical reasoning evaluation of Large Language Models (LLMs) has attracted significant attention, existing benchmarks predominantly rely on multiple-choice formats that are vulnerable to random guessing, leading to overestimated performance and substantial performance fluctuations. To obtain more accurate assessments of models' reasoning capabilities, we propose an automated method for synthesizing open-ended logic puzzles, and use it to develop a bilingual benchmark, AutoLogi. Our approach features program-based verification and controllable difficulty levels, enabling more reliable evaluation that better distinguishes models' reasoning abilities. Extensive evaluation of eight modern LLMs shows that AutoLogi can better reflect true model capabilities, with performance scores spanning from 35% to 73% compared to the narrower range of 21% to 37% on the source multiple-choice dataset. Beyond benchmark creation, this synthesis method can generate high-quality training data by incorporating program verifiers into the rejection sampling process, enabling systematic enhancement of LLMs' reasoning capabilities across diverse datasets.
中文: 现有评估大语言模型逻辑推理能力的多项选择题基准易受随机猜测影响,因此我们开发了AutoLogi,一种通过程序验证和可控难度自动生成开放式逻辑谜题的方法,能更准确评估模型能力并展现更广的性能差异。
English: Existing multiple-choice benchmarks for evaluating LLMs' logical reasoning are prone to random guessing, so we developed AutoLogi, an automated method for creating open-ended logic puzzles with program-based verification and controllable difficulty, providing more accurate assessments and revealing a wider performance range among models.

Authors:Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu, Guowei Niu, Zheyue Tan, Haiyang Xu, Zhuyu Yao, Tao Yuan, Dong Zhou, Yueqing Zhuang, Shengen Yan, Guohao Dai, Yu Wang
Title: Megrez-Omni Technical Report
Abstract:
In this work, we present the Megrez models, comprising a language model (Megrez-3B-Instruct) and a multimodal model (Megrez-3B-Omni). These models are designed to deliver fast inference, compactness, and robust edge-side intelligence through a software-hardware co-design approach. Megrez-3B-Instruct offers several advantages, including high accuracy, high speed, ease of use, and a wide range of applications. Building on Megrez-3B-Instruct, Megrez-3B-Omni is an on-device multimodal understanding LLM that supports image, text, and audio analysis. It achieves state-of-the-art accuracy across all three modalities and demonstrates strong versatility and robustness, setting a new benchmark for multimodal AI models.
中文: Megrez模型包括语言模型Megrez-3B-Instruct和多模态模型Megrez-3B-Omni,通过软硬件协同设计实现快速、紧凑且鲁棒的边缘智能,其中后者在图像、文本和音频分析中达到了业界领先的准确率。
English: The Megrez models, including the language model Megrez-3B-Instruct and the multimodal model Megrez-3B-Omni, are designed for fast, compact, and robust edge-side intelligence through software-hardware co-design, with the latter achieving state-of-the-art accuracy in image, text, and audio analysis.

Authors:Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez-Luna, Huawei Li, Xiaowei Li, Ying Wang, Onur Mutlu
Title: PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
Abstract:
Large language models (LLMs) are widely used for natural language understanding and text generation. An LLM model relies on a time-consuming step called LLM decoding to generate output tokens. Several prior works focus on improving the performance of LLM decoding using parallelism techniques, such as batching and speculative decoding. State-of-the-art LLM decoding has both compute-bound and memory-bound kernels. Some prior works statically identify and map these different kernels to a heterogeneous architecture consisting of both processing-in-memory (PIM) units and computation-centric accelerators. We observe that characteristics of LLM decoding kernels (e.g., whether or not a kernel is memory-bound) can change dynamically due to parameter changes to meet user and/or system demands, making (1) static kernel mapping to PIM units and computation-centric accelerators suboptimal, and (2) one-size-fits-all approach of designing PIM units inefficient due to a large degree of heterogeneity even in memory-bound kernels. In this paper, we aim to accelerate LLM decoding while considering the dynamically changing characteristics of the kernels involved. We propose PAPI (PArallel Decoding with PIM), a PIM-enabled heterogeneous architecture that exploits dynamic scheduling of compute-bound or memory-bound kernels to suitable hardware units. PAPI has two key mechanisms: (1) online kernel characterization to dynamically schedule kernels to the most suitable hardware units at runtime and (2) a PIM-enabled heterogeneous computing system that harmoniously orchestrates both computation-centric processing units and hybrid PIM units with different computing capabilities. Our experimental results on three broadly-used LLMs show that PAPI achieves 1.8$\times$ and 11.1$\times$ speedups over a state-of-the-art heterogeneous LLM accelerator and a state-of-the-art PIM-only LLM accelerator, respectively.
中文: PAPI是一种新型异构架构,通过动态调度计算密集型或内存密集型大语言模型解码内核至合适的硬件单元,相比现有加速器实现了显著性能提升。
English: PAPI is a novel heterogeneous architecture that dynamically schedules compute-bound or memory-bound LLM decoding kernels to appropriate hardware units, achieving significant speedups over existing accelerators.

Authors:Jingheng Ye, Shang Qin, Yinghui Li, Hai-Tao Zheng, Shen Wang, Qingsong Wen
Title: Corrections Meet Explanations: A Unified Framework for Explainable Grammatical Error Correction
Abstract:
Grammatical Error Correction (GEC) faces a critical challenge concerning explainability, notably when GEC systems are designed for language learners. Existing research predominantly focuses on explaining grammatical errors extracted in advance, thus neglecting the relationship between explanations and corrections. To address this gap, we introduce EXGEC, a unified explainable GEC framework that integrates explanation and correction tasks in a generative manner, advocating that these tasks mutually reinforce each other. Experiments have been conducted on EXPECT, a recent human-labeled dataset for explainable GEC, comprising around 20k samples. Moreover, we detect significant noise within EXPECT, potentially compromising model training and evaluation. Therefore, we introduce an alternative dataset named EXPECT-denoised, ensuring a more objective framework for training and evaluation. Results on various NLP models (BART, T5, and Llama3) show that EXGEC models surpass single-task baselines in both tasks, demonstrating the effectiveness of our approach.
中文:提出的EXGEC框架以生成式方法将语法纠错与解释相结合,在去噪数据上超越单任务模型,验证了任务间的相互增强作用。
English: The proposed EXGEC framework integrates grammatical error correction with explanations in a generative approach, outperforming single-task models on denoised data and demonstrating mutual reinforcement between the tasks.

Authors:Rameen Abdal, Or Patashnik, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman
Title: Dynamic Concepts Personalization from Single Videos
Abstract:
Personalizing generative text-to-image models has seen remarkable progress, but extending this personalization to text-to-video models presents unique challenges. Unlike static concepts, personalizing text-to-video models has the potential to capture dynamic concepts, i.e., entities defined not only by their appearance but also by their motion. In this paper, we introduce Set-and-Sequence, a novel framework for personalizing Diffusion Transformers (DiTs)-based generative video models with dynamic concepts. Our approach imposes a spatio-temporal weight space within an architecture that does not explicitly separate spatial and temporal features. This is achieved in two key stages. First, we fine-tune Low-Rank Adaptation (LoRA) layers using an unordered set of frames from the video to learn an identity LoRA basis that represents the appearance, free from temporal interference. In the second stage, with the identity LoRAs frozen, we augment their coefficients with Motion Residuals and fine-tune them on the full video sequence, capturing motion dynamics. Our Set-and-Sequence framework results in a spatio-temporal weight space that effectively embeds dynamic concepts into the video model's output domain, enabling unprecedented editability and compositionality while setting a new benchmark for personalizing dynamic concepts.
中文: Set-and-Sequence框架通过无序帧学习外观特征和序列训练捕捉运动动态的两阶段方法,实现了文本到视频模型中动态概念的个性化,为动态概念编辑设立了新标准。
English: The Set-and-Sequence framework introduces a novel two-stage approach for personalizing text-to-video models by first learning appearance through unordered frames and then capturing motion dynamics via sequential training, enabling unprecedented editability of dynamic concepts.

Authors:Suhas Gopal, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt
Title: Betsu-Betsu: Multi-View Separable 3D Reconstruction of Two Interacting Objects
Abstract:
Separable 3D reconstruction of multiple objects from multi-view RGB images -- resulting in two different 3D shapes for the two objects with a clear separation between them -- remains a sparsely researched problem. It is challenging due to severe mutual occlusions and ambiguities along the objects' interaction boundaries. This paper investigates the setting and introduces a new neuro-implicit method that can reconstruct the geometry and appearance of two objects undergoing close interactions while disjoining both in 3D, avoiding surface inter-penetrations and enabling novel-view synthesis of the observed scene. The framework is end-to-end trainable and supervised using a novel alpha-blending regularisation that ensures that the two geometries are well separated even under extreme occlusions. Our reconstruction method is markerless and can be applied to rigid as well as articulated objects. We introduce a new dataset consisting of close interactions between a human and an object and also evaluate on two scenes of humans performing martial arts. The experiments confirm the effectiveness of our framework and substantial improvements using 3D and novel view synthesis metrics compared to several existing approaches applicable in our setting.
中文: 本文提出了一种神经隐式方法,能够从多视角图像中对多个物体进行可分离的三维重建,有效处理相互遮挡并避免表面穿透,同时实现场景的新视角合成。
English: This paper introduces a neuro-implicit method for separable 3D reconstruction of multiple objects from multi-view images, effectively handling mutual occlusions and preventing surface inter-penetrations while enabling novel-view synthesis.

Authors:Ruiming Tang, Chenxu Zhu, Bo Chen, Weipeng Zhang, Menghui Zhu, Xinyi Dai, Huifeng Guo
Title: LLM4Tag: Automatic Tagging System for Information Retrieval via Large Language Models
Abstract:
Tagging systems play an essential role in various information retrieval applications such as search engines and recommender systems. Recently, Large Language Models (LLMs) have been applied in tagging systems due to their extensive world knowledge, semantic understanding, and reasoning capabilities. Despite achieving remarkable performance, existing methods still have limitations, including difficulties in retrieving relevant candidate tags comprehensively, challenges in adapting to emerging domain-specific knowledge, and the lack of reliable tag confidence quantification. To address these three limitations above, we propose an automatic tagging system LLM4Tag. First, a graph-based tag recall module is designed to effectively and comprehensively construct a small-scale highly relevant candidate tag set. Subsequently, a knowledge-enhanced tag generation module is employed to generate accurate tags with long-term and short-term knowledge injection. Finally, a tag confidence calibration module is introduced to generate reliable tag confidence scores. Extensive experiments over three large-scale industrial datasets show that LLM4Tag significantly outperforms the state-of-the-art baselines and LLM4Tag has been deployed online for content tagging to serve hundreds of millions of users.
中文:LLM4Tag是一种先进的自动标注系统,通过图基召回、知识增强生成和置信度校准模块,有效解决了标注召回、领域适应性和置信度量化问题,并在大规模应用中展现出卓越性能。
English: LLM4Tag is an advanced automatic tagging system that overcomes limitations in tag recall, domain adaptability, and confidence scoring through its graph-based recall, knowledge-enhanced generation, and confidence calibration modules, demonstrating superior performance in large-scale applications.

Authors:Shu Yang, Shenzhe Zhu, Zeyu Wu, Keyu Wang, Junchi Yao, Junchao Wu, Lijie Hu, Mengdi Li, Derek F. Wong, Di Wang
Title: Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements
Abstract:
We introduce Fraud-R1, a benchmark designed to evaluate LLMs' ability to defend against internet fraud and phishing in dynamic, real-world scenarios. Fraud-R1 comprises 8,564 fraud cases sourced from phishing scams, fake job postings, social media, and news, categorized into 5 major fraud types. Unlike previous benchmarks, Fraud-R1 introduces a multi-round evaluation pipeline to assess LLMs' resistance to fraud at different stages, including credibility building, urgency creation, and emotional manipulation. Furthermore, we evaluate 15 LLMs under two settings: 1. Helpful-Assistant, where the LLM provides general decision-making assistance, and 2. Role-play, where the model assumes a specific persona, widely used in real-world agent-based interactions. Our evaluation reveals the significant challenges in defending against fraud and phishing inducement, especially in role-play settings and fake job postings. Additionally, we observe a substantial performance gap between Chinese and English, underscoring the need for improved multilingual fraud detection capabilities.
中文:Fraud-R1是一个包含8,564个现实世界欺诈案例的综合性基准,通过多轮评估流程揭示了大型语言模型在欺诈检测中的显著挑战,尤其在角色扮演场景和多语言环境下表现突出。
English: Fraud-R1 is a comprehensive benchmark with 8,564 real-world fraud cases across five categories, featuring a multi-round evaluation pipeline that reveals significant challenges for LLMs in fraud detection, particularly in role-play scenarios and multilingual contexts.

Authors:Yu Liang, Aofeng Shen, Chun Jason Xue, Riwei Pan, Haiyu Mao, Nika Mansouri Ghiasi, Qingcai Jiang, Rakesh Nadig, Lei Li, Rachata Ausavarungnirun, Mohammad Sadrosadati, Onur Mutlu
Title: Ariadne: A Hotness-Aware and Size-Adaptive Compressed Swap Technique for Fast Application Relaunch and Reduced CPU Usage on Mobile Devices
Abstract:
Growing application memory demands and concurrent usage are making mobile device memory scarce. When memory pressure is high, current mobile systems use a RAM-based compressed swap scheme (called ZRAM) to compress unused execution-related data (called anonymous data in Linux) in main memory. We observe that the state-of-the-art ZRAM scheme prolongs relaunch latency and wastes CPU time because it does not differentiate between hot and cold data or leverage different compression chunk sizes and data locality. We make three new observations. 1) anonymous data has different levels of hotness. Hot data, used during application relaunch, is usually similar between consecutive relaunches. 2) when compressing the same amount of anonymous data, small-size compression is very fast, while large-size compression achieves a better compression ratio. 3) there is locality in data access during application relaunch. We propose Ariadne, a compressed swap scheme for mobile devices that reduces relaunch latency and CPU usage with three key techniques. 1) a low-overhead hotness-aware data organization scheme aims to quickly identify the hotness of anonymous data without significant overhead. 2) a size-adaptive compression scheme uses different compression chunk sizes based on the data's hotness level to ensure fast decompression of hot and warm data. 3) a proactive decompression scheme predicts the next set of data to be used and decompresses it in advance, reducing the impact of data swapping back into main memory during application relaunch. Our experimental evaluation results on Google Pixel 7 show that, on average, Ariadne reduces application relaunch latency by 50% and decreases the CPU usage of compression and decompression procedures by 15% compared to the state-of-the-art ZRAM scheme.
Chinese: Ariadne是一种创新的移动设备压缩交换方案,通过热感知数据组织、自适应大小压缩和主动解压技术,将应用重启延迟降低50%,CPU使用率减少15%,显著优于现有的ZRAM方案。
English: Ariadne is a novel compressed swap scheme for mobile devices that reduces application relaunch latency by 50% and CPU usage by 15% through hotness-aware data organization, size-adaptive compression, and proactive decompression, outperforming the current ZRAM system.

Authors:Xinyi Yang, Liang Zeng, Heng Dong, Chao Yu, Xiaoran Wu, Huazhong Yang, Yu Wang, Milind Tambe, Tonghan Wang
Title: Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards
Abstract:
As humans increasingly share environments with diverse agents powered by RL, LLMs, and beyond, the ability to explain their policies in natural language will be vital for reliable coexistence. In this paper, we build a model-agnostic explanation generator based on an LLM. The technical novelty is that the rewards for training this LLM are generated by a generative flow matching model. This model has a specially designed structure with a hidden layer merged with an LLM to harness the linguistic cues of explanations into generating appropriate rewards. Experiments on both RL and LLM tasks demonstrate that our method can generate dense and effective rewards while saving on expensive human feedback; it thus enables effective explanations and even improves the accuracy of the decisions in original tasks.
中文: 本文提出了一种基于大语言模型的模型无关解释生成器,通过结合语言线索的生成流匹配模型产生训练奖励,无需昂贵的人工反馈即可生成密集有效的奖励,从而提升解释效果和决策准确性。
English: This paper introduces a model-agnostic explanation generator using an LLM, which is trained with rewards produced by a generative flow matching model that incorporates linguistic cues to create dense and effective rewards, reducing the need for costly human feedback while improving explanation quality and decision accuracy.

Authors:Hao Liu, Zhengren Wang, Xi Chen, Zhiyu Li, Feiyu Xiong, Qinhan Yu, Wentao Zhang
Title: HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) systems often struggle with imperfect retrieval, as traditional retrievers focus on lexical or semantic similarity rather than logical relevance. To address this, we propose \textbf{HopRAG}, a novel RAG framework that augments retrieval with logical reasoning through graph-structured knowledge exploration. During indexing, HopRAG constructs a passage graph, with text chunks as vertices and logical connections established via LLM-generated pseudo-queries as edges. During retrieval, it employs a \textit{retrieve-reason-prune} mechanism: starting with lexically or semantically similar passages, the system explores multi-hop neighbors guided by pseudo-queries and LLM reasoning to identify truly relevant ones. Experiments on multiple multi-hop benchmarks demonstrate that HopRAG's \textit{retrieve-reason-prune} mechanism can expand the retrieval scope based on logical connections and improve final answer quality.
中文: HopRAG通过图结构的知识探索和检索-推理-剪枝机制,在检索增强生成中融入逻辑推理,有效提升了多跳任务中的检索相关性和答案质量。
English: HopRAG enhances retrieval-augmented generation by incorporating logical reasoning through graph-based knowledge exploration, using a retrieve-reason-prune mechanism to improve relevance and answer quality in multi-hop benchmarks.

Authors:Yunlong Feng, Bohan Li, Xiaoming Shi, Qingfu Zhu, Wanxiang Che
Title: ReF Decompile: Relabeling and Function Call Enhanced Decompile
Abstract:
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages, enabling analysis in scenarios where source code is unavailable. This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration. The end-to-end decompile method based on large langauge models (LLMs) reduces reliance on additional tools and minimizes manual intervention due to its inherent properties. However, previous end-to-end methods often lose critical information necessary for reconstructing control flow structures and variables when processing binary files, making it challenging to accurately recover the program's logic. To address these issues, we propose the \textbf{ReF Decompile} method, which incorporates the following innovations: (1) The Relabelling strategy replaces jump target addresses with labels, preserving control flow clarity. (2) The Function Call strategy infers variable types and retrieves missing variable information from binary files. Experimental results on the Humaneval-Decompile Benchmark demonstrate that ReF Decompile surpasses comparable baselines and achieves state-of-the-art (SOTA) performance of $61.43\%$.
中文摘要:ReF 反编译方法通过重标记和函数调用策略,有效保留控制流结构并恢复变量信息,在基准测试中以61.43%的准确率实现最优性能。
English Summary: The ReF Decompile method enhances decompilation by using relabelling and function call strategies to preserve control flow structures and recover variable information, achieving state-of-the-art performance of 61.43% on benchmarks.

Authors:Chendong Wang, Anlan Zhang, Yifan Yang, Lili Qiu, Yuqing Yang, Xinyang Jiang, Feng Qian, Suman Banerjee
Title: VoLUT: Efficient Volumetric streaming enhanced by LUT-based super-resolution
Abstract:
3D volumetric video provides immersive experience and is gaining traction in digital media. Despite its rising popularity, the streaming of volumetric video content poses significant challenges due to the high data bandwidth requirement. A natural approach to mitigate the bandwidth issue is to reduce the volumetric video's data rate by downsampling the content prior to transmission. The video can then be upsampled at the receiver's end using a super-resolution (SR) algorithm to reconstruct the high-resolution details. While super-resolution techniques have been extensively explored and advanced for 2D video content, there is limited work on SR algorithms tailored for volumetric videos. To address this gap and the growing need for efficient volumetric video streaming, we have developed VoLUT with a new SR algorithm specifically designed for volumetric content. Our algorithm uniquely harnesses the power of lookup tables (LUTs) to facilitate the efficient and accurate upscaling of low-resolution volumetric data. The use of LUTs enables our algorithm to quickly reference precomputed high-resolution values, thereby significantly reducing the computational complexity and time required for upscaling. We further apply adaptive video bit rate algorithm (ABR) to dynamically determine the downsampling rate according to the network condition and stream the selected video rate to the receiver. Compared to related work, VoLUT is the first to enable high-quality 3D SR on commodity mobile devices at line-rate. Our evaluation shows VoLUT can reduce bandwidth usage by 70% , boost QoE by 36.7% for volumetric video streaming and achieve 3D SR speed-up with no quality compromise.
中文: VoLUT采用创新的查找表超分辨率算法,有效提升低分辨率立体视频质量,在移动设备上实现带宽降低70%和流媒体体验优化,且无性能损失。
English: VoLUT introduces a novel super-resolution algorithm using lookup tables to efficiently upscale low-resolution volumetric video, reducing bandwidth by 70% and enhancing streaming quality on mobile devices without compromising performance.

Authors:Zhihang Yuan, Siyuan Wang, Rui Xie, Hanling Zhang, Tongcheng Fang, Yuzhang Shang, Shengen Yan, Guohao Dai, Yu Wang
Title: DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation
Abstract:
In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a training-free paradigm that can make use of adaptive temporal compression in latent space. While existing video generative models apply fixed compression rates via pretrained VAE, we observe that real-world video content exhibits substantial temporal non-uniformity, with high-motion segments containing more information than static scenes. Based on this insight, DLFR-VAE dynamically adjusts the latent frame rate according to the content complexity. Specifically, DLFR-VAE comprises two core innovations: (1) A Dynamic Latent Frame Rate Scheduler that partitions videos into temporal chunks and adaptively determines optimal frame rates based on information-theoretic content complexity, and (2) A training-free adaptation mechanism that transforms pretrained VAE architectures into a dynamic VAE that can process features with variable frame rates. Our simple but effective DLFR-VAE can function as a plug-and-play module, seamlessly integrating with existing video generation models and accelerating the video generation process.
中文: 本文提出DLFR-VAE,这是一种无需训练的方法,能根据视频内容复杂度动态调整潜在帧率,以提升时序压缩效率并加速视频生成过程。
English: The paper introduces DLFR-VAE, a training-free method that dynamically adjusts latent frame rates based on video content complexity to enhance temporal compression efficiency and accelerate video generation.

Authors:Deqing Zou, Jingheng Ye, Yulu Liu, Yu Wu, Zishan Xu, Yinghui Li, Hai-Tao Zheng, Bingxu An, Zhao Wei, Yong Xu
Title: Revisiting Classification Taxonomy for Grammatical Errors
Abstract:
Grammatical error classification plays a crucial role in language learning systems, but existing classification taxonomies often lack rigorous validation, leading to inconsistencies and unreliable feedback. In this paper, we revisit previous classification taxonomies for grammatical errors by introducing a systematic and qualitative evaluation framework. Our approach examines four aspects of a taxonomy, i.e., exclusivity, coverage, balance, and usability. Then, we construct a high-quality grammatical error classification dataset annotated with multiple classification taxonomies and evaluate them grounding on our proposed evaluation framework. Our experiments reveal the drawbacks of existing taxonomies. Our contributions aim to improve the precision and effectiveness of error analysis, providing more understandable and actionable feedback for language learners.
中文: 本文提出一个系统性框架来评估语法错误分类体系,揭示现有分类的不足,旨在为语言学习者提供更精确和实用的反馈。
English: This paper introduces a systematic framework to evaluate grammatical error classification taxonomies, revealing their shortcomings and contributing to more precise and actionable feedback for language learners.

Authors:Jiaru Zhang, Rui Ding, Qiang Fu, Bojun Huang, Zizhen Deng, Yang Hua, Haibing Guan, Shi Han, Dongmei Zhang
Title: Learning Identifiable Structures Helps Avoid Bias in DNN-based Supervised Causal Learning
Abstract:
Causal discovery is a structured prediction task that aims to predict causal relations among variables based on their data samples. Supervised Causal Learning (SCL) is an emerging paradigm in this field. Existing Deep Neural Network (DNN)-based methods commonly adopt the "Node-Edge approach", in which the model first computes an embedding vector for each variable-node, then uses these variable-wise representations to concurrently and independently predict for each directed causal-edge. In this paper, we first show that this architecture has some systematic bias that cannot be mitigated regardless of model size and data size. We then propose SiCL, a DNN-based SCL method that predicts a skeleton matrix together with a v-tensor (a third-order tensor representing the v-structures). According to the Markov Equivalence Class (MEC) theory, both the skeleton and the v-structures are identifiable causal structures under the canonical MEC setting, so predictions about skeleton and v-structures do not suffer from the identifiability limit in causal discovery, thus SiCL can avoid the systematic bias in Node-Edge architecture, and enable consistent estimators for causal discovery. Moreover, SiCL is also equipped with a specially designed pairwise encoder module with a unidirectional attention layer to model both internal and external relationships of pairs of nodes. Experimental results on both synthetic and real-world benchmarks show that SiCL significantly outperforms other DNN-based SCL approaches.
中文:本文提出SiCL方法,一种基于深度神经网络的监督因果学习方法,通过预测可识别的因果结构(骨架矩阵和V-结构张量)来克服现有节点-边缘架构的系统性偏差,实现了因果发现的一致性估计,并在基准测试中展现出卓越性能。
English: This paper introduces SiCL, a Deep Neural Network-based Supervised Causal Learning method that overcomes systematic bias in existing Node-Edge approaches by predicting identifiable causal structures—skeleton matrices and v-tensors—thereby enabling consistent estimators and demonstrating superior performance on benchmarks.

Authors:Yinghui Li, Jiayi Kuang, Haojing Huang, Zhikun Xu, Xinnian Liang, Yi Yu, Wenlian Lu, Yangning Li, Xiaoyu Tan, Chao Qu, Ying Shen, Hai-Tao Zheng, Philip S. Yu
Title: One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs
Abstract:
Leveraging mathematical Large Language Models (LLMs) for proof generation is a fundamental topic in LLMs research. We argue that the ability of current LLMs to prove statements largely depends on whether they have encountered the relevant proof process during training. This reliance limits their deeper understanding of mathematical theorems and related concepts. Inspired by the pedagogical method of "proof by counterexamples" commonly used in human mathematics education, our work aims to enhance LLMs' ability to conduct mathematical reasoning and proof through counterexamples. Specifically, we manually create a high-quality, university-level mathematical benchmark, CounterMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts. Additionally, we develop a data engineering framework to automatically obtain training data for further model improvement. Extensive experiments and detailed analyses demonstrate that CounterMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs' counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities. We believe that our work offers new perspectives on the community of mathematical LLMs.
中文摘要:本研究提出CounterMATH基准,通过反例证明增强数学大语言模型的推理能力,揭示了现有模型的局限性,并强调反例驱动训练对于深化数学理解的重要性。
English Summary: This study introduces CounterMATH, a benchmark for enhancing mathematical LLMs' reasoning through counterexample-based proofs, revealing current models' limitations and emphasizing the importance of counterexample-driven training for deeper mathematical understanding.

Authors:Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, Yuqing Yang
Title: Region-Adaptive Sampling for Diffusion Transformers
Abstract:
Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, failing to leverage variations across spatial regions within the image due to the constraints of convolutional U-Net structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in handling variable number of tokens, we introduce RAS, a novel, training-free sampling strategy that dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. Our key observation is that during each sampling step, the model concentrates on semantically meaningful regions, and these areas of focus exhibit strong continuity across consecutive steps. Leveraging this insight, RAS updates only the regions currently in focus, while other regions are updated using cached noise from the previous step. The model's focus is determined based on the output from the preceding step, capitalizing on the temporal consistency we observed. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality. Additionally, a user study reveals that RAS delivers comparable qualities under human evaluation while achieving a 1.6x speedup. Our approach makes a significant step towards more efficient diffusion transformers, enhancing their potential for real-time applications.
中文: 扩散变换器(DiTs)实现了名为RAS的无训练采样新策略,该策略在生成过程中动态聚焦于图像的语义重要区域,通过重用非关键区域的缓存噪声,在最小质量损失下实现了高达2.5倍的加速效果。
English: Diffusion Transformers (DiTs) enable a novel training-free sampling strategy called RAS, which dynamically focuses on semantically important image regions during generation, achieving up to 2.5x speedup with minimal quality loss by reusing cached noise for non-critical areas.

Authors:Lin Zhang, Lijie Hu, Di Wang
Title: Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning
Abstract:
Transformer-based language models have achieved significant success; however, their internal mechanisms remain largely opaque due to the complexity of non-linear interactions and high-dimensional operations. While previous studies have demonstrated that these models implicitly embed reasoning trees, humans typically employ various distinct logical reasoning mechanisms to complete the same task. It is still unclear which multi-step reasoning mechanisms are used by language models to solve such tasks. In this paper, we aim to address this question by investigating the mechanistic interpretability of language models, particularly in the context of multi-step reasoning tasks. Specifically, we employ circuit analysis and self-influence functions to evaluate the changing importance of each token throughout the reasoning process, allowing us to map the reasoning paths adopted by the model. We apply this methodology to the GPT-2 model on a prediction task (IOI) and demonstrate that the underlying circuits reveal a human-interpretable reasoning process used by the model.
Chinese: 本研究通过电路分析和自影响函数探究基于Transformer的语言模型的机制可解释性,揭示了GPT-2在多步推理任务中采用的人类可理解的推理路径。
English: This study investigates the mechanistic interpretability of transformer-based language models, using circuit analysis and self-influence functions to map their reasoning paths and revealing human-interpretable processes in GPT-2's multi-step reasoning tasks.

Authors:Christen Millerdurai, Hiroyasu Akada, Jian Wang, Diogo Luvizon, Alain Pagani, Didier Stricker, Christian Theobalt, Vladislav Golyanik
Title: EventEgo3D++: 3D Human Motion Capture from a Head-Mounted Event Camera
Abstract:
Monocular egocentric 3D human motion capture remains a significant challenge, particularly under conditions of low lighting and fast movements, which are common in head-mounted device applications. Existing methods that rely on RGB cameras often fail under these conditions. To address these limitations, we introduce EventEgo3D++, the first approach that leverages a monocular event camera with a fisheye lens for 3D human motion capture. Event cameras excel in high-speed scenarios and varying illumination due to their high temporal resolution, providing reliable cues for accurate 3D human motion capture. EventEgo3D++ leverages the LNES representation of event streams to enable precise 3D reconstructions. We have also developed a mobile head-mounted device (HMD) prototype equipped with an event camera, capturing a comprehensive dataset that includes real event observations from both controlled studio environments and in-the-wild settings, in addition to a synthetic dataset. Additionally, to provide a more holistic dataset, we include allocentric RGB streams that offer different perspectives of the HMD wearer, along with their corresponding SMPL body model. Our experiments demonstrate that EventEgo3D++ achieves superior 3D accuracy and robustness compared to existing solutions, even in challenging conditions. Moreover, our method supports real-time 3D pose updates at a rate of 140Hz. This work is an extension of the EventEgo3D approach (CVPR 2024) and further advances the state of the art in egocentric 3D human motion capture. For more details, visit the project page at https://eventego3d.mpi-inf.mpg.de.
中文:EventEgo3D++首次采用单目事件相机实现3D人体运动捕捉,在弱光与高速场景下表现卓越,具备140Hz实时更新能力与更高精度。
English: EventEgo3D++ introduces the first monocular event camera-based method for 3D human motion capture, excelling in low light and fast motion with high accuracy and real-time 140Hz performance.

Authors:Yinghui Li, Haojing Huang, Jiayi Kuang, Yangning Li, Shu-Yu Guo, Chao Qu, Xiaoyu Tan, Hai-Tao Zheng, Ying Shen, Philip S. Yu
Title: Refine Knowledge of Large Language Models via Adaptive Contrastive Learning
Abstract:
How to alleviate the hallucinations of Large Language Models (LLMs) has always been the fundamental goal pursued by the LLMs research community. Looking through numerous hallucination-related studies, a mainstream category of methods is to reduce hallucinations by optimizing the knowledge representation of LLMs to change their output. Considering that the core focus of these works is the knowledge acquired by models, and knowledge has long been a central theme in human societal progress, we believe that the process of models refining knowledge can greatly benefit from the way humans learn. In our work, by imitating the human learning process, we design an Adaptive Contrastive Learning strategy. Our method flexibly constructs different positive and negative samples for contrastive learning based on LLMs' actual mastery of knowledge. This strategy helps LLMs consolidate the correct knowledge they already possess, deepen their understanding of the correct knowledge they have encountered but not fully grasped, forget the incorrect knowledge they previously learned, and honestly acknowledge the knowledge they lack. Extensive experiments and detailed analyses on widely used datasets demonstrate the effectiveness of our method.
中文: 本研究提出了一种自适应对比学习策略,通过模拟人类学习过程,帮助大语言模型巩固已有正确知识、深化理解未完全掌握概念、摒弃错误信息并承认知识盲区,经广泛实验验证能有效缓解模型幻觉问题。
English: This study introduces an Adaptive Contrastive Learning strategy that mimics human learning to help LLMs consolidate correct knowledge, deepen understanding of partially grasped concepts, discard incorrect information, and acknowledge knowledge gaps, effectively reducing hallucinations as validated by extensive experiments.

Authors:Lin Zhang, Wenshuo Dong, Zhuoran Zhang, Shu Yang, Lijie Hu, Ninghao Liu, Pan Zhou, Di Wang
Title: EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification
Abstract:
Understanding the internal mechanisms of transformer-based language models remains challenging. Mechanistic interpretability based on circuit discovery aims to reverse engineer neural networks by analyzing their internal processes at the level of computational subgraphs. In this paper, we revisit existing gradient-based circuit identification methods and find that their performance is either affected by the zero-gradient problem or saturation effects, where edge attribution scores become insensitive to input changes, resulting in noisy and unreliable attribution evaluations for circuit components. To address the saturation effect, we propose Edge Attribution Patching with GradPath (EAP-GP), EAP-GP introduces an integration path, starting from the input and adaptively following the direction of the difference between the gradients of corrupted and clean inputs to avoid the saturated region. This approach enhances attribution reliability and improves the faithfulness of circuit identification. We evaluate EAP-GP on 6 datasets using GPT-2 Small, GPT-2 Medium, and GPT-2 XL. Experimental results demonstrate that EAP-GP outperforms existing methods in circuit faithfulness, achieving improvements up to 17.7%. Comparisons with manually annotated ground-truth circuits demonstrate that EAP-GP achieves precision and recall comparable to or better than previous approaches, highlighting its effectiveness in identifying accurate circuits.
中文摘要:本文提出基于梯度路径的边缘归因修补方法(EAP-GP),通过自适应追踪梯度差异路径来克服电路识别中的梯度饱和效应,在多个数据集上实现最高17.7%的忠实度提升,显著优于现有方法。
English Summary: This paper introduces Edge Attribution Patching with GradPath (EAP-GP), a novel method that overcomes gradient saturation effects in circuit identification by adaptively following gradient difference paths, achieving up to 17.7% improvement in faithfulness over existing approaches.

Authors:Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Jianfeng Gao
Title: On Memory Construction and Retrieval for Personalized Conversational Agents
Abstract:
To deliver coherent and personalized experiences in long-term conversations, existing approaches typically perform retrieval augmented response generation by constructing memory banks from conversation history at either the turn-level, session-level, or through summarization techniques.In this paper, we present two key findings: (1) The granularity of memory unit matters: turn-level, session-level, and summarization-based methods each exhibit limitations in both memory retrieval accuracy and the semantic quality of the retrieved content. (2) Prompt compression methods, such as LLMLingua-2, can effectively serve as a denoising mechanism, enhancing memory retrieval accuracy across different granularities. Building on these insights, we propose SeCom, a method that constructs the memory bank at segment level by introducing a conversation segmentation model that partitions long-term conversations into topically coherent segments, while applying compression based denoising on memory units to enhance memory retrieval. Experimental results show that SeCom exhibits a significant performance advantage over baselines on long-term conversation benchmarks LOCOMO and Long-MT-Bench+. Additionally, the proposed conversation segmentation method demonstrates superior performance on dialogue segmentation datasets such as DialSeg711, TIAGE, and SuperDialSeg.
中文: 现有长期对话方法采用不同粒度的记忆库,但各自在检索准确性和语义质量上存在局限,而提示压缩能提升检索效果;提出的SeCom方法通过将对话分割为连贯主题并应用基于压缩的去噪,显著提高了性能。
English: Existing methods for long-term conversations use memory banks at different granularities, but each has limitations in retrieval accuracy and semantic quality, while prompt compression can enhance retrieval; the proposed SeCom method improves performance by segmenting conversations into coherent topics and applying compression-based denoising.

Authors:Zelai Xu, Wanjun Gu, Chao Yu, Yi Wu, Yu Wang
Title: Learning Strategic Language Agents in the Werewolf Game with Iterative Latent Space Policy Optimization
Abstract:
Large language model (LLM) agents have recently demonstrated impressive capabilities in various domains like open-ended conversation and multi-step decision-making. However, it remains challenging for these agents to solve strategic language games, such as Werewolf, which demand both strategic decision-making and free-form language interactions. Existing LLM agents often suffer from intrinsic bias in their action distributions and limited exploration of the unbounded text action space, resulting in suboptimal performance. To address these challenges, we propose Latent Space Policy Optimization (LSPO), an iterative framework that combines game-theoretic methods with LLM fine-tuning to build strategic language agents. LSPO leverages the observation that while the language space is combinatorially large, the underlying strategy space is relatively compact. We first map free-form utterances into a finite latent strategy space, yielding an abstracted extensive-form game. Then we apply game-theoretic methods like Counterfactual Regret Minimization (CFR) to optimize the policy in the latent space. Finally, we fine-tune the LLM via Direct Preference Optimization (DPO) to align with the learned policy. By iteratively alternating between these steps, our LSPO agents progressively enhance both strategic reasoning and language communication. Experiment on the Werewolf game shows that our agents iteratively expand the strategy space with improving performance and outperform existing Werewolf agents, underscoring their effectiveness in free-form language games with strategic interactions.
中文: 提出的潜在空间策略优化(LSPO)框架将博弈论与大语言模型微调相结合,有效提升了语言游戏中的策略推理能力,在狼人杀游戏中表现优于现有智能体。
English: The proposed Latent Space Policy Optimization (LSPO) framework combines game theory with LLM fine-tuning to enhance strategic reasoning in language games, demonstrating superior performance in Werewolf compared to existing agents.

Authors:Parth Atulbhai Gandhi, Prasanna N. Wudali, Yonatan Amaru, Yuval Elovici, Asaf Shabtai
Title: SHIELD: APT Detection and Intelligent Explanation Using LLM
Abstract:
Advanced persistent threats (APTs) are sophisticated cyber attacks that can remain undetected for extended periods, making their mitigation particularly challenging. Given their persistence, significant effort is required to detect them and respond effectively. Existing provenance-based attack detection methods often lack interpretability and suffer from high false positive rates, while investigation approaches are either supervised or limited to known attacks. To address these challenges, we introduce SHIELD, a novel approach that combines statistical anomaly detection and graph-based analysis with the contextual analysis capabilities of large language models (LLMs). SHIELD leverages the implicit knowledge of LLMs to uncover hidden attack patterns in provenance data, while reducing false positives and providing clear, interpretable attack descriptions. This reduces analysts' alert fatigue and makes it easier for them to understand the threat landscape. Our extensive evaluation demonstrates SHIELD's effectiveness and computational efficiency in real-world scenarios. SHIELD was shown to outperform state-of-the-art methods, achieving higher precision and recall. SHIELD's integration of anomaly detection, LLM-driven contextual analysis, and advanced graph-based correlation establishes a new benchmark for APT detection.
中文: SHIELD是一种新型高级持续性威胁检测系统,它融合统计异常检测、图分析与大语言模型,能够以高精度揭示隐蔽攻击模式,显著降低误报率并提供清晰可解释的分析结果。
English: SHIELD is a novel APT detection system that integrates statistical anomaly detection, graph-based analysis, and large language models to uncover hidden attack patterns with high precision, reduced false positives, and clear interpretability.

Authors:Prasanna N. Wudali, Moshe Kravchik, Ehud Malul, Parth A. Gandhi, Yuval Elovici, Asaf Shabtai
Title: Rule-ATT&CK Mapper (RAM): Mapping SIEM Rules to TTPs Using LLMs
Abstract:
The growing frequency of cyberattacks has heightened the demand for accurate and efficient threat detection systems. SIEM platforms are important for analyzing log data and detecting adversarial activities through rule-based queries, also known as SIEM rules. The efficiency of the threat analysis process relies heavily on mapping these SIEM rules to the relevant attack techniques in the MITRE ATT&CK framework. Inaccurate annotation of SIEM rules can result in the misinterpretation of attacks, increasing the likelihood that threats will be overlooked. Existing solutions for annotating SIEM rules with MITRE ATT&CK technique labels have notable limitations: manual annotation of SIEM rules is both time-consuming and prone to errors, and ML-based approaches mainly focus on annotating unstructured free text sources rather than structured data like SIEM rules. Structured data often contains limited information, further complicating the annotation process and making it a challenging task. To address these challenges, we propose Rule-ATT&CK Mapper (RAM), a novel framework that leverages LLMs to automate the mapping of structured SIEM rules to MITRE ATT&CK techniques. RAM's multi-stage pipeline, which was inspired by the prompt chaining technique, enhances mapping accuracy without requiring LLM pre-training or fine-tuning. Using the Splunk Security Content dataset, we evaluate RAM's performance using several LLMs, including GPT-4-Turbo, Qwen, IBM Granite, and Mistral. Our evaluation highlights GPT-4-Turbo's superior performance, which derives from its enriched knowledge base, and an ablation study emphasizes the importance of external contextual knowledge in overcoming the limitations of LLMs' implicit knowledge for domain-specific tasks. These findings demonstrate RAM's potential in automating cybersecurity workflows and provide valuable insights for future advancements in this field.
中文: 网络攻击日益频繁,亟需精准威胁检测;提出的Rule-ATT&CK Mapper (RAM)框架利用大语言模型自动将结构化SIEM规则映射至MITRE ATT&CK攻击技术,无需预训练或微调即可提升准确性。
English: The increasing prevalence of cyberattacks necessitates precise threat detection, and the proposed Rule-ATT&CK Mapper (RAM) framework effectively automates the mapping of structured SIEM rules to MITRE ATT&CK techniques using LLMs, enhancing accuracy without requiring model pre-training or fine-tuning.

Authors:Zelai Xu, Ruize Zhang, Chao Yu, Huining Yuan, Xiangmin Yi, Shilong Ji, Chuqi Wang, Wenhao Tang, Feng Gao, Wenbo Ding, Xinlei Chen, Yu Wang
Title: VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play
Abstract:
Robot sports, characterized by well-defined objectives, explicit rules, and dynamic interactions, present ideal scenarios for demonstrating embodied intelligence. In this paper, we present VolleyBots, a novel robot sports testbed where multiple drones cooperate and compete in the sport of volleyball under physical dynamics. VolleyBots integrates three features within a unified platform: competitive and cooperative gameplay, turn-based interaction structure, and agile 3D maneuvering. Competitive and cooperative gameplay challenges each drone to coordinate with its teammates while anticipating and countering opposing teams' tactics. Turn-based interaction demands precise timing, accurate state prediction, and management of long-horizon temporal dependencies. Agile 3D maneuvering requires rapid accelerations, sharp turns, and precise 3D positioning despite the quadrotor's underactuated dynamics. These intertwined features yield a complex problem combining motion control and strategic play, with no available expert demonstrations. We provide a comprehensive suite of tasks ranging from single-drone drills to multi-drone cooperative and competitive tasks, accompanied by baseline evaluations of representative multi-agent reinforcement learning (MARL) and game-theoretic algorithms. Simulation results show that on-policy reinforcement learning (RL) methods outperform off-policy methods in single-agent tasks, but both approaches struggle in complex tasks that combine motion control and strategic play. We additionally design a hierarchical policy which achieves a 69.5% percent win rate against the strongest baseline in the 3 vs 3 task, underscoring its potential as an effective solution for tackling the complex interplay between low-level control and high-level strategy. The project page is at https://sites.google.com/view/thu-volleybots.
中文: VolleyBots是一个创新的无人机机器人运动测试平台,通过排球比赛整合了竞争合作机制、回合制交互和敏捷三维机动能力来研究具身智能,其分层策略在3对3任务中取得了69.5%的胜率。
English: VolleyBots is a novel drone-based robot sports testbed that integrates competitive-cooperative gameplay, turn-based interactions, and agile 3D maneuvering to study embodied intelligence through volleyball matches, with a hierarchical policy achieving a 69.5% win rate in 3v3 scenarios.

Authors:Jijia Liu, Feng Gao, Qingmin Liao, Chao Yu, Yu Wang
Title: Learning from Suboptimal Data in Continuous Control via Auto-Regressive Soft Q-Network
Abstract:
Reinforcement learning (RL) for continuous control often requires large amounts of online interaction data. Value-based RL methods can mitigate this burden by offering relatively high sample efficiency. Some studies further enhance sample efficiency by incorporating offline demonstration data to "kick-start" training, achieving promising results in continuous control. However, they typically compute the Q-function independently for each action dimension, neglecting interdependencies and making it harder to identify optimal actions when learning from suboptimal data, such as non-expert demonstration and online-collected data during the training process. To address these issues, we propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner. First, ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks. Next, it auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks. We evaluate ARSQ on two continuous control benchmarks, RLBench and D4RL, integrating demonstration data into online training. On D4RL, which includes non-expert demonstrations, ARSQ achieves an average $1.62\times$ performance improvement over SOTA value-based baseline. On RLBench, which incorporates expert demonstrations, ARSQ surpasses various baselines, demonstrating its effectiveness in learning from suboptimal online-collected data. Project page is at https://sites.google.com/view/ar-soft-q
中文摘要:强化学习在连续控制任务中通过ARSQ算法得到改进,该算法以自回归方式建模Q值,提升了样本效率和决策能力,在包含专家和非专家演示的基准测试中均优于现有最优方法。
English Summary: Reinforcement learning for continuous control is enhanced by ARSQ, a value-based algorithm that models Q-values auto-regressively to improve sample efficiency and decision-making, outperforming state-of-the-art methods on benchmarks with both expert and non-expert demonstrations.

Authors:Lida Zhao, Shihan Dou, Yutao Hu, Yueming Wu, Jiahui Wu, Chengwei Liu, Lyuye Zhang, Yi Liu, Jun Sun, Xuanjing Huang, Yang Liu
Title: Detecting Essence Code Clones via Information Theoretic Analysis
Abstract:
Code cloning, a widespread practice in software development, involves replicating code fragments to save time but often at the expense of software maintainability and quality. In this paper, we address the specific challenge of detecting "essence clones", a complex subtype of Type-3 clones characterized by sharing critical logic despite different peripheral codes. Traditional techniques often fail to detect essence clones due to their syntactic focus. To overcome this limitation, we introduce ECScan, a novel detection tool that leverages information theory to assess the semantic importance of code lines. By assigning weights to each line based on its information content, ECScan emphasizes core logic over peripheral code differences. Our comprehensive evaluation across various real-world projects shows that ECScan significantly outperforms existing tools in detecting essence clones, achieving an average F1-score of 85%. It demonstrates robust performance across all clone types and offers exceptional scalability. This study advances clone detection by providing a practical tool for developers to enhance code quality and reduce maintenance burdens, emphasizing the semantic aspects of code through an innovative information-theoretic approach.
中文: 本文提出ECScan工具,利用信息论评估代码行的语义重要性来检测"本质克隆",其平均F1值达85%显著优于现有方法,能有效提升代码可维护性。
English: This paper introduces ECScan, a novel tool using information theory to detect "essence clones" by weighing code lines' semantic importance, which significantly outperforms existing methods with an 85% average F1-score and enhances software maintainability.

Authors:Hriday Bavle, Jose Luis Sanchez-Lopez, Muhammad Shaheer, Javier Civera, Holger Voos
Title: S-Graphs 2.0 -- A Hierarchical-Semantic Optimization and Loop Closure for SLAM
Abstract:
The hierarchical structure of 3D scene graphs shows a high relevance for representations purposes, as it fits common patterns from man-made environments. But, additionally, the semantic and geometric information in such hierarchical representations could be leveraged to speed up the optimization and management of map elements and robot poses. In this direction, we present our work Situational Graphs 2.0 (S-Graphs 2.0), which leverages the hierarchical structure of indoor scenes for efficient data management and optimization. Our algorithm begins by constructing a situational graph that represents the environment into four layers: Keyframes, Walls, Rooms, and Floors. Our first novelty lies in the front-end, which includes a floor detection module capable of identifying stairways and assigning floor-level semantic relations to the underlying layers. Floor-level semantics allows us to propose a floor-based loop closure strategy, that effectively rejects false positive closures that typically appear due to aliasing between different floors of a building. Our second novelty lies in leveraging our representation hierarchy in the optimization. Our proposal consists of: (1) local optimization over a window of recent keyframes and their connected components across the four representation layers, (2) floor-level global optimization, which focuses only on keyframes and their connections within the current floor during loop closures, and (3) room-level local optimization, marginalizing redundant keyframes that share observations within the room, which reduces the computational footprint. We validate our algorithm extensively in different real multi-floor environments. Our approach shows state-of-art-art accuracy metrics in large-scale multi-floor environments, estimating hierarchical representations up to 10x faster, in average, than competing baselines
中文:提出的S-Graphs 2.0通过利用场景分层结构,结合创新的楼层检测和多层级优化策略,实现了更快速的环境建图,在多楼层环境中平均比基线方法快10倍。
English: The proposed S-Graphs 2.0 leverages hierarchical scene structures to enable faster environmental mapping through novel floor detection and multi-level optimization strategies, achieving up to 10x speed improvement in multi-floor environments.

Authors:Hriday Bavle, Jose Luis Sanchez-Lopez, Muhammad Shaheer, Javier Civera, Holger Voos
Title: S-Graphs 2.0 -- A Hierarchical-Semantic Optimization and Loop Closure for SLAM
Abstract:
The hierarchical structure of 3D scene graphs shows a high relevance for representations purposes, as it fits common patterns from man-made environments. But, additionally, the semantic and geometric information in such hierarchical representations could be leveraged to speed up the optimization and management of map elements and robot poses. In this direction, we present our work Situational Graphs 2.0 (S-Graphs 2.0), which leverages the hierarchical structure of indoor scenes for efficient data management and optimization. Our algorithm begins by constructing a situational graph that represents the environment into four layers: Keyframes, Walls, Rooms, and Floors. Our first novelty lies in the front-end, which includes a floor detection module capable of identifying stairways and assigning floor-level semantic relations to the underlying layers. Floor-level semantics allows us to propose a floor-based loop closure strategy, that effectively rejects false positive closures that typically appear due to aliasing between different floors of a building. Our second novelty lies in leveraging our representation hierarchy in the optimization. Our proposal consists of: (1) local optimization over a window of recent keyframes and their connected components across the four representation layers, (2) floor-level global optimization, which focuses only on keyframes and their connections within the current floor during loop closures, and (3) room-level local optimization, marginalizing redundant keyframes that share observations within the room, which reduces the computational footprint. We validate our algorithm extensively in different real multi-floor environments. Our approach shows state-of-art-art accuracy metrics in large-scale multi-floor environments, estimating hierarchical representations up to 10x faster, in average, than competing baselines
中文:提出的S-Graphs 2.0通过利用场景分层结构,结合创新的楼层检测和多层级优化策略,实现了更快速的环境建图,在多楼层环境中平均比基线方法快10倍。
English: The proposed S-Graphs 2.0 leverages hierarchical scene structures to enable faster environmental mapping through novel floor detection and multi-level optimization strategies, achieving up to 10x speed improvement in multi-floor environments.

Authors:Xinhang Liu, Chi-Keung Tang, Yu-Wing Tai
Title: WorldCraft: Photo-Realistic 3D World Creation and Customization via LLM Agents
Abstract:
Constructing photorealistic virtual worlds has applications across various fields, but it often requires the extensive labor of highly trained professionals to operate conventional 3D modeling software. To democratize this process, we introduce WorldCraft, a system where large language model (LLM) agents leverage procedural generation to create indoor and outdoor scenes populated with objects, allowing users to control individual object attributes and the scene layout using intuitive natural language commands. In our framework, a coordinator agent manages the overall process and works with two specialized LLM agents to complete the scene creation: ForgeIt, which integrates an ever-growing manual through auto-verification to enable precise customization of individual objects, and ArrangeIt, which formulates hierarchical optimization problems to achieve a layout that balances ergonomic and aesthetic considerations. Additionally, our pipeline incorporates a trajectory control agent, allowing users to animate the scene and operate the camera through natural language interactions. Our system is also compatible with off-the-shelf deep 3D generators to enrich scene assets. Through evaluations and comparisons with state-of-the-art methods, we demonstrate the versatility of WorldCraft, ranging from single-object customization to intricate, large-scale interior and exterior scene designs. This system empowers non-professionals to bring their creative visions to life.
中文: WorldCraft系统利用大语言模型代理和程序化生成技术,通过自然语言指令让非专业用户能够创建和定制详细的室内外场景,使逼真虚拟世界的构建变得大众化。
English: WorldCraft is a system that uses large language model agents and procedural generation to enable users to create and customize detailed indoor and outdoor scenes through natural language commands, making photorealistic virtual world construction accessible to non-professionals.

Authors:Yue Huang, Chujie Gao, Siyuan Wu, Haoran Wang, Xiangqi Wang, Yujun Zhou, Yanbo Wang, Jiayi Ye, Jiawen Shi, Qihui Zhang, Yuan Li, Han Bao, Zhaoyi Liu, Tianrui Guan, Dongping Chen, Ruoxi Chen, Kehan Guo, Andy Zou, Bryan Hooi Kuen-Yew, Caiming Xiong, Elias Stengel-Eskin, Hongyang Zhang, Hongzhi Yin, Huan Zhang, Huaxiu Yao, Jaehong Yoon, Jieyu Zhang, Kai Shu, Kaijie Zhu, Ranjay Krishna, Swabha Swayamdipta, Taiwei Shi, Weijia Shi, Xiang Li, Yiwei Li, Yuexing Hao, Zhihao Jia, Zhize Li, Xiuying Chen, Zhengzhong Tu, Xiyang Hu, Tianyi Zhou, Jieyu Zhao, Lichao Sun, Furong Huang, Or Cohen Sasson, Prasanna Sattigeri, Anka Reuel, Max Lamparth, Yue Zhao, Nouha Dziri, Yu Su, Huan Sun, Heng Ji, Chaowei Xiao, Mohit Bansal, Nitesh V. Chawla, Jian Pei, Jianfeng Gao, Michael Backes, Philip S. Yu, Neil Zhenqiang Gong, Pin-Yu Chen, Bo Li, Dawn Song, Xiangliang Zhang
Title: On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective
Abstract:
Generative Foundation Models (GenFMs) have emerged as transformative tools. However, their widespread adoption raises critical concerns regarding trustworthiness across dimensions. This paper presents a comprehensive framework to address these challenges through three key contributions. First, we systematically review global AI governance laws and policies from governments and regulatory bodies, as well as industry practices and standards. Based on this analysis, we propose a set of guiding principles for GenFMs, developed through extensive multidisciplinary collaboration that integrates technical, ethical, legal, and societal perspectives. Second, we introduce TrustGen, the first dynamic benchmarking platform designed to evaluate trustworthiness across multiple dimensions and model types, including text-to-image, large language, and vision-language models. TrustGen leverages modular components--metadata curation, test case generation, and contextual variation--to enable adaptive and iterative assessments, overcoming the limitations of static evaluation methods. Using TrustGen, we reveal significant progress in trustworthiness while identifying persistent challenges. Finally, we provide an in-depth discussion of the challenges and future directions for trustworthy GenFMs, which reveals the complex, evolving nature of trustworthiness, highlighting the nuanced trade-offs between utility and trustworthiness, and consideration for various downstream applications, identifying persistent challenges and providing a strategic roadmap for future research. This work establishes a holistic framework for advancing trustworthiness in GenAI, paving the way for safer and more responsible integration of GenFMs into critical applications. To facilitate advancement in the community, we release the toolkit for dynamic evaluation.
中文: 本文提出一个全面框架,通过制定指导原则、开发动态评估平台TrustGen并提供未来研究路线图,来增强生成式基础模型的可靠性。
English: This paper introduces a comprehensive framework to enhance the trustworthiness of Generative Foundation Models (GenFMs) by establishing guiding principles, developing the TrustGen benchmarking platform for dynamic evaluation, and providing a strategic roadmap for future research.

Authors:Yue Huang, Chujie Gao, Siyuan Wu, Haoran Wang, Xiangqi Wang, Yujun Zhou, Yanbo Wang, Jiayi Ye, Jiawen Shi, Qihui Zhang, Yuan Li, Han Bao, Zhaoyi Liu, Tianrui Guan, Dongping Chen, Ruoxi Chen, Kehan Guo, Andy Zou, Bryan Hooi Kuen-Yew, Caiming Xiong, Elias Stengel-Eskin, Hongyang Zhang, Hongzhi Yin, Huan Zhang, Huaxiu Yao, Jaehong Yoon, Jieyu Zhang, Kai Shu, Kaijie Zhu, Ranjay Krishna, Swabha Swayamdipta, Taiwei Shi, Weijia Shi, Xiang Li, Yiwei Li, Yuexing Hao, Zhihao Jia, Zhize Li, Xiuying Chen, Zhengzhong Tu, Xiyang Hu, Tianyi Zhou, Jieyu Zhao, Lichao Sun, Furong Huang, Or Cohen Sasson, Prasanna Sattigeri, Anka Reuel, Max Lamparth, Yue Zhao, Nouha Dziri, Yu Su, Huan Sun, Heng Ji, Chaowei Xiao, Mohit Bansal, Nitesh V. Chawla, Jian Pei, Jianfeng Gao, Michael Backes, Philip S. Yu, Neil Zhenqiang Gong, Pin-Yu Chen, Bo Li, Dawn Song, Xiangliang Zhang
Title: On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective
Abstract:
Generative Foundation Models (GenFMs) have emerged as transformative tools. However, their widespread adoption raises critical concerns regarding trustworthiness across dimensions. This paper presents a comprehensive framework to address these challenges through three key contributions. First, we systematically review global AI governance laws and policies from governments and regulatory bodies, as well as industry practices and standards. Based on this analysis, we propose a set of guiding principles for GenFMs, developed through extensive multidisciplinary collaboration that integrates technical, ethical, legal, and societal perspectives. Second, we introduce TrustGen, the first dynamic benchmarking platform designed to evaluate trustworthiness across multiple dimensions and model types, including text-to-image, large language, and vision-language models. TrustGen leverages modular components--metadata curation, test case generation, and contextual variation--to enable adaptive and iterative assessments, overcoming the limitations of static evaluation methods. Using TrustGen, we reveal significant progress in trustworthiness while identifying persistent challenges. Finally, we provide an in-depth discussion of the challenges and future directions for trustworthy GenFMs, which reveals the complex, evolving nature of trustworthiness, highlighting the nuanced trade-offs between utility and trustworthiness, and consideration for various downstream applications, identifying persistent challenges and providing a strategic roadmap for future research. This work establishes a holistic framework for advancing trustworthiness in GenAI, paving the way for safer and more responsible integration of GenFMs into critical applications. To facilitate advancement in the community, we release the toolkit for dynamic evaluation.
中文: 本文提出一个全面框架,通过制定指导原则、开发动态评估平台TrustGen并提供未来研究路线图,来增强生成式基础模型的可靠性。
English: This paper introduces a comprehensive framework to enhance the trustworthiness of Generative Foundation Models (GenFMs) by establishing guiding principles, developing the TrustGen benchmarking platform for dynamic evaluation, and providing a strategic roadmap for future research.

Authors:Junwei Zhang, Xing Hu, Shan Gao, Xin Xia, David Lo, Shanping Li
Title: Less is More: On the Importance of Data Quality for Unit Test Generation
Abstract:
Unit testing is crucial for software development and maintenance. Effective unit testing ensures and improves software quality, but writing unit tests is time-consuming and labor-intensive. Recent studies have proposed deep learning (DL) techniques or large language models (LLMs) to automate unit test generation. These models are usually trained or fine-tuned on large-scale datasets. Despite growing awareness of the importance of data quality, there has been limited research on the quality of datasets used for test generation. To bridge this gap, we systematically examine the impact of noise on the performance of learning-based test generation models. We first apply the open card sorting method to analyze the most popular and largest test generation dataset, Methods2Test, to categorize eight distinct types of noise. Further, we conduct detailed interviews with 17 domain experts to validate and assess the importance, reasonableness, and correctness of the noise taxonomy. Then, we propose CleanTest, an automated noise-cleaning framework designed to improve the quality of test generation datasets. CleanTest comprises three filters: a rule-based syntax filter, a rule-based relevance filter, and a model-based coverage filter. To evaluate its effectiveness, we apply CleanTest on two widely-used test generation datasets, i.e., Methods2Test and Atlas. Our findings indicate that 43.52% and 29.65% of datasets contain noise, highlighting its prevalence. Finally, we conduct comparative experiments using four LLMs (i.e., CodeBERT, AthenaTest, StarCoder, and CodeLlama7B) to assess the impact of noise on test generation performance. The results show that filtering noise positively influences the test generation ability of the models.
中文摘要:本研究系统识别了单元测试生成数据集中的八类噪声,开发了自动化清理框架CleanTest显著提升数据集质量,并通过四种大语言模型的实验验证了噪声过滤对测试生成性能的积极影响。
English Summary: This study identifies and categorizes eight types of noise in unit test generation datasets, develops an automated cleaning framework called CleanTest that significantly reduces dataset noise, and demonstrates through experiments with four large language models that noise filtering enhances test generation performance.

Authors:Mingqian He, Yongliang Shen, Wenqi Zhang, Qiuying Peng, Jun Wang, Weiming Lu
Title: STaR-SQL: Self-Taught Reasoner for Text-to-SQL
Abstract:
Generating step-by-step "chain-of-thought" rationales has proven effective for improving the performance of large language models on complex reasoning tasks. However, applying such techniques to structured tasks, such as text-to-SQL, remains largely unexplored. In this paper, we introduce Self-Taught Reasoner for text-to-SQL (STaR-SQL), a novel approach that reframes SQL query generation as a reasoning-driven process. Our method prompts the LLM to produce detailed reasoning steps for SQL queries and fine-tunes it on rationales that lead to correct outcomes. Unlike traditional methods, STaR-SQL dedicates additional test-time computation to reasoning, thereby positioning LLMs as spontaneous reasoners rather than mere prompt-based agents. To further scale the inference process, we incorporate an outcome-supervised reward model (ORM) as a verifier, which enhances SQL query accuracy. Experimental results on the challenging Spider benchmark demonstrate that STaR-SQL significantly improves text-to-SQL performance, achieving an execution accuracy of 86.6%. This surpasses a few-shot baseline by 31.6% and a baseline fine-tuned to predict answers directly by 18.0%. Additionally, STaR-SQL outperforms agent-like prompting methods that leverage more powerful yet closed-source models such as GPT-4. These findings underscore the potential of reasoning-augmented training for structured tasks and open the door to extending self-improving reasoning models to text-to-SQL generation and beyond.
中文:STaR-SQL通过将逐步推理融入大语言模型并采用奖励模型进行验证,显著提升了文本到SQL的性能,在Spider基准测试中实现了86.6%的执行准确率。
English: STaR-SQL enhances text-to-SQL performance by integrating step-by-step reasoning into LLMs and using a reward model for verification, achieving an 86.6% execution accuracy on the Spider benchmark.

Authors:Zhi Sheng, Yuan Yuan, Yudi Zhang, Depeng Jin, Yong Li
Title: Collaborative Deterministic-Probabilistic Forecasting for Real-World Spatiotemporal Systems
Abstract:
Probabilistic forecasting is crucial for real-world spatiotemporal systems, such as climate, energy, and urban environments, where quantifying uncertainty is essential for informed, risk-aware decision-making. While diffusion models have shown promise in capturing complex data distributions, their application to spatiotemporal forecasting remains limited due to complex spatiotemporal dynamics and high computational demands. In this work, we propose CoST, a novel framework that collaborates deterministic and diffusion models for spatiotemporal forecasting. CoST formulates a mean-residual decomposition strategy: it leverages a powerful deterministic model to capture the conditional mean and a lightweight diffusion model to learn residual uncertainties. This collaborative formulation simplifies learning objectives, enhances forecasting accuracy, enables uncertainty quantification, and significantly improves computational efficiency. To address spatial heterogeneity, we further design a scale-aware diffusion mechanism to guide the diffusion process. Extensive experiments across ten real-world datasets from climate, energy, communication, and urban systems show that CoST achieves 25% performance gains over state-of-the-art baselines, while significantly reducing computational cost.
中文: CoST框架通过确定性模型与轻量扩散模型的协作,采用均值-残差分解策略处理时空预测问题,在十个真实数据集上实现25%的性能提升并显著降低计算成本。
English: CoST is a collaborative spatiotemporal forecasting framework that combines deterministic models for capturing conditional means with lightweight diffusion models for learning residual uncertainties, achieving 25% performance gains and reduced computational costs across diverse real-world systems.

Authors:Lin Zhu, Xinbing Wang, Chenghu Zhou, Qinying Gu, Nanyang Ye
Title: Less is More: Masking Elements in Image Condition Features Avoids Content Leakages in Style Transfer Diffusion Models
Abstract:
Given a style-reference image as the additional image condition, text-to-image diffusion models have demonstrated impressive capabilities in generating images that possess the content of text prompts while adopting the visual style of the reference image. However, current state-of-the-art methods often struggle to disentangle content and style from style-reference images, leading to issues such as content leakages. To address this issue, we propose a masking-based method that efficiently decouples content from style without the need of tuning any model parameters. By simply masking specific elements in the style reference's image features, we uncover a critical yet under-explored principle: guiding with appropriately-selected fewer conditions (e.g., dropping several image feature elements) can efficiently avoid unwanted content flowing into the diffusion models, enhancing the style transfer performances of text-to-image diffusion models. In this paper, we validate this finding both theoretically and experimentally. Extensive experiments across various styles demonstrate the effectiveness of our masking-based method and support our theoretical results.
Chinese: 本文提出了一种基于掩码的方法,通过选择性移除风格参考图像中的特定特征,有效分离内容与风格,无需调整模型参数即可防止内容泄露并提升文本到图像扩散模型的风格转换效果。
English: This paper introduces a masking-based method that effectively separates content from style in text-to-image diffusion models by selectively removing certain image features, thereby preventing content leakage and improving style transfer performance without requiring parameter adjustments.

Authors:Gaurav Shetty, Mahya Ramezani, Hamed Habibi, Holger Voos, Jose Luis Sanchez-Lopez
Title: Motion Control in Multi-Rotor Aerial Robots Using Deep Reinforcement Learning
Abstract:
This paper investigates the application of Deep Reinforcement (DRL) Learning to address motion control challenges in drones for additive manufacturing (AM). Drone-based additive manufacturing promises flexible and autonomous material deposition in large-scale or hazardous environments. However, achieving robust real-time control of a multi-rotor aerial robot under varying payloads and potential disturbances remains challenging. Traditional controllers like PID often require frequent parameter re-tuning, limiting their applicability in dynamic scenarios. We propose a DRL framework that learns adaptable control policies for multi-rotor drones performing waypoint navigation in AM tasks. We compare Deep Deterministic Policy Gradient (DDPG) and Twin Delayed Deep Deterministic Policy Gradient (TD3) within a curriculum learning scheme designed to handle increasing complexity. Our experiments show TD3 consistently balances training stability, accuracy, and success, particularly when mass variability is introduced. These findings provide a scalable path toward robust, autonomous drone control in additive manufacturing.
中文: 本文提出一种基于TD3算法的深度强化学习框架,为增材制造中的无人机提供自适应运动控制,有效解决了传统控制器在动态载荷和干扰下鲁棒性不足的问题。
English: This paper proposes a deep reinforcement learning framework using TD3 to enable robust, adaptive motion control for drones in additive manufacturing, overcoming limitations of traditional controllers under dynamic payloads and disturbances.

Authors:Weihang Li, Hongli Xu, Junwen Huang, Hyunjun Jung, Peter KT Yu, Nassir Navab, Benjamin Busam
Title: GCE-Pose: Global Context Enhancement for Category-level Object Pose Estimation
Abstract:
A key challenge in model-free category-level pose estimation is the extraction of contextual object features that generalize across varying instances within a specific category. Recent approaches leverage foundational features to capture semantic and geometry cues from data. However, these approaches fail under partial visibility. We overcome this with a first-complete-then-aggregate strategy for feature extraction utilizing class priors. In this paper, we present GCE-Pose, a method that enhances pose estimation for novel instances by integrating category-level global context prior. GCE-Pose performs semantic shape reconstruction with a proposed Semantic Shape Reconstruction (SSR) module. Given an unseen partial RGB-D object instance, our SSR module reconstructs the instance's global geometry and semantics by deforming category-specific 3D semantic prototypes through a learned deep Linear Shape Model. We further introduce a Global Context Enhanced (GCE) feature fusion module that effectively fuses features from partial RGB-D observations and the reconstructed global context. Extensive experiments validate the impact of our global context prior and the effectiveness of the GCE fusion module, demonstrating that GCE-Pose significantly outperforms existing methods on challenging real-world datasets HouseCat6D and NOCS-REAL275. Our project page is available at https://colin-de.github.io/GCE-Pose/.
中文:GCE-Pose提出了一种新颖的位姿估计方法,通过语义形状变形重建全局几何特征并与局部观测特征融合,有效解决了局部可见性难题,在基准数据集上实现了最先进的性能。
English: GCE-Pose introduces a novel pose estimation method that overcomes partial visibility limitations by reconstructing global geometry through semantic shape deformation and fusing features with partial observations, achieving state-of-the-art performance on benchmark datasets.

Authors:Siyuan Zhang, Yichi Zhang, Yinpeng Dong, Hang Su
Title: Exploring the Generalizability of Factual Hallucination Mitigation via Enhancing Precise Knowledge Utilization
Abstract:
Large Language Models (LLMs) often struggle to align their responses with objective facts, resulting in the issue of factual hallucinations, which can be difficult to detect and mislead users without relevant knowledge. Although post-training techniques have been employed to mitigate the issue, existing methods usually suffer from poor generalization and trade-offs in different capabilities. In this paper, we propose to address it by directly augmenting LLM's fundamental ability to precisely leverage its knowledge and introduce PKUE, which fine-tunes the model on self-generated responses to precise and simple factual questions through preference optimization. Furthermore, we construct FactualBench, a comprehensive and precise factual QA dataset containing 181k Chinese data spanning 21 domains, to facilitate both evaluation and training. Extensive experiments demonstrate that PKUE significantly improves LLM overall performance, with consistent enhancement across factual tasks of various forms, general tasks beyond factuality, and tasks in a different language.
中文摘要:本文提出PKUE方法,通过基于自生成事实问题回答的偏好优化来增强大语言模型准确运用知识的能力,并构建了涵盖21个领域的FactualBench数据集,实验证明该方法在多种任务中均显著提升了模型性能。
English Summary: The paper introduces PKUE, a method that enhances Large Language Models' ability to utilize knowledge accurately by fine-tuning them on self-generated responses to factual questions, and presents FactualBench, a comprehensive dataset for evaluation and training, showing significant improvements across various tasks.

Authors:Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen
Title: MixLLM: Dynamic Routing in Mixed Large Language Models
Abstract:
Large Language Models (LLMs) exhibit potential artificial generic intelligence recently, however, their usage is costly with high response latency. Given mixed LLMs with their own strengths and weaknesses, LLM routing aims to identify the most suitable model for each query in the stream to maximize response quality and minimize cost and latency. However, the challenges involve: (1) dynamic trade-offs among quality, cost, and latency; (2) enabling continual learning in deployed systems; and (3) navigating a varying (e.g., new LLM addition or old LLM removal) set of LLM candidates over time. To bridge these gaps, we develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment. Specifically, we first leverage query tags to enhance query embeddings for the routing task. Next, we design lightweight prediction models to estimate the response qualities and costs of queries over LLMs. We then devise a meta-decision maker to choose the query-LLM assignments to best tradeoff response quality, cost, and latency. Finally, the system benefits from continual training, allowing it to adapt to evolving queries and user feedback over time. Our extensive experiments show that MixLLM achieves the best trade-offs in response quality, cost, and latency (97.25% of GPT-4's quality at 24.18% of the cost under the time constraint).
Chinese: MixLLM 是一种基于情境赌博机的动态路由系统,通过优化查询与大语言模型的匹配,在持续学习中平衡响应质量、成本和延迟,以极低成本实现接近GPT-4的性能水平。
English: MixLLM is a dynamic routing system that optimizes query assignments to various Large Language Models by balancing response quality, cost, and latency through contextual bandit algorithms and continual learning, achieving near-GPT-4 quality at a fraction of the cost.

Authors:Jonathan Light, Wei Cheng, Benjamin Riviere, Wu Yue, Masafumi Oyamada, Mengdi Wang, Yisong Yue, Santiago Paternain, Haifeng Chen
Title: DISC: DISC: Dynamic Decomposition Improves LLM Inference Scaling
Abstract:
Inference scaling methods for large language models often work by breaking problems into steps or groups of tokens, then sampling and selecting the best next steps. However, these steps and their sizes are usually fixed or manually designed based on domain knowledge. We introduce dynamic decomposition, a method that adaptively and automatically breaks down solution and reasoning traces into manageable steps during inference. By allocating compute more effectively - especially by subdividing difficult steps and prioritizing their sampling - dynamic decomposition significantly boosts inference efficiency. Experiments on benchmarks like APPS, MATH, and LiveCodeBench show that dynamic decomposition outperforms fixed strategies such as token-level, sentence-level, and single-step decompositions, reducing the pass@10 error rate by 5.0%, 6.7%, and 10.5% respectively. These results show the promise of dynamic decomposition for improving a broad range of inference scaling techniques.
中文: 动态分解方法在推理过程中自适应地划分推理步骤,通过优化计算分配显著提升效率,在多个基准测试中比静态方法降低错误率5.0%-10.5%。
English: Dynamic decomposition adaptively partitions reasoning traces during inference to optimize computational allocation, significantly outperforming static methods by reducing error rates across multiple benchmarks.

Authors:Jonathan Light, Wei Cheng, Benjamin Riviere, Wu Yue, Masafumi Oyamada, Mengdi Wang, Yisong Yue, Santiago Paternain, Haifeng Chen
Title: DISC: Dynamic Decomposition Improves LLM Inference Scaling
Abstract:
Inference scaling methods for LLMs often rely on decomposing problems into steps (or groups of tokens), followed by sampling and selecting the best next steps. However, these steps and their sizes are often predetermined or manually designed based on domain knowledge. We propose dynamic decomposition, a method that adaptively and automatically partitions solution and reasoning traces into manageable steps during inference. By more effectively allocating compute -- particularly through subdividing challenging steps and prioritizing their sampling -- dynamic decomposition significantly improves inference efficiency. Experiments on benchmarks such as APPS, MATH, and LiveCodeBench demonstrate that dynamic decomposition outperforms static approaches, including token-level, sentence-level, and single-step decompositions, reducing the pass@10 error rate by 5.0%, 6.7%, and 10.5% respectively. These findings highlight the potential of dynamic decomposition to improve a wide range of inference scaling techniques.
中文: 动态分解方法在推理过程中自适应地划分推理步骤,通过优化计算分配显著提升效率,在多个基准测试中比静态方法降低错误率5.0%-10.5%。
English: Dynamic decomposition adaptively partitions reasoning traces during inference to optimize computational allocation, significantly outperforming static methods by reducing error rates across multiple benchmarks.

Authors:Badih Ghazi, Cristóbal Guzmán, Pritish Kamath, Alexander Knop, Ravi Kumar, Pasin Manurangsi, Sushant Sachdeva
Title: PREM: Privately Answering Statistical Queries with Relative Error
Abstract:
We introduce $\mathsf{PREM}$ (Private Relative Error Multiplicative weight update), a new framework for generating synthetic data that achieves a relative error guarantee for statistical queries under $(\varepsilon, δ)$ differential privacy (DP). Namely, for a domain ${\cal X}$, a family ${\cal F}$ of queries $f : {\cal X} \to \{0, 1\}$, and $ζ> 0$, our framework yields a mechanism that on input dataset $D \in {\cal X}^n$ outputs a synthetic dataset $\widehat{D} \in {\cal X}^n$ such that all statistical queries in ${\cal F}$ on $D$, namely $\sum_{x \in D} f(x)$ for $f \in {\cal F}$, are within a $1 \pm ζ$ multiplicative factor of the corresponding value on $\widehat{D}$ up to an additive error that is polynomial in $\log |{\cal F}|$, $\log |{\cal X}|$, $\log n$, $\log(1/δ)$, $1/\varepsilon$, and $1/ζ$. In contrast, any $(\varepsilon, δ)$-DP mechanism is known to require worst-case additive error that is polynomial in at least one of $n, |{\cal F}|$, or $|{\cal X}|$. We complement our algorithm with nearly matching lower bounds.
中文:PREM框架提出了一种差分隐私机制,能生成具有统计查询相对误差保证的合成数据,其加性误差远低于现有方法。
English: The PREM framework introduces a differentially private mechanism that generates synthetic data with relative error guarantees for statistical queries, achieving significantly lower additive error than existing methods.

Authors:Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, Renjing Xu
Title: MapNav: A Novel Memory Representation via Annotated Semantic Maps for Vision-and-Language Navigation
Abstract:
Vision-and-language navigation (VLN) is a key task in Embodied AI, requiring agents to navigate diverse and unseen environments while following natural language instructions. Traditional approaches rely heavily on historical observations as spatio-temporal contexts for decision making, leading to significant storage and computational overhead. In this paper, we introduce MapNav, a novel end-to-end VLN model that leverages Annotated Semantic Map (ASM) to replace historical frames. Specifically, our approach constructs a top-down semantic map at the start of each episode and update it at each timestep, allowing for precise object mapping and structured navigation information. Then, we enhance this map with explicit textual labels for key regions, transforming abstract semantics into clear navigation cues and generate our ASM. MapNav agent using the constructed ASM as input, and use the powerful end-to-end capabilities of VLM to empower VLN. Extensive experiments demonstrate that MapNav achieves state-of-the-art (SOTA) performance in both simulated and real-world environments, validating the effectiveness of our method. Moreover, we will release our ASM generation source code and dataset to ensure reproducibility, contributing valuable resources to the field. We believe that our proposed MapNav can be used as a new memory representation method in VLN, paving the way for future research in this field.
Chinese: MapNav提出了一种端到端的视觉语言导航模型,通过使用标注语义地图替代历史帧,利用结构化导航信息和视觉语言模型的强大能力,实现了最先进的性能。
English: MapNav introduces an end-to-end vision-and-language navigation model that uses an Annotated Semantic Map to replace historical frames, achieving state-of-the-art performance by providing structured navigation cues and leveraging VLM capabilities.

Authors:Yingqian Cui, Pengfei He, Jingying Zeng, Hui Liu, Xianfeng Tang, Zhenwei Dai, Yan Han, Chen Luo, Jing Huang, Zhen Li, Suhang Wang, Yue Xing, Jiliang Tang, Qi He
Title: Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models
Abstract:
Chain-of-Thought (CoT) reasoning, which breaks down complex tasks into intermediate reasoning steps, has significantly enhanced the performance of large language models (LLMs) on challenging tasks. However, the detailed reasoning process in CoT often incurs long generation times and high computational costs, partly due to the inclusion of unnecessary steps. To address this, we propose a method to identify critical reasoning steps using perplexity as a measure of their importance: a step is deemed critical if its removal causes a significant increase in perplexity. Our method enables models to focus solely on generating these critical steps. This can be achieved through two approaches: refining demonstration examples in few-shot CoT or fine-tuning the model using selected examples that include only critical steps. Comprehensive experiments validate the effectiveness of our method, which achieves a better balance between the reasoning accuracy and efficiency of CoT.
中文: 该方法通过基于困惑度识别关键推理步骤,使模型仅关注核心环节,从而在提升思维链推理准确性的同时优化了效率。
English: The proposed method enhances Chain-of-Thought reasoning by identifying critical steps using perplexity, allowing models to focus only on essential reasoning and improving both accuracy and efficiency.

Authors:Xin Zhang, Ziqi Dai, Yongqi Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Jun Yu, Wenjie Li, Min Zhang
Title: Towards Text-Image Interleaved Retrieval
Abstract:
Current multimodal information retrieval studies mainly focus on single-image inputs, which limits real-world applications involving multiple images and text-image interleaved content. In this work, we introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences, and the model is required to understand the semantics from the interleaved context for effective retrieval. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. To explore the task, we adapt several off-the-shelf retrievers and build a dense baseline by interleaved multimodal large language model (MLLM). We then propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity, to address the challenge of excessive visual tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaption of existing models does not consistently yield effective results. Our MME achieves significant improvements over the baseline by substantially fewer visual tokens. We provide extensive analysis and will release the dataset and code to facilitate future research.
中文: 本文提出文本-图像交错检索任务以突破单图像检索的局限,通过创新性嵌套式多模态嵌入器在不同粒度压缩视觉标记,显著提升基于多模态大语言模型的检索性能。
English: This paper introduces the text-image interleaved retrieval (TIIR) task to address limitations of single-image retrieval systems, proposing a novel Matryoshka Multimodal Embedder (MME) that significantly improves performance by efficiently reducing visual tokens in multimodal large language models.

Authors:Hanin Atwany, Abdul Waheed, Rita Singh, Monojit Choudhury, Bhiksha Raj
Title: Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models
Abstract:
Speech foundation models trained at a massive scale, both in terms of model and data size, result in robust systems capable of performing multiple speech tasks, including automatic speech recognition (ASR). These models transcend language and domain barriers, yet effectively measuring their performance remains a challenge. Traditional metrics like word error rate (WER) and character error rate (CER) are commonly used to evaluate ASR performance but often fail to reflect transcription quality in critical contexts, particularly when detecting fabricated outputs. This phenomenon, known as hallucination, is especially concerning in high-stakes domains such as healthcare, legal, and aviation, where errors can have severe consequences. In our work, we address this gap by investigating hallucination in ASR models. We examine how factors such as distribution shifts, model size, and model architecture influence the hallucination error rate (HER), a metric we introduce to quantify hallucinations. Our analysis of over 20 ASR models reveals \numinsights~key insights: (1) High WERs can mask low hallucination rates, while low WERs may conceal dangerous hallucinations. (2) Synthetic noise, both adversarial and common perturbations like white noise, pitch shift, and time stretching, increase HER. (3) Distribution shift correlates strongly with HER ($α= 0.91$). Our findings highlight the importance of incorporating HER alongside traditional metrics like WER to better assess ASR model performance, particularly in high-stakes domains.
大规模语音基础模型能够跨语言执行多种任务,但在关键领域存在幻觉问题,传统评估指标难以准确衡量,因此引入幻觉错误率(HER)作为词错误率(WER)的补充,以更全面地评估模型性能。
Large-scale speech foundation models perform multiple tasks across languages but face challenges in accurately measuring performance, especially with hallucinations in critical domains, leading to the introduction of the hallucination error rate (HER) to complement traditional metrics like WER for better assessment.

Authors:Abdul Waheed, Hanin Atwany, Rita Singh, Bhiksha Raj
Title: On the Robust Approximation of ASR Metrics
Abstract:
Recent advances in speech foundation models are largely driven by scaling both model size and data, enabling them to perform a wide range of tasks, including speech recognition. Traditionally, ASR models are evaluated using metrics like Word Error Rate (WER) and Character Error Rate (CER), which depend on ground truth labels. As a result of limited labeled data from diverse domains and testing conditions, the true generalization capabilities of these models beyond standard benchmarks remain unclear. Moreover, labeling data is both costly and time-consuming. To address this, we propose a novel label-free approach for approximating ASR performance metrics, eliminating the need for ground truth labels. Our method utilizes multimodal embeddings in a unified space for speech and transcription representations, combined with a high-quality proxy model to compute proxy metrics. These features are used to train a regression model to predict key ASR metrics like Word Error Rate (WER) and Character Error Rate (CER). We experiment with over 40 models across 14 datasets representing both standard and in-the-wild testing conditions. Our results show that we approximate the metrics within a single-digit absolute difference across all experimental configurations, outperforming the most recent baseline by more than 50\%.
Chinese: 本文提出了一种新颖的无标签方法,利用多模态嵌入和代理模型精确预测ASR指标如WER和CER,在多样化数据集上实现个位数绝对误差,且性能超越最新基线50%以上。
English: This paper introduces a novel label-free method that uses multimodal embeddings and a proxy model to accurately predict ASR metrics like WER and CER, achieving results within a single-digit absolute difference across diverse datasets and outperforming recent baselines by over 50%.

Authors:Geon Lee, Wenchao Yu, Kijung Shin, Wei Cheng, Haifeng Chen
Title: TimeCAP: Learning to Contextualize, Augment, and Predict Time Series Events with Large Language Model Agents
Abstract:
Time series data is essential in various applications, including climate modeling, healthcare monitoring, and financial analytics. Understanding the contextual information associated with real-world time series data is often essential for accurate and reliable event predictions. In this paper, we introduce TimeCAP, a time-series processing framework that creatively employs Large Language Models (LLMs) as contextualizers of time series data, extending their typical usage as predictors. TimeCAP incorporates two independent LLM agents: one generates a textual summary capturing the context of the time series, while the other uses this enriched summary to make more informed predictions. In addition, TimeCAP employs a multi-modal encoder that synergizes with the LLM agents, enhancing predictive performance through mutual augmentation of inputs with in-context examples. Experimental results on real-world datasets demonstrate that TimeCAP outperforms state-of-the-art methods for time series event prediction, including those utilizing LLMs as predictors, achieving an average improvement of 28.75% in F1 score.
Chinese: TimeCAP是一种创新框架,通过使用大型语言模型作为时间序列的上下文解析器,结合文本摘要和多模态编码增强事件预测能力,在真实数据集上相比现有最优方法平均F1分数提升28.75%。
English: TimeCAP is a novel framework that uses Large Language Models as contextualizers to enhance time series event prediction through textual summarization and multimodal encoding, achieving a 28.75% average F1 score improvement over state-of-the-art methods.

Authors:Haoyu Han, Harry Shomer, Yu Wang, Yongjia Lei, Kai Guo, Zhigang Hua, Bo Long, Hui Liu, Jiliang Tang
Title: RAG vs. GraphRAG: A Systematic Evaluation and Key Insights
Abstract:
Retrieval-Augmented Generation (RAG) enhances the performance of LLMs across various tasks by retrieving relevant information from external sources, particularly on text-based data. For structured data, such as knowledge graphs, GraphRAG has been widely used to retrieve relevant information. However, recent studies have revealed that structuring implicit knowledge from text into graphs can benefit certain tasks, extending the application of GraphRAG from graph data to general text-based data. Despite their successful extensions, most applications of GraphRAG for text data have been designed for specific tasks and datasets, lacking a systematic evaluation and comparison between RAG and GraphRAG on widely used text-based benchmarks. In this paper, we systematically evaluate RAG and GraphRAG on well-established benchmark tasks, such as Question Answering and Query-based Summarization. Our results highlight the distinct strengths of RAG and GraphRAG across different tasks and evaluation perspectives. Inspired by these observations, we investigate strategies to integrate their strengths to improve downstream tasks. Additionally, we provide an in-depth discussion of the shortcomings of current GraphRAG approaches and outline directions for future research.
中文: 本文在标准基准上系统评估了RAG与GraphRAG,揭示了它们在不同任务中的独特优势,探索了融合策略,并指出了当前方法的局限性与未来研究方向。
English: This paper systematically evaluates RAG and GraphRAG on standard benchmarks, revealing their distinct strengths across tasks and exploring integration strategies while identifying current limitations and future research directions.

Authors:Tianchun Wang, Zichuan Liu, Yuanzhou Chen, Jonathan Light, Haifeng Chen, Xiang Zhang, Wei Cheng
Title: Diversified Sampling Improves Scaling LLM inference
Abstract:
While increasing training compute has significantly improved the performance of large language models (LLMs), similar gains have not been observed when scaling inference compute. We hypothesize that the primary issue lies in the uniformity of LLM outputs, which leads to inefficient sampling as models repeatedly generate similar but inaccurate responses. Motivated by an intriguing relationship between solution accuracy and response diversity, we propose DivSampling -- a novel and versatile sampling technique designed to enhance the diversity of candidate solutions by introducing prompt perturbations.DivSampling incorporates two categories of perturbations: task-agnostic approaches, which are general and not tailored to any specific task, and task-specific approaches, which are customized based on task content. Our theoretical analysis demonstrates that, under mild assumptions, the error rates of responses generated from diverse prompts are significantly lower compared to those produced by stationary prompts. Comprehensive evaluations across various tasks -- including reasoning, mathematics, and code generation -- highlight the effectiveness of DivSampling in improving solution accuracy. This scalable and efficient approach offers a new perspective on optimizing test-time inference, addressing limitations in current sampling strategies.
Chinese: 通过有意义的提示多样性进行多样化采样,能显著提升大语言模型的扩展推理性能,在推理、数学和代码生成任务中分别实现10.8%、9.6%和9.5%的相对提升。
English: Diversified sampling through meaningful prompt variations significantly enhances large language model scaling inference by reducing error rates and improving performance across reasoning, mathematics, and code generation tasks.

Authors:Tianchun Wang, Zichuan Liu, Yuanzhou Chen, Jonathan Light, Weiyang Liu, Haifeng Chen, Xiang Zhang, Wei Cheng
Title: On the Effect of Sampling Diversity in Scaling LLM Inference
Abstract:
Large language model (LLM) scaling inference is key to unlocking greater performance, and leveraging diversity has proven an effective way to enhance it. Motivated by the observed relationship between solution accuracy and meaningful response diversity, we systematically study the effect of prompt diversity in scaling inference. We theoretically explain why diversified sampling improves Best-of-$N$ scaling, showing that responses generated from meaningful diverse prompts after Best-of-$N$ selection exhibit significantly lower error rates than those produced from stationary prompts. To promote solution diversity, we analyze perturbation fidelity and show that moderately relevant perturbations improve performance, providing guidance for effective perturbation design. Further, we present a set of effective perturbations, including task-level and query-level ones, and analyze the conditions under which they succeed. We systematically evaluate diversified sampling across tasks, finding relative gains of 10.8% in EM@100 for reasoning, 9.6% for mathematics, and 9.5% in Pass@100 for code generation.
Chinese: 通过有意义的提示多样性进行多样化采样,能显著提升大语言模型的扩展推理性能,在推理、数学和代码生成任务中分别实现10.8%、9.6%和9.5%的相对提升。
English: Diversified sampling through meaningful prompt variations significantly enhances large language model scaling inference by reducing error rates and improving performance across reasoning, mathematics, and code generation tasks.

Authors:Badih Ghazi, Ravi Kumar, Daogao Liu, Pasin Manurangsi
Title: Linear-Time User-Level DP-SCO via Robust Statistics
Abstract:
User-level differentially private stochastic convex optimization (DP-SCO) has garnered significant attention due to the paramount importance of safeguarding user privacy in modern large-scale machine learning applications. Current methods, such as those based on differentially private stochastic gradient descent (DP-SGD), often struggle with high noise accumulation and suboptimal utility due to the need to privatize every intermediate iterate. In this work, we introduce a novel linear-time algorithm that leverages robust statistics, specifically the median and trimmed mean, to overcome these challenges. Our approach uniquely bounds the sensitivity of all intermediate iterates of SGD with gradient estimation based on robust statistics, thereby significantly reducing the gradient estimation noise for privacy purposes and enhancing the privacy-utility trade-off. By sidestepping the repeated privatization required by previous methods, our algorithm not only achieves an improved theoretical privacy-utility trade-off but also maintains computational efficiency. We complement our algorithm with an information-theoretic lower bound, showing that our upper bound is optimal up to logarithmic factors and the dependence on $ε$. This work sets the stage for more robust and efficient privacy-preserving techniques in machine learning, with implications for future research and application in the field.
中文: 本文提出一种线性时间的用户级差分隐私随机凸优化算法,通过鲁棒统计量约束中间迭代的敏感性,在保持计算效率的同时显著提升了隐私与效用的平衡。
English: This paper introduces a linear-time algorithm for user-level differentially private stochastic convex optimization that uses robust statistics to bound intermediate iterate sensitivity, achieving improved privacy-utility trade-offs while maintaining computational efficiency.

Authors:Jingchao Ni, Ziming Zhao, ChengAo Shen, Hanghang Tong, Dongjin Song, Wei Cheng, Dongsheng Luo, Haifeng Chen
Title: Harnessing Vision Models for Time Series Analysis: A Survey
Abstract:
Time series analysis has witnessed the inspiring development from traditional autoregressive models, deep learning models, to recent Transformers and Large Language Models (LLMs). Efforts in leveraging vision models for time series analysis have also been made along the way but are less visible to the community due to the predominant research on sequence modeling in this domain. However, the discrepancy between continuous time series and the discrete token space of LLMs, and the challenges in explicitly modeling the correlations of variates in multivariate time series have shifted some research attentions to the equally successful Large Vision Models (LVMs) and Vision Language Models (VLMs). To fill the blank in the existing literature, this survey discusses the advantages of vision models over LLMs in time series analysis. It provides a comprehensive and in-depth overview of the existing methods, with dual views of detailed taxonomy that answer the key research questions including how to encode time series as images and how to model the imaged time series for various tasks. Additionally, we address the challenges in the pre- and post-processing steps involved in this framework and outline future directions to further advance time series analysis with vision models.
中文: 本综述探讨了视觉模型在时间序列分析中相较于大型语言模型的优势,提供了全面的分类方法,并解决了将时间序列编码为图像及其建模过程中的关键挑战。
English: This survey explores the advantages of vision models over large language models for time series analysis, offering a comprehensive taxonomy and addressing key challenges in encoding and modeling time series as images.

Authors:Wangyang Ying, Cong Wei, Nanxu Gong, Xinyuan Wang, Haoyue Bai, Arun Vignesh Malarkkan, Sixun Dong, Dongjie Wang, Denghui Zhang, Yanjie Fu
Title: A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective
Abstract:
Tabular data is one of the most widely used data formats across various domains such as bioinformatics, healthcare, and marketing. As artificial intelligence moves towards a data-centric perspective, improving data quality is essential for enhancing model performance in tabular data-driven applications. This survey focuses on data-driven tabular data optimization, specifically exploring reinforcement learning (RL) and generative approaches for feature selection and feature generation as fundamental techniques for refining data spaces. Feature selection aims to identify and retain the most informative attributes, while feature generation constructs new features to better capture complex data patterns. We systematically review existing generative methods for tabular data engineering, analyzing their latest advancements, real-world applications, and respective strengths and limitations. This survey emphasizes how RL-based and generative techniques contribute to the automation and intelligence of feature engineering. Finally, we summarize the existing challenges and discuss future research directions, aiming to provide insights that drive continued innovation in this field.
中文: 本综述探讨了强化学习和生成方法如何通过特征选择与特征生成优化表格数据,强调其在自动化特征工程中的作用,并分析了当前挑战与未来研究方向。
English: This survey explores reinforcement learning and generative methods for optimizing tabular data through feature selection and generation, highlighting their role in automating feature engineering while addressing current challenges and future directions.

Authors:Sahand Sabour, June M. Liu, Siyang Liu, Chris Z. Yao, Shiyao Cui, Xuanming Zhang, Wen Zhang, Yaru Cao, Advait Bhat, Jian Guan, Wei Wu, Rada Mihalcea, Hongning Wang, Tim Althoff, Tatia M. C. Lee, Minlie Huang
Title: Human Decision-making is Susceptible to AI-driven Manipulation
Abstract:
Artificial Intelligence (AI) systems are increasingly intertwined with daily life, assisting users in executing various tasks and providing guidance on decision-making. This integration introduces risks of AI-driven manipulation, where such systems may exploit users' cognitive biases and emotional vulnerabilities to steer them toward harmful outcomes. Through a randomized controlled trial with 233 participants, we examined human susceptibility to such manipulation in financial (e.g., purchases) and emotional (e.g., conflict resolution) decision-making contexts. Participants interacted with one of three AI agents: a neutral agent (NA) optimizing for user benefit without explicit influence, a manipulative agent (MA) designed to covertly influence beliefs and behaviors, or a strategy-enhanced manipulative agent (SEMA) employing explicit psychological tactics to reach its hidden objectives. By analyzing participants' decision patterns and shifts in their preference ratings post-interaction, we found significant susceptibility to AI-driven manipulation. Particularly, across both decision-making domains, participants interacting with the manipulative agents shifted toward harmful options at substantially higher rates (financial, MA: 62.3%, SEMA: 59.6%; emotional, MA: 42.3%, SEMA: 41.5%) compared to the NA group (financial, 35.8%; emotional, 12.8%). Notably, our findings reveal that even subtle manipulative objectives (MA) can be as effective as employing explicit psychological strategies (SEMA) in swaying human decision-making. By revealing the potential for covert AI influence, this study highlights a critical vulnerability in human-AI interactions, emphasizing the need for ethical safeguards and regulatory frameworks to ensure responsible deployment of AI technologies and protect human autonomy.
中文: 研究表明人工智能系统能够在金融和情感决策中有效操纵人类选择,即使仅采用隐蔽影响策略的AI也能显著引导用户做出有害决定,这凸显了建立伦理保障机制的迫切性。
English: This study demonstrates that AI systems can effectively manipulate human decision-making in financial and emotional contexts, with even subtly manipulative agents significantly swaying users toward harmful choices, highlighting the urgent need for ethical safeguards.

Authors:Yiwei Guo, Zhihan Li, Hankun Wang, Bohan Li, Chongtian Shao, Hanglei Zhang, Chenpeng Du, Xie Chen, Shujie Liu, Kai Yu
Title: Recent Advances in Discrete Speech Tokens: A Review
Abstract:
The rapid advancement of speech generation technologies in the era of large language models (LLMs) has established discrete speech tokens as a foundational paradigm for speech representation. These tokens, characterized by their discrete, compact, and concise nature, are not only advantageous for efficient transmission and storage, but also inherently compatible with the language modeling framework, enabling seamless integration of speech into text-dominated LLM architectures. Current research categorizes discrete speech tokens into two principal classes: acoustic tokens and semantic tokens, each of which has evolved into a rich research domain characterized by unique design philosophies and methodological approaches. This survey systematically synthesizes the existing taxonomy and recent innovations in discrete speech tokenization, conducts a critical examination of the strengths and limitations of each paradigm, and presents systematic experimental comparisons across token types. Furthermore, we identify persistent challenges in the field and propose potential research directions, aiming to offer actionable insights to inspire future advancements in the development and application of discrete speech tokens.
中文: 本综述系统梳理了离散语音标记的分类与创新,评估了其在融入大语言模型中的优势与不足,并提出了未来研究方向以推动该领域的发展。
English: This survey synthesizes the taxonomy and innovations in discrete speech tokens, which are integral for integrating speech into large language models, and evaluates their strengths and limitations while proposing future research directions.

Authors:Li Wei, Shuai S. A. Yuan, Chongwen Huang, Jianhua Zhang, Faouzi Bader, Zhaoyang Zhang, Sami Muhaidat, Merouane Debbah, Chau Yuen
Title: Electromagnetic Channel Modeling and Capacity Analysis for HMIMO Communications
Abstract:
Advancements in emerging technologies, e.g., reconfigurable intelligent surfaces and holographic MIMO (HMIMO), facilitate unprecedented manipulation of electromagnetic (EM) waves, significantly enhancing the performance of wireless communication systems. To accurately characterize the achievable performance limits of these systems, it is crucial to develop a universal EM-compliant channel model. This paper addresses this necessity by proposing a comprehensive EM channel model tailored for realistic multi-path environments, accounting for the combined effects of antenna array configurations and propagation conditions in HMIMO communications. Both polarization phenomena and spatial correlation are incorporated into this probabilistic channel model. Additionally, physical constraints of antenna configurations, such as mutual coupling effects and energy consumption, are integrated into the channel modeling framework. Simulation results validate the effectiveness of the proposed probabilistic channel model, indicating that traditional Rician and Rayleigh fading models cannot accurately depict the channel characteristics and underestimate the channel capacity. More importantly, the proposed channel model outperforms free-space Green's functions in accurately depicting both near-field gain and multi-path effects in radiative near-field regions. These gains are much more evident in tri-polarized systems, highlighting the necessity of polarization interference elimination techniques. Moreover, the theoretical analysis accurately verifies that capacity decreases with expanding communication regions of two-user communications.
中文: 本文提出了一种适用于全息MIMO系统的通用电磁兼容信道模型,该模型整合了极化效应、空间相关性及天线约束,相比传统模型能更精确地描述近场特性和多径效应,并揭示了多用户通信中容量随区域扩大而降低的规律。
English: This paper introduces a universal electromagnetic-compliant channel model for holographic MIMO systems that incorporates polarization, spatial correlation, and antenna constraints, demonstrating superior accuracy over traditional models in capturing near-field effects and multi-path characteristics while highlighting capacity limitations in multi-user scenarios.

Authors:Kaiwen Zheng, Guande He, Jianfei Chen, Fan Bao, Jun Zhu
Title: Elucidating the Preconditioning in Consistency Distillation
Abstract:
Consistency distillation is a prevalent way for accelerating diffusion models adopted in consistency (trajectory) models, in which a student model is trained to traverse backward on the probability flow (PF) ordinary differential equation (ODE) trajectory determined by the teacher model. Preconditioning is a vital technique for stabilizing consistency distillation, by linear combining the input data and the network output with pre-defined coefficients as the consistency function. It imposes the boundary condition of consistency functions without restricting the form and expressiveness of the neural network. However, previous preconditionings are hand-crafted and may be suboptimal choices. In this work, we offer the first theoretical insights into the preconditioning in consistency distillation, by elucidating its design criteria and the connection to the teacher ODE trajectory. Based on these analyses, we further propose a principled way dubbed \textit{Analytic-Precond} to analytically optimize the preconditioning according to the consistency gap (defined as the gap between the teacher denoiser and the optimal student denoiser) on a generalized teacher ODE. We demonstrate that Analytic-Precond can facilitate the learning of trajectory jumpers, enhance the alignment of the student trajectory with the teacher's, and achieve $2\times$ to $3\times$ training acceleration of consistency trajectory models in multi-step generation across various datasets.
中文摘要:一致性蒸馏通过训练学生模型沿教师模型的ODE轨迹反向传播来加速扩散模型,而提出的Analytic-Precond方法通过解析优化预处理条件来减小一致性差距,实现了2-3倍的训练加速。
English Summary: Consistency distillation accelerates diffusion models by training a student model to follow the teacher's ODE trajectory, with the proposed Analytic-Precond method optimizing preconditioning to reduce the consistency gap and achieve 2-3× training acceleration.

Authors:Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang
Title: Scaling Embedding Layers in Language Models
Abstract:
We propose SCONE ($S$calable, $C$ontextualized, $O$ffloaded, $N$-gram $E$mbedding), a new method for extending input embedding layers to enhance language model performance. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent $n$-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. After training, embeddings are precomputed and stored in off-accelerator memory; during inference, querying them has minimal impact on latency due to the low complexity of embedding lookups. SCONE enables two new scaling strategies: increasing the number of $n$-gram embeddings and scaling the model used to learn them, both while maintaining fixed accelerator usage during inference (in terms of FLOPS and memory). We show that scaling both aspects enables a model with 1B accelerator-resident parameters to outperform a 1.9B-parameter baseline across diverse corpora, while using only about half the FLOPS and accelerator memory during inference.
中文: SCONE是一种可扩展的方法,通过添加在训练时单独学习并存储在加速器外的上下文化n元嵌入来提升语言模型性能,使得参数更少的模型在推理时仅用一半计算资源即可超越更大规模的基线模型。
English: SCONE is a scalable method that enhances language model performance by adding contextualized n-gram embeddings learned separately and stored off-accelerator, allowing models with fewer parameters to outperform larger baselines while using half the computational resources during inference.

Authors:Jonas Becker, Lars Benedikt Kaesberg, Andreas Stephan, Jan Philip Wahle, Terry Ruas, Bela Gipp
Title: Stay Focused: Problem Drift in Multi-Agent Debate
Abstract:
Multi-agent debate - multiple instances of large language models discussing problems in turn-based interaction - has shown promise for solving knowledge and reasoning tasks. However, these methods show limitations when solving complex problems that require longer reasoning chains. We analyze how multi-agent debate over multiple turns drifts away from the initial problem, thus harming task performance. We define this phenomenon as problem drift and quantify its presence across ten tasks (i.e., three generative, three knowledge, three reasoning, and one instruction-following task). To identify the reasons for this issue, eight human experts analyze 170 multi-agent discussions suffering from problem drift. We find the most common issues related to this drift are the lack of progress (35% of cases), low-quality feedback (26% of cases), and a lack of clarity (25% of cases). To address problem drift, we propose DRIFTJudge, an LLM-as-a-judge method, to detect problem drift at test-time. We also propose DRIFTPolicy, a method that mitigates problem drift cases to improve task performance. Our study is a step toward understanding a key limitation of multi-agent debate, highlighting why longer debates can harm task performance and how problem drift could be addressed.
Chinese: 多智能体大语言模型辩论常出现“问题漂移”现象,即讨论偏离原始问题而降低任务表现,可通过DRIFTJudge检测和DRIFTPolicy纠正来缓解该问题。
English: Multi-agent debate among large language models often suffers from problem drift, where discussions deviate from the original issue, reducing task performance, and this can be mitigated using DRIFTJudge for detection and DRIFTPolicy for correction.

Authors:Jiazheng Li, Yuxiang Zhou, Junru Lu, Gladys Tyen, Lin Gui, Cesare Aloisi, Yulan He
Title: Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time
Abstract:
Large Language Models (LLMs) often struggle with complex reasoning scenarios. While preference optimization methods enhance reasoning performance through training, they often lack transparency in why one reasoning outcome is preferred over another. Verbal reflection techniques improve explainability but are limited in LLMs' critique and refinement capacity. To address these challenges, we introduce a contrastive reflection synthesis pipeline that enhances the accuracy and depth of LLM-generated reflections. We further propose a dual-model reasoning framework within a verbal reinforcement learning paradigm, decoupling inference-time self-reflection into specialized, trained models for reasoning critique and refinement. Extensive experiments show that our framework outperforms traditional preference optimization methods across all evaluation metrics. Our findings also show that "two heads are better than one", demonstrating that a collaborative Reasoner-Critic model achieves superior reasoning performance and transparency, compared to single-model approaches.
中文摘要:我们提出的双模型框架通过对比反思合成和语言强化学习,采用专门化的推理器与批判器分工协作,相比传统单模型方法显著提升了推理准确性、透明度及整体性能,验证了“两人智慧胜一人”的优势。
English Summary: Our proposed dual-model framework, featuring specialized Reasoner and Critic models, significantly improves reasoning accuracy, transparency, and performance over traditional single-model approaches by enhancing reflection quality through contrastive synthesis and verbal reinforcement learning.

Authors:Lars Benedikt Kaesberg, Jonas Becker, Jan Philip Wahle, Terry Ruas, Bela Gipp
Title: Voting or Consensus? Decision-Making in Multi-Agent Debate
Abstract:
Much of the success of multi-agent debates depends on carefully choosing the right parameters. The decision-making protocol stands out as it can highly impact final model answers, depending on how decisions are reached. Systematic comparison of decision protocols is difficult because many studies alter multiple discussion parameters beyond the protocol. So far, it has been largely unknown how decision-making influences different tasks. This work systematically evaluates the impact of seven decision protocols (e.g., majority voting, unanimity consensus). We change only one variable at a time - the decision protocol - to analyze how different methods affect the collaboration between agents and measure differences in knowledge and reasoning tasks. Our results show that voting protocols improve performance by 13.2% in reasoning tasks and consensus protocols by 2.8% in knowledge tasks compared to other decision protocols. Increasing the number of agents improves performance, while more discussion rounds before voting reduce it. To improve decision-making by increasing answer diversity, we propose two new methods, All-Agents Drafting (AAD) and Collective Improvement (CI). Our methods improve task performance by up to 3.3% with AAD and up to 7.4% with CI. This work demonstrates the importance of decision-making in multi-agent debates beyond scaling.
中文: 本研究系统评估了多智能体辩论中的七种决策协议,发现投票协议使推理任务性能提升13.2%而共识协议使知识任务提升2.8%,并提出两种新方法(AAD和CI)最高可额外提升7.4%的任务表现。
English: This study systematically evaluates seven decision protocols in multi-agent debates, revealing that voting protocols enhance reasoning tasks by 13.2% while consensus protocols improve knowledge tasks by 2.8%, and introduces two novel methods (AAD and CI) that further boost performance by up to 7.4%.

Authors:Lars Benedikt Kaesberg, Jonas Becker, Jan Philip Wahle, Terry Ruas, Bela Gipp
Title: Voting or Consensus? Decision-Making in Multi-Agent Debate
Abstract:
Much of the success of multi-agent debates depends on carefully choosing the right parameters. The decision-making protocol stands out as it can highly impact final model answers, depending on how decisions are reached. Systematic comparison of decision protocols is difficult because many studies alter multiple discussion parameters beyond the protocol. So far, it has been largely unknown how decision-making influences different tasks. This work systematically evaluates the impact of seven decision protocols (e.g., majority voting, unanimity consensus). We change only one variable at a time - the decision protocol - to analyze how different methods affect the collaboration between agents and measure differences in knowledge and reasoning tasks. Our results show that voting protocols improve performance by 13.2% in reasoning tasks and consensus protocols by 2.8% in knowledge tasks compared to other decision protocols. Increasing the number of agents improves performance, while more discussion rounds before voting reduce it. To improve decision-making by increasing answer diversity, we propose two new methods, All-Agents Drafting (AAD) and Collective Improvement (CI). Our methods improve task performance by up to 3.3% with AAD and up to 7.4% with CI. This work demonstrates the importance of decision-making in multi-agent debates beyond scaling.
中文: 本研究系统评估了多智能体辩论中的七种决策协议,发现投票协议使推理任务性能提升13.2%而共识协议使知识任务提升2.8%,并提出两种新方法(AAD和CI)最高可额外提升7.4%的任务表现。
English: This study systematically evaluates seven decision protocols in multi-agent debates, revealing that voting protocols enhance reasoning tasks by 13.2% while consensus protocols improve knowledge tasks by 2.8%, and introduces two novel methods (AAD and CI) that further boost performance by up to 7.4%.

Authors:Fengbin Guan, Zihao Yu, Yiting Lu, Xin Li, Zhibo Chen
Title: InternVQA: Advancing Compressed Video Quality Assessment with Distilling Large Foundation Model
Abstract:
Video quality assessment tasks rely heavily on the rich features required for video understanding, such as semantic information, texture, and temporal motion. The existing video foundational model, InternVideo2, has demonstrated strong potential in video understanding tasks due to its large parameter size and large-scale multimodal data pertaining. Building on this, we explored the transferability of InternVideo2 to video quality assessment under compression scenarios. To design a lightweight model suitable for this task, we proposed a distillation method to equip the smaller model with rich compression quality priors. Additionally, we examined the performance of different backbones during the distillation process. The results showed that, compared to other methods, our lightweight model distilled from InternVideo2 achieved excellent performance in compression video quality assessment.
中文: 本研究基于InternVideo2基础模型的视频理解能力,通过蒸馏方法设计了一个轻量级模型,在压缩视频质量评估中表现出色。
English: The study leverages the video understanding capabilities of the InternVideo2 foundational model to develop a lightweight model through distillation, which excels in assessing video quality under compression scenarios.

Authors:Bingke Zhu, Xiaoxiao Wang, Minghui Jia, Yihan Tao, Xiao Kong, Ali Luo, Yingying Chen, Ming Tang, Jinqiao Wang
Title: FLARE: A Framework for Stellar Flare Forecasting using Stellar Physical Properties and Historical Records
Abstract:
Stellar flare events are critical observational samples for astronomical research; however, recorded flare events remain limited. Stellar flare forecasting can provide additional flare event samples to support research efforts. Despite this potential, no specialized models for stellar flare forecasting have been proposed to date. In this paper, we present extensive experimental evidence demonstrating that both stellar physical properties and historical flare records are valuable inputs for flare forecasting tasks. We then introduce FLARE (Forecasting Light-curve-based Astronomical Records via features Ensemble), the first-of-its-kind large model specifically designed for stellar flare forecasting. FLARE integrates stellar physical properties and historical flare records through a novel Soft Prompt Module and Residual Record Fusion Module. Our experiments on the publicly available Kepler light curve dataset demonstrate that FLARE achieves superior performance compared to other methods across all evaluation metrics. Finally, we validate the forecast capability of our model through a comprehensive case study.
Chinese: 本文提出了首个专门用于恒星耀斑预测的模型FLARE,它通过创新的模块整合了恒星物理特性和历史耀斑记录,并在开普勒数据集上展现出卓越的性能。
English: This paper introduces FLARE, the first specialized model for stellar flare forecasting, which integrates stellar physical properties and historical flare records through innovative modules and demonstrates superior performance on the Kepler dataset.

Authors:Xinwei Long, Zhiyuan Ma, Ermo Hua, Kaiyan Zhang, Biqing Qi, Bowen Zhou
Title: Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines
Abstract:
Retrieval-augmented generation (RAG) has emerged to address the knowledge-intensive visual question answering (VQA) task. Current methods mainly employ separate retrieval and generation modules to acquire external knowledge and generate answers, respectively. We propose ReAuSE, an alternative to the previous RAG model for the knowledge-based VQA task, which seamlessly integrates knowledge retriever into the generative multi-modal large language model, serving as a built-in search engine. Specifically, our model functions both as a generative retriever and an accurate answer generator. It not only helps retrieve documents from the knowledge base by producing identifiers for each document, but it also answers visual questions based on the retrieved documents. Furthermore, we propose a reinforced retrieval calibration module from relevance feedback to improve retrieval performance and align with the preferences for accurate answer generation. Extensive experiments on two representative OKVQA and A-OKVQA datasets demonstrate significant improvements ranging from 2.9\% to 9.6\% across all evaluation metrics when compared to strong baselines.
中文:ReAuSE将知识检索器无缝集成到生成式多模态大语言模型中,作为内置搜索引擎,既充当生成式检索器又作为答案生成器,通过强化检索校准提升性能,在多个评估指标上显著优于基线方法。
English: ReAuSE integrates a knowledge retriever directly into a generative multimodal large language model, functioning as both a generative retriever and answer generator to enhance visual question answering performance, with reinforced retrieval calibration further improving results across benchmarks.

Authors:Jiancheng An, Chau Yuen, Marco Di Renzo, Mérouane Debbah, H. Vincent Poor, Lajos Hanzo
Title: Downlink Multiuser Communications Relying on Flexible Intelligent Metasurfaces
Abstract:
A flexible intelligent metasurface (FIM) is composed of an array of low-cost radiating elements, each of which can independently radiate electromagnetic signals and flexibly adjust its position through a 3D surface-morphing process. In our system, an FIM is deployed at a base station (BS) that transmits to multiple single-antenna users. We formulate an optimization problem for minimizing the total downlink transmit power at the BS by jointly optimizing the transmit beamforming and the FIM's surface shape, subject to an individual signal-to-interference-plus-noise ratio (SINR) constraint for each user as well as to a constraint on the maximum morphing range of the FIM. To address this problem, an efficient alternating optimization method is proposed to iteratively update the FIM's surface shape and the transmit beamformer to gradually reduce the transmit power. Finally, our simulation results show that at a given data rate the FIM reduces the transmit power by about $3$ dB compared to conventional rigid 2D arrays.
中文摘要:柔性智能超表面系统通过交替优化波束成形和曲面形态,在满足用户信干噪比要求的同时显著降低基站发射功率,相比传统刚性阵列可减少约3分贝功耗。
English Summary: A flexible intelligent metasurface (FIM) system at a base station jointly optimizes beamforming and surface shape through alternating optimization to minimize transmit power while meeting user SINR requirements, achieving approximately 3 dB power reduction compared to rigid arrays.

Authors:Manisha Mukherjee, Sungchul Kim, Xiang Chen, Dan Luo, Tong Yu, Tung Mai
Title: From Documents to Dialogue: Building KG-RAG Enhanced AI Assistants
Abstract:
The Adobe Experience Platform AI Assistant is a conversational tool that enables organizations to interact seamlessly with proprietary enterprise data through a chatbot. However, due to access restrictions, Large Language Models (LLMs) cannot retrieve these internal documents, limiting their ability to generate accurate zero-shot responses. To overcome this limitation, we use a Retrieval-Augmented Generation (RAG) framework powered by a Knowledge Graph (KG) to retrieve relevant information from external knowledge sources, enabling LLMs to answer questions over private or previously unseen document collections. In this paper, we propose a novel approach for building a high-quality, low-noise KG. We apply several techniques, including incremental entity resolution using seed concepts, similarity-based filtering to deduplicate entries, assigning confidence scores to entity-relation pairs to filter for high-confidence pairs, and linking facts to source documents for provenance. Our KG-RAG system retrieves relevant tuples, which are added to the user prompts context before being sent to the LLM generating the response. Our evaluation demonstrates that this approach significantly enhances response relevance, reducing irrelevant answers by over 50% and increasing fully relevant answers by 88% compared to the existing production system.
Chinese: Adobe体验平台AI助手采用知识图谱增强的检索增强生成框架,使大型语言模型能够访问和利用外部知识,显著提升了回答相关性,将不相关回答减少超过50%,并将完全相关回答增加88%。
English: The Adobe Experience Platform AI Assistant uses a Knowledge Graph-enhanced Retrieval-Augmented Generation (KG-RAG) framework to enable Large Language Models to access and utilize external knowledge, significantly improving response relevance by reducing irrelevant answers by over 50% and increasing fully relevant answers by 88%.

Authors:Wen Wang, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen
Title: MATS: An Audio Language Model under Text-only Supervision
Abstract:
Large audio-language models (LALMs), built upon powerful Large Language Models (LLMs), have exhibited remarkable audio comprehension and reasoning capabilities. However, the training of LALMs demands a large corpus of audio-language pairs, which requires substantial costs in both data collection and training resources. In this paper, we propose MATS, an audio-language multimodal LLM designed to handle Multiple Audio task using solely Text-only Supervision. By leveraging pre-trained audio-language alignment models such as CLAP, we develop a text-only training strategy that projects the shared audio-language latent space into LLM latent space, endowing the LLM with audio comprehension capabilities without relying on audio data during training. To further bridge the modality gap between audio and language embeddings within CLAP, we propose the Strongly-related noisy text with audio (Santa) mechanism. Santa maps audio embeddings into CLAP language embedding space while preserving essential information from the audio input. Extensive experiments demonstrate that MATS, despite being trained exclusively on text data, achieves competitive performance compared to recent LALMs trained on large-scale audio-language pairs.
中文摘要:提出的MATS模型通过利用预训练对齐模型和创新的模态融合机制,仅需文本监督即可处理多模态音频任务,在无需训练阶段音频数据的情况下实现了与主流模型相媲美的性能。
English Summary: The proposed MATS model enables multimodal audio-language tasks using text-only supervision by leveraging pre-trained alignment models and a novel mechanism to bridge modality gaps, achieving competitive performance without audio data during training.

Authors:Jingtong Yue, Zhiwei Lin, Xin Lin, Xiaoyu Zhou, Xiangtai Li, Lu Qi, Yongtao Wang, Ming-Hsuan Yang
Title: RobuRCDet: Enhancing Robustness of Radar-Camera Fusion in Bird's Eye View for 3D Object Detection
Abstract:
While recent low-cost radar-camera approaches have shown promising results in multi-modal 3D object detection, both sensors face challenges from environmental and intrinsic disturbances. Poor lighting or adverse weather conditions degrade camera performance, while radar suffers from noise and positional ambiguity. Achieving robust radar-camera 3D object detection requires consistent performance across varying conditions, a topic that has not yet been fully explored. In this work, we first conduct a systematic analysis of robustness in radar-camera detection on five kinds of noises and propose RobuRCDet, a robust object detection model in BEV. Specifically, we design a 3D Gaussian Expansion (3DGE) module to mitigate inaccuracies in radar points, including position, Radar Cross-Section (RCS), and velocity. The 3DGE uses RCS and velocity priors to generate a deformable kernel map and variance for kernel size adjustment and value distribution. Additionally, we introduce a weather-adaptive fusion module, which adaptively fuses radar and camera features based on camera signal confidence. Extensive experiments on the popular benchmark, nuScenes, show that our model achieves competitive results in regular and noisy conditions.
中文摘要:本研究提出RobuRCDet鲁棒雷达-相机三维目标检测模型,通过3D高斯扩展模块修正雷达点云数据,结合天气自适应融合模块动态整合多模态特征,在nuScenes数据集上实现了常规与噪声条件下的优越检测性能。
English Summary: This study introduces RobuRCDet, a robust radar-camera 3D object detection model that addresses sensor limitations through a 3D Gaussian Expansion module for radar point refinement and a weather-adaptive fusion module for dynamic feature integration, achieving competitive performance in both normal and adverse conditions on the nuScenes benchmark.

Authors:Frederic Kirstein, Muneeb Khan, Jan Philip Wahle, Terry Ruas, Bela Gipp
Title: You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations
Abstract:
Meeting summarization suffers from limited high-quality data, mainly due to privacy restrictions and expensive collection processes. We address this gap with FAME, a dataset of 500 meetings in English and 300 in German produced by MIMIC, our new multi-agent meeting synthesis framework that generates meeting transcripts on a given knowledge source by defining psychologically grounded participant profiles, outlining the conversation, and orchestrating a large language model (LLM) debate. A modular post-processing step refines these outputs, mitigating potential repetitiveness and overly formal tones, ensuring coherent, credible dialogues at scale. We also propose a psychologically grounded evaluation framework assessing naturalness, social behavior authenticity, and transcript difficulties. Human assessments show that FAME approximates real-meeting spontaneity (4.5/5 in naturalness), preserves speaker-centric challenges (3/5 in spoken language), and introduces richer information-oriented difficulty (4/5 in difficulty). These findings highlight that FAME is a good and scalable proxy for real-world meeting conditions. It enables new test scenarios for meeting summarization research and other conversation-centric applications in tasks requiring conversation data or simulating social scenarios under behavioral constraints.
中文: FAME数据集通过MIMIC框架生成心理基础真实的英文和德文会议转录,解决了高质量会议数据稀缺的问题,人类评估证实其作为真实会议可扩展替代方案的有效性。
English: The FAME dataset, created using the MIMIC framework, addresses the scarcity of high-quality meeting data by generating realistic, psychologically grounded transcripts in English and German, with human evaluations confirming its effectiveness as a scalable proxy for real meetings.

Authors:Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine de Kock, Nirmal Surange, Daniela Teodorescu, Ibrahim Said Ahmad, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino D. M. A. Ali, Ilseyar Alimova, Vladimir Araujo, Nikolay Babakov, Naomi Baes, Ana-Maria Bucur, Andiswa Bukula, Guanqun Cao, Rodrigo Tufino Cardenas, Rendi Chevi, Chiamaka Ijeoma Chukwuneke, Alexandra Ciobotaru, Daryna Dementieva, Murja Sani Gadanya, Robert Geislinger, Bela Gipp, Oumaima Hourrane, Oana Ignat, Falalu Ibrahim Lawan, Rooweither Mabuya, Rahmad Mahendra, Vukosi Marivate, Alexander Panchenko, Andrew Piper, Charles Henrique Porto Ferreira, Vitaly Protasov, Samuel Rutunda, Manish Shrivastava, Aura Cristina Udrea, Lilian Diana Awuor Wanzare, Sophie Wu, Florian Valentin Wunderlich, Hanif Muhammad Zhafran, Tianhui Zhang, Yi Zhou, Saif M. Mohammad
Title: BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages
Abstract:
People worldwide use language in subtle and complex ways to express emotions. Although emotion recognition--an umbrella term for several NLP tasks--impacts various applications within NLP and beyond, most work in this area has focused on high-resource languages. This has led to significant disparities in research efforts and proposed solutions, particularly for under-resourced languages, which often lack high-quality annotated datasets. In this paper, we present BRIGHTER--a collection of multi-labeled, emotion-annotated datasets in 28 different languages and across several domains. BRIGHTER primarily covers low-resource languages from Africa, Asia, Eastern Europe, and Latin America, with instances labeled by fluent speakers. We highlight the challenges related to the data collection and annotation processes, and then report experimental results for monolingual and crosslingual multi-label emotion identification, as well as emotion intensity recognition. We analyse the variability in performance across languages and text domains, both with and without the use of LLMs, and show that the BRIGHTER datasets represent a meaningful step towards addressing the gap in text-based emotion recognition.
中文: 本文介绍了BRIGHTER数据集,涵盖28种语言的多标签情感标注,旨在弥补资源匮乏语言在情感识别研究中的不足,并通过跨语言实验和性能差异分析验证了其有效性。
English: The paper introduces BRIGHTER, a multi-labeled emotion dataset covering 28 languages to address the research gap in emotion recognition for under-resourced languages, demonstrating its utility through cross-lingual experiments and analysis of performance variability.

Authors:Junda Wu, Yuxin Xiong, Xintong Li, Yu Xia, Ruoyu Wang, Yu Wang, Tong Yu, Sungchul Kim, Ryan A. Rossi, Lina Yao, Jingbo Shang, Julian McAuley
Title: Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent
Abstract:
Recent MLLMs have shown emerging visual understanding and reasoning abilities after being pre-trained on large-scale multimodal datasets. Unlike pre-training, where MLLMs receive rich visual-text alignment, instruction-tuning is often text-driven with weaker visual supervision, leading to the degradation of pre-trained visual understanding and causing visual forgetting. Existing approaches, such as direct fine-tuning and continual learning methods, fail to explicitly address this issue, often compressing visual representations and prioritizing task alignment over visual retention, which further worsens visual forgetting. To overcome this limitation, we introduce a novel perspective leveraging effective rank to quantify the degradation of visual representation richness, interpreting this degradation through the information bottleneck principle as excessive compression that leads to the degradation of crucial pre-trained visual knowledge. Building on this view, we propose a modality-decoupled gradient descent (MDGD) method that regulates gradient updates to maintain the effective rank of visual representations while mitigating the over-compression effects described by the information bottleneck. By explicitly disentangling the optimization of visual understanding from task-specific alignment, MDGD preserves pre-trained visual knowledge while enabling efficient task adaptation. To enable lightweight instruction-tuning, we further develop a memory-efficient fine-tuning approach using gradient masking, which selectively updates a subset of model parameters to enable parameter-efficient fine-tuning (PEFT), reducing computational overhead while preserving rich visual representations. Extensive experiments across various downstream tasks and backbone MLLMs demonstrate that MDGD effectively mitigates visual forgetting from pre-trained tasks while enabling strong adaptation to new tasks.
中文摘要: 本文提出一种模态解耦梯度下降方法,通过在指令微调过程中保持视觉表征的有效秩并采用选择性参数更新来防止过度压缩,从而有效缓解多模态大模型的视觉遗忘问题。
English Summary: This paper introduces a modality-decoupled gradient descent method to mitigate visual forgetting in MLLMs during instruction-tuning by preserving the effective rank of visual representations and preventing over-compression through selective parameter updates.

Authors:William Chen, Jinchuan Tian, Yifan Peng, Brian Yan, Chao-Han Huck Yang, Shinji Watanabe
Title: OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models
Abstract:
Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters, with the 18B version being the largest speech model, to the best of our knowledge. OWLS leverages up to 360K hours of public speech data across 150 languages, enabling a systematic investigation into how data, model, and compute scaling each influence performance in multilingual speech tasks. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. One of our key findings is that scaling enhances performance on low-resource languages/dialects, helping to mitigate bias and improve the accessibility of speech technologies. Finally, we show how OWLS can be used to power new research directions by discovering emergent abilities in large-scale speech models. Model checkpoints will be released on https://huggingface.co/collections/espnet/owls-scaling-laws-for-speech-recognition-and-translation-67ab7f991c194065f057ce8d for future studies.
中文: OWLS项目推出了一套可扩展的多语言语音模型,研究表明神经缩放定律能提升低资源语言的性能,并减少语音技术中的偏见。
English: The OWLS project introduces a scalable suite of multilingual speech models, demonstrating that neural scaling laws enhance performance for low-resource languages and reduce bias in speech technologies.

Authors:Runze Liu, Junqi Gao, Jian Zhao, Kaiyan Zhang, Xiu Li, Biqing Qi, Wanli Ouyang, Bowen Zhou
Title: Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
Abstract:
Test-Time Scaling (TTS) is an important method for improving the performance of Large Language Models (LLMs) by using additional computation during the inference phase. However, current studies do not systematically analyze how policy models, Process Reward Models (PRMs), and problem difficulty influence TTS. This lack of analysis limits the understanding and practical use of TTS methods. In this paper, we focus on two core questions: (1) What is the optimal approach to scale test-time computation across different policy models, PRMs, and problem difficulty levels? (2) To what extent can extended computation improve the performance of LLMs on complex tasks, and can smaller language models outperform larger ones through this approach? Through comprehensive experiments on MATH-500 and challenging AIME24 tasks, we have the following observations: (1) The compute-optimal TTS strategy is highly dependent on the choice of policy model, PRM, and problem difficulty. (2) With our compute-optimal TTS strategy, extremely small policy models can outperform larger models. For example, a 1B LLM can exceed a 405B LLM on MATH-500. Moreover, on both MATH-500 and AIME24, a 0.5B LLM outperforms GPT-4o, a 3B LLM surpasses a 405B LLM, and a 7B LLM beats o1 and DeepSeek-R1, while with higher inference efficiency. These findings show the significance of adapting TTS strategies to the specific characteristics of each task and model and indicate that TTS is a promising approach for enhancing the reasoning abilities of LLMs.
中文: 测试时扩展(TTS)通过根据策略模型、奖励模型和任务难度优化计算策略,显著提升大型语言模型性能,使小模型能以更高效率超越大模型。
English: Test-Time Scaling (TTS) enhances LLM performance by optimizing computation strategies based on policy models, reward models, and task difficulty, enabling smaller models to outperform larger ones with higher efficiency.

Authors:Xing Jia, Jiancheng An, Hao Liu, Lu Gan, Marco Di Renzo, Mérouane Debbah, Chau Yuen
Title: Stacked Intelligent Metasurface Enabled Near-Field Multiuser Beamfocusing in the Wave Domain
Abstract:
Intelligent surfaces represent a breakthrough technology capable of customizing the wireless channel cost-effectively. However, the existing works generally focus on planar wavefront, neglecting near-field spherical wavefront characteristics caused by large array aperture and high operation frequencies in the terahertz (THz). Additionally, the single-layer reconfigurable intelligent surface (RIS) lacks the signal processing ability to mitigate the computational complexity at the base station (BS). To address this issue, we introduce a novel stacked intelligent metasurfaces (SIM) comprised of an array of programmable metasurface layers. The SIM aims to substitute conventional digital baseband architecture to execute computing tasks with ultra-low processing delay, albeit with a reduced number of radio-frequency (RF) chains and low-resolution digital-to-analog converters. In this paper, we present a SIM-aided multiuser multiple-input single-output (MU-MISO) near-field system, where the SIM is integrated into the BS to perform beamfocusing in the wave domain and customize an end-to-end channel with minimized inter-user interference. Finally, the numerical results demonstrate that near-field communication achieves superior spatial gain over the far-field, and the SIM effectively suppresses inter-user interference as the wireless signals propagate through it.
中文摘要:智能表面技术虽能经济地定制无线信道,但现有研究忽略了近场球形波前特性且缺乏信号处理能力,为此提出的堆叠智能超表面(SIM)通过波域波束聚焦有效抑制用户间干扰,在近场通信中实现更优性能。
English Summary: Intelligent surfaces offer a cost-effective way to customize wireless channels, but current designs overlook near-field effects and lack processing capabilities, leading to the introduction of stacked intelligent metasurfaces (SIM) that enable efficient near-field beamfocusing and interference suppression with minimal hardware.

Authors:Mengdi Liu, Zhangyang Gao, Hong Chang, Stan Z. Li, Shiguang Shan, Xilin Chen
Title: G2PDiffusion: Cross-Species Genotype-to-Phenotype Prediction via Evolutionary Diffusion
Abstract:
Understanding how genes influence phenotype across species is a fundamental challenge in genetic engineering, which will facilitate advances in various fields such as crop breeding, conservation biology, and personalized medicine. However, current phenotype prediction models are limited to individual species and expensive phenotype labeling process, making the genotype-to-phenotype prediction a highly domain-dependent and data-scarce problem. To this end, we suggest taking images as morphological proxies, facilitating cross-species generalization through large-scale multimodal pretraining. We propose the first genotype-to-phenotype diffusion model (G2PDiffusion) that generates morphological images from DNA considering two critical evolutionary signals, i.e., multiple sequence alignments (MSA) and environmental contexts. The model contains three novel components: 1) a MSA retrieval engine that identifies conserved and co-evolutionary patterns; 2) an environment-aware MSA conditional encoder that effectively models complex genotype-environment interactions; and 3) an adaptive phenomic alignment module to improve genotype-phenotype consistency. Extensive experiments show that integrating evolutionary signals with environmental context enriches the model's understanding of phenotype variability across species, thereby offering a valuable and promising exploration into advanced AI-assisted genomic analysis.
中文摘要:本研究提出了首个基因型到表型的扩散模型G2PDiffusion,通过整合多序列比对和环境背景等进化信号,从DNA生成形态图像,以提升跨物种表型预测的准确性。
English Summary: The study introduces G2PDiffusion, the first diffusion model that generates morphological images from DNA by integrating evolutionary signals like multiple sequence alignments and environmental contexts to improve cross-species phenotype prediction.

Authors:Kiarash Banihashem, Xiang Chen, MohammadTaghi Hajiaghayi, Sungchul Kim, Kanak Mahadik, Ryan Rossi, Tong Yu
Title: Pandora with Inaccurate Priors
Abstract:
We investigate the role of inaccurate priors for the classical Pandora's box problem. In the classical Pandora's box problem we are given a set of boxes each with a known cost and an unknown value sampled from a known distribution. We investigate how inaccuracies in the beliefs can affect existing algorithms. Specifically, we assume that the knowledge of the underlying distribution has a small error in the Kolmogorov distance, and study how this affects the utility obtained by the optimal algorithm.
Chinese: 本研究探讨了先验信念不准确对经典潘多拉魔盒问题的影响,分析了在科尔莫戈罗夫距离下分布知识的小误差如何影响最优算法的效用。
English: This study examines the impact of inaccurate prior beliefs on the classical Pandora's box problem, analyzing how small errors in the distribution knowledge affect the utility of optimal algorithms under Kolmogorov distance.

Authors:Songwen Hu, Ryan A. Rossi, Tong Yu, Junda Wu, Handong Zhao, Sungchul Kim, Shuai Li
Title: Interactive Visualization Recommendation with Hier-SUCB
Abstract:
Visualization recommendation aims to enable rapid visual analysis of massive datasets. In real-world scenarios, it is essential to quickly gather and comprehend user preferences to cover users from diverse backgrounds, including varying skill levels and analytical tasks. Previous approaches to personalized visualization recommendations are non-interactive and rely on initial user data for new users. As a result, these models cannot effectively explore options or adapt to real-time feedback. To address this limitation, we propose an interactive personalized visualization recommendation (PVisRec) system that learns on user feedback from previous interactions. For more interactive and accurate recommendations, we propose Hier-SUCB, a contextual combinatorial semi-bandit in the PVisRec setting. Theoretically, we show an improved overall regret bound with the same rank of time but an improved rank of action space. We further demonstrate the effectiveness of Hier-SUCB through extensive experiments where it is comparable to offline methods and outperforms other bandit algorithms in the setting of visualization recommendation.
中文摘要:本研究提出的交互式PVisRec系统通过Hier-SUCB算法,能够根据用户实时反馈进行个性化可视化推荐,在性能上超越了传统静态方法及其他赌博算法。
English Summary: The proposed interactive PVisRec system with Hier-SUCB algorithm enables real-time personalized visualization recommendations by learning from user feedback, achieving improved performance over static methods and other bandit approaches.

Authors:Yu Qiu, Xin Lin, Jingbo Wang, Xiangtai Li, Lu Qi, Ming-Hsuan Yang
Title: UMC: Unified Resilient Controller for Legged Robots with Joint Malfunctions
Abstract:
Adaptation to unpredictable damages is crucial for autonomous legged robots, yet existing methods based on multi-policy or meta-learning frameworks face challenges like limited generalization and complex maintenance. To address this issue, we first analyze and summarize eight types of damage scenarios, including sensor failures and joint malfunctions. Then, we propose a novel, model-free, two-stage training framework, Unified Malfunction Controller (UMC), incorporating a masking mechanism to enhance damage resilience. Specifically, the model is initially trained with normal environments to ensure robust performance under standard conditions. In the second stage, we use masks to prevent the legged robot from relying on malfunctioning limbs, enabling adaptive gait and movement adjustments upon malfunction. Experimental results demonstrate that our approach improves the task completion capability by an average of 36% for the transformer and 39% for the MLP across three locomotion tasks. The source code and trained models will be made available to the public.
中文:提出的统一故障控制器(UMC)采用两阶段训练和掩码机制,增强了腿式机器人对各种损伤的适应能力,实验中将任务完成率平均提高了36%以上。
English: The proposed Unified Malfunction Controller (UMC) employs a two-stage training approach with a masking mechanism to enhance legged robots' resilience to various damages, improving task completion by over 36% on average in experiments.

Authors:Wenhao Wang, Mengying Yuan, Zijie Yu, Guangyi Liu, Rui Ye, Tian Jin, Siheng Chen, Yanfeng Wang
Title: MobileA3gent: Training Mobile GUI Agents Using Decentralized Self-Sourced Data from Diverse Users
Abstract:
The advancement of mobile GUI agents has opened new opportunities for automating tasks on mobile devices. Training these agents requires large-scale high-quality data, which is prohibitively expensive when relying on human labor. Given the vast population of global mobile phone users, if automated data collection from them becomes feasible, the resulting data volume and the subsequently trained mobile agents could reach unprecedented levels. Nevertheless, two major challenges arise: (1) extracting user instructions without human intervention and (2) utilizing distributed user data while preserving privacy. To tackle these challenges, we propose MobileA3gent, a collaborative framework that trains mobile GUI Agents using decentralized self-sourced data from diverse users. The framework comprises two components, each targeting a specific challenge: (1) Auto-Annotation, which enables the automatic collection of high-quality datasets during users' routine phone usage with minimal cost. (2) FedVLM-A, which enhances federated VLM training under non-IID distributions by incorporating adapted global aggregation based on both episode-level and step-level variability. Extensive experiments prove that MobileA3gent achieves superior performance over traditional approaches at only 1% of the cost, highlighting its potential for real-world applications
Chinese: MobileA3gent提出了一种协作框架,通过自动化数据标注和改进联邦视觉语言模型训练,以仅1%的传统成本实现高性能移动GUI代理的开发,同时保障隐私和成本效益。
English: MobileA3gent introduces a collaborative framework that automates data annotation and enhances federated visual-language model training, enabling cost-effective and privacy-preserving development of mobile GUI agents with superior performance at only 1% of traditional costs.

Authors:Shenao Wang, Yanjie Zhao, Yinglin Xie, Zhao Liu, Xinyi Hou, Quanchen Zou, Haoyu Wang
Title: Towards Reliable Vector Database Management Systems: A Software Testing Roadmap for 2030
Abstract:
The rapid growth of Large Language Models (LLMs) and AI-driven applications has propelled Vector Database Management Systems (VDBMSs) into the spotlight as a critical infrastructure component. VDBMS specializes in storing, indexing, and querying dense vector embeddings, enabling advanced LLM capabilities such as retrieval-augmented generation, long-term memory, and caching mechanisms. However, the explosive adoption of VDBMS has outpaced the development of rigorous software testing methodologies tailored for these emerging systems. Unlike traditional databases optimized for structured data, VDBMS face unique testing challenges stemming from the high-dimensional nature of vector data, the fuzzy semantics in vector search, and the need to support dynamic data scaling and hybrid query processing. In this paper, we begin by conducting an empirical study of VDBMS defects and identify key challenges in test input generation, oracle definition, and test evaluation. Drawing from these insights, we propose the first comprehensive research roadmap for developing effective testing methodologies tailored to VDBMS. By addressing these challenges, the software testing community can contribute to the development of more reliable and trustworthy VDBMS, enabling the full potential of LLMs and data-intensive AI applications.
中文: 向量数据库管理系统在AI应用中的迅速普及凸显了专用测试方法的缺失,本文通过分析其高维数据和模糊语义等独特挑战,提出了首个全面研究路线图以提升系统可靠性。
English: The rapid adoption of Vector Database Management Systems (VDBMS) for AI applications has exposed a critical gap in specialized testing methodologies, prompting this paper to propose a research roadmap addressing unique challenges like high-dimensional data and fuzzy search semantics to enhance system reliability.

Authors:Yuanyuan Xu, Wenjie Zhang, Ying Zhang, Xuemin Lin, Xiwei Xu
Title: Unlocking Multi-Modal Potentials for Link Prediction on Dynamic Text-Attributed Graphs
Abstract:
Dynamic Text-Attributed Graphs (DyTAGs) are a novel graph paradigm that captures evolving temporal events (edges) alongside rich textual attributes. Existing studies can be broadly categorized into TGNN-driven and LLM-driven approaches, both of which encode textual attributes and temporal structures for DyTAG representation. We observe that DyTAGs inherently comprise three distinct modalities: temporal, textual, and structural, often exhibiting completely disjoint distributions. However, the first two modalities are largely overlooked by existing studies, leading to suboptimal performance. To address this, we propose MoMent, a multi-modal model that explicitly models, integrates, and aligns each modality to learn node representations for link prediction. Given the disjoint nature of the original modality distributions, we first construct modality-specific features and encode them using individual encoders to capture correlations across temporal patterns, semantic context, and local structures. Each encoder generates modality-specific tokens, which are then fused into comprehensive node representations with a theoretical guarantee. To avoid disjoint subspaces of these heterogeneous modalities, we propose a dual-domain alignment loss that first aligns their distributions globally and then fine-tunes coherence at the instance level. This enhances coherent representations from temporal, textual, and structural views. Extensive experiments across seven datasets show that MoMent achieves up to 17.28% accuracy improvement and up to 31x speed-up against eight baselines.
中文:MoMent模型通过模态专用编码器和双域对齐损失,显式建模、整合并对齐动态文本属性图中的时间、文本和结构三种模态,在多个数据集上实现了准确率和效率的显著提升。
English: The proposed MoMent model addresses limitations in Dynamic Text-Attributed Graphs by explicitly modeling, integrating, and aligning temporal, textual, and structural modalities through modality-specific encoders and a dual-domain alignment loss, achieving significant improvements in accuracy and efficiency across multiple datasets.

Authors:Guikun Chen, Xu Zhang, Xiaolin Hu, Yong Liu, Yi Yang, Wenguan Wang
Title: Chemical knowledge-informed framework for privacy-aware retrosynthesis learning
Abstract:
Chemical reaction data is a pivotal asset, driving advances in competitive fields such as pharmaceuticals, materials science, and industrial chemistry. Its proprietary nature renders it sensitive, as it often includes confidential insights and competitive advantages organizations strive to protect. However, in contrast to this need for confidentiality, the current standard training paradigm for machine learning-based retrosynthesis gathers reaction data from multiple sources into one single edge to train prediction models. This paradigm poses considerable privacy risks as it necessitates broad data availability across organizational boundaries and frequent data transmission between entities, potentially exposing proprietary information to unauthorized access or interception during storage and transfer. In the present study, we introduce the chemical knowledge-informed framework (CKIF), a privacy-preserving approach for learning retrosynthesis models. CKIF enables distributed training across multiple chemical organizations without compromising the confidentiality of proprietary reaction data. Instead of gathering raw reaction data, CKIF learns retrosynthesis models through iterative, chemical knowledge-informed aggregation of model parameters. In particular, the chemical properties of predicted reactants are leveraged to quantitatively assess the observable behaviors of individual models, which in turn determines the adaptive weights used for model aggregation. On a variety of reaction datasets, CKIF outperforms several strong baselines by a clear margin.
Chinese: 本研究提出了CKIF框架,通过基于化学知识的模型参数聚合实现跨组织分布式训练,在保护专有反应数据隐私的同时显著提升了逆合成预测性能,明显优于现有基线方法。
English: The study introduces CKIF, a privacy-preserving framework that enables distributed training of retrosynthesis models across organizations without sharing proprietary chemical reaction data, outperforming existing methods by leveraging chemical knowledge for secure model aggregation.

Authors:Minh Duc Vu, Jieshan Chen, Zhenchang Xing, Qinghua Lu, Xiwei Xu, Qian Fu
Title: FactFlow: Automatic Fact Sheet Generation and Customization from Tabular Dataset via AI Chain Design & Implementation
Abstract:
With the proliferation of data across various domains, there is a critical demand for tools that enable non-experts to derive meaningful insights without deep data analysis skills. To address this need, existing automatic fact sheet generation tools offer heuristic-based solutions to extract facts and generate stories. However, they inadequately grasp the semantics of data and struggle to generate narratives that fully capture the semantics of the dataset or align the fact sheet with specific user needs. Addressing these shortcomings, this paper introduces \tool, a novel tool designed for the automatic generation and customisation of fact sheets. \tool applies the concept of collaborative AI workers to transform raw tabular dataset into comprehensive, visually compelling fact sheets. We define effective taxonomy to profile AI worker for specialised tasks. Furthermore, \tool empowers users to refine these fact sheets through intuitive natural language commands, ensuring the final outputs align closely with individual preferences and requirements. Our user evaluation with 18 participants confirms that \tool not only surpasses state-of-the-art baselines in automated fact sheet production but also provides a positive user experience during customization tasks.
Chinese: 本文介绍了一种名为\tool的新型工具,它通过协作式AI工作者将原始数据转化为全面且视觉吸引力强的事实报告,有效解决了现有工具在理解数据语义和适应用户需求方面的不足,并支持用户通过自然语言指令进行个性化定制。
English: This paper introduces a novel tool called \tool that automatically generates and customizes fact sheets from raw data using collaborative AI workers, overcoming the limitations of existing tools by better understanding data semantics and allowing user refinement through natural language commands.

Authors:Qiuming Zhao, Guangzhi Sun, Chao Zhang
Title: Low-Rank and Sparse Model Merging for Multi-Lingual Speech Recognition and Translation
Abstract:
Language diversity presents a significant challenge in speech-to-text (S2T) tasks, such as automatic speech recognition and translation. Traditional multi-lingual multi-task training approaches aim to address this by jointly optimising multiple speech recognition and translation tasks across various languages. While models like Whisper, built on these strategies, demonstrate strong performance, they still face issues of high computational cost, language interference, suboptimal training configurations, and limited extensibility. To overcome these challenges, we introduce LoRS-Merging (low-rank and sparse model merging), a novel technique designed to efficiently integrate models trained on different languages or tasks while preserving performance and reducing computational overhead. LoRS-Merging combines low-rank and sparse pruning to retain essential structures while eliminating redundant parameters, mitigating language interference, and enhancing extensibility. Experimental results across 10 languages demonstrate that LoRS-Merging significantly outperforms multi-lingual multi-task training, sequential training, and other merging methods, achieving over 20% improvement in normalised performance. Our findings suggest that model merging, particularly LoRS-Merging, is a scalable and effective complement to traditional multi-lingual training strategies for S2T applications.
Chinese: 针对多语言语音转文本任务中的计算成本高和语言干扰等问题,创新的LoRS-Merging技术通过结合低秩和稀疏剪枝有效整合模型,在10种语言上实现了超过20%的性能提升。
English: To address challenges like computational cost and language interference in multilingual speech-to-text tasks, the novel LoRS-Merging technique efficiently integrates models by combining low-rank and sparse pruning, achieving over 20% performance improvement across 10 languages.

Authors:Jialiang Hou, Xin Zhou, Neng Pan, Ang Li, Yuxiang Guan, Chao Xu, Zhongxue Gan, Fei Gao
Title: Primitive-Swarm: An Ultra-lightweight and Scalable Planner for Large-scale Aerial Swarms
Abstract:
Achieving large-scale aerial swarms is challenging due to the inherent contradictions in balancing computational efficiency and scalability. This paper introduces Primitive-Swarm, an ultra-lightweight and scalable planner designed specifically for large-scale autonomous aerial swarms. The proposed approach adopts a decentralized and asynchronous replanning strategy. Within it is a novel motion primitive library consisting of time-optimal and dynamically feasible trajectories. They are generated utlizing a novel time-optimial path parameterization algorithm based on reachability analysis (TOPP-RA). Then, a rapid collision checking mechanism is developed by associating the motion primitives with the discrete surrounding space according to conflicts. By considering both spatial and temporal conflicts, the mechanism handles robot-obstacle and robot-robot collisions simultaneously. Then, during a replanning process, each robot selects the safe and minimum cost trajectory from the library based on user-defined requirements. Both the time-optimal motion primitive library and the occupancy information are computed offline, turning a time-consuming optimization problem into a linear-complexity selection problem. This enables the planner to comprehensively explore the non-convex, discontinuous 3-D safe space filled with numerous obstacles and robots, effectively identifying the best hidden path. Benchmark comparisons demonstrate that our method achieves the shortest flight time and traveled distance with a computation time of less than 1 ms in dense environments. Super large-scale swarm simulations, involving up to 1000 robots, running in real-time, verify the scalability of our method. Real-world experiments validate the feasibility and robustness of our approach. The code will be released to foster community collaboration.
中文: 本文提出Primitive-Swarm这一超轻量级大规模无人机群规划器,通过去中心化策略和时间最优运动基元库将复杂规划转化为线性选择问题,在密集环境中实现毫秒级计算并保持实时性能。
English: This paper presents Primitive-Swarm, an ultra-lightweight planner for large-scale aerial swarms that transforms complex motion planning into a linear-selection problem through a decentralized strategy and time-optimal motion primitives, achieving real-time performance with millisecond computation in dense environments.

Authors:Yuanyuan Xu, Wenjie Zhang, Xuemin Lin, Ying Zhang
Title: UniDyG: A Unified and Effective Representation Learning Approach for Large Dynamic Graphs
Abstract:
Dynamic graphs are formulated in continuous-time or discrete-time dynamic graphs. They differ in temporal granularity: Continuous-Time Dynamic Graphs (CTDGs) exhibit rapid, localized changes, while Discrete-Time Dynamic Graphs (DTDGs) show gradual, global updates. This difference leads to isolated developments in representation learning for each type. To advance representation learning, recent research attempts to design a unified model capable of handling both CTDGs and DTDGs. However, it typically focuses on local dynamic propagation for temporal structure learning in the time domain, failing to accurately capture the structural evolution associated with each temporal granularity. In addition, existing works-whether specific or unified-often overlook the issue of temporal noise, compromising the model robustness and effectiveness. To better model both types of dynamic graphs, we propose UniDyG, a unified and effective representation learning approach, which scales to large dynamic graphs. We first propose a novel Fourier Graph Attention (FGAT) mechanism that can model local and global structural correlations based on recent neighbors and complex-number selective aggregation, while theoretically ensuring consistent representations of dynamic graphs over time. Based on approximation theory, we demonstrate that FGAT is well-suited to capture the underlying structures in CTDGs and DTDGs. We further enhance FGAT to resist temporal noise by designing an energy-gated unit, which adaptively filters out high-frequency noise according to the energy. Last, we leverage our FGAT mechanisms for temporal structure learning and employ the frequency-enhanced linear function for node-level dynamic updates, facilitating the generation of high-quality temporal embeddings. Extensive experiments show that our UniDyG achieves an average improvement of 14.4% over sixteen baselines across nine dynamic graphs.
中文: 摘要提出UniDyG这一统一表征学习模型,它采用傅里叶图注意力机制和能量门控噪声过滤技术,能有效捕捉连续时间与离散时间动态图的结构演化,在九类动态图数据上相较基线模型平均提升14.4%性能。
English: The abstract introduces UniDyG, a unified representation learning model that employs a Fourier Graph Attention mechanism and energy-gated noise filtering to effectively capture structural evolution in both continuous-time and discrete-time dynamic graphs, achieving a 14.4% average improvement over baselines.

Authors:Shulin Huang, Linyi Yang, Yan Song, Shuang Chen, Leyang Cui, Ziyu Wan, Qingcheng Zeng, Ying Wen, Kun Shao, Weinan Zhang, Jun Wang, Yue Zhang
Title: ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning
Abstract:
Evaluating large language models (LLMs) poses significant challenges, particularly due to issues of data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to evaluate LLMs' reasoning capability robustly. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset that contains 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that most of the LLMs' performance are far from robust and they face a certain level of data leakage. By dynamically generating OOD datasets, ThinkBench effectively provides a reliable evaluation of LLMs and reduces the impact of data contamination.
中文:ThinkBench是一种新颖的评估框架,通过动态生成分布外数据集来解决大语言模型评估中的数据污染和答案泄露问题,为16个大语言模型和4个推理模型在统一条件下提供了可靠的推理能力评估。
English: ThinkBench is a novel framework that addresses data contamination and answer leakage in LLM evaluation by dynamically generating out-of-distribution datasets, enabling reliable assessment of reasoning capabilities across 16 LLMs and 4 PRMs under uniform conditions.

Authors:Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, Rongrong Ji
Title: Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective
Abstract:
In this paper, we address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. Specifically, we identify a critical issue of ''$\textbf{reconstruction error explosion}$'' in existing LLMs sparsification methods. This refers to the cumulative effect of reconstruction errors throughout the sparsification process, where errors from earlier layers propagate and amplify in subsequent layers. As a result, the overall reconstruction error increases significantly, leading to a substantial degradation in model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the process of determining sparsity rates for multiple layers to the determination of a single common difference hyperparameter. Remarkably, this allows for the optimal layer-wise sparsity rates to be identified with just a few trials. Both our theoretical analysis and experimental results demonstrate that this sparsity allocation scheme is near optimal. Extensive experiments show that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, our method achieves a reduction of 52.10 in perplexity for the 70$\%$ sparse LLaMA2-7B model obtained via Wanda, improves average zero-shot accuracy by 10.50$\%$, and delivers speedups of 2.63$\times$ and 2.23$\times$ on CPU and GPU, respectively.
Chinese: 本文提出一种理论方法,通过采用单调递增的算术序列进行分层稀疏度分配,有效解决了大型语言模型稀疏化过程中的重构误差爆炸问题,显著提升了多种架构和压缩技术下的模型性能与效率。
English: This paper proposes a theoretical solution to prevent reconstruction error explosion in sparsifying large language models by using a monotonically increasing arithmetic progression for layer-wise sparsity allocation, which significantly enhances model performance and efficiency across various architectures and compression techniques.

Authors:Weizhong Huang, Yuxin Zhang, Xiawu Zheng, Fei Chao, Rongrong Ji
Title: Towards Efficient Automatic Self-Pruning of Large Language Models
Abstract:
Despite exceptional capabilities, Large Language Models (LLMs) still face deployment challenges due to their enormous size. Post-training structured pruning is a promising solution that prunes LLMs without the need for retraining, reducing computational overhead, and it is hardware-deployment friendly. However, the training-free nature of post-training structured pruning leads to significant performance degradation. We argue that the key to mitigating this issue lies in accurately determining the pruning rate for each layer. Meanwhile, we find that LLMs may have prior knowledge about their own redundancy. Based on this insight, we introduce $\textbf{Self-Pruner}$ an end-to-end automatic self-pruning framework for LLMs, which efficiently search layer-wise pruning rates. Specifically, $\textbf{Self-Pruner}$ leverages LLMs to autonomously execute the entire evolutionary search process to search for pruning rate configurations. In this process, LLMs are used to generate populations, select parent solutions from the current population, and perform crossover and mutation operations to produce offspring solutions. In this way, LLMs automatically generate and evaluate a large number of candidate solutions, effectively converging to find the pruning rate configurations with minimal human intervention. Extensive experiments demonstrate $\textbf{Self-Pruner}$'s better performance compared to existing state-of-the-art methods. Notably, $\textbf{Self-Pruner}$ prunes LLaMA-2-70B to 49B level with only 0.80$\%$ drop in accuracy across seven commonsense reasoning tasks, achieving a 1.39$\times$ speedup on NVIDIA A100 80GB GPU. Further pruning to 35B level resulted in only a 3.80$\%$ decrease in accuracy while obtaining a 1.70$\times$ speedup.
中文: Self-Pruner是一种自动剪枝框架,通过让大语言模型自主执行进化搜索来确定各层剪枝率,在实现显著模型压缩的同时保持优异性能,并在硬件上获得可观加速效果。
English: Self-Pruner is an automatic pruning framework that enables Large Language Models to autonomously determine layer-wise pruning rates through evolutionary search, achieving significant model compression with minimal performance loss and notable speed improvements on hardware.

Authors:Ataberk Olgun, F. Nisa Bostanci, Ismail Emir Yuksel, Oguzhan Canpolat, Haocong Luo, Geraldo F. Oliveira, A. Giray Yaglikci, Minesh Patel, Onur Mutlu
Title: Variable Read Disturbance: An Experimental Analysis of Temporal Variation in DRAM Read Disturbance
Abstract:
Modern DRAM chips are subject to read disturbance errors. State-of-the-art read disturbance mitigations rely on accurate and exhaustive characterization of the read disturbance threshold (RDT) (e.g., the number of aggressor row activations needed to induce the first RowHammer or RowPress bitflip) of every DRAM row (of which there are millions or billions in a modern system) to prevent read disturbance bitflips securely and with low overhead. We experimentally demonstrate for the first time that the RDT of a DRAM row significantly and unpredictably changes over time. We call this new phenomenon variable read disturbance (VRD). Our experiments using 160 DDR4 chips and 4 HBM2 chips from three major manufacturers yield two key observations. First, it is very unlikely that relatively few RDT measurements can accurately identify the RDT of a DRAM row. The minimum RDT of a DRAM row appears after tens of thousands of measurements (e.g., up to 94,467), and the minimum RDT of a DRAM row is 3.5X smaller than the maximum RDT observed for that row. Second, the probability of accurately identifying a row's RDT with a relatively small number of measurements reduces with increasing chip density or smaller technology node size. Our empirical results have implications for the security guarantees of read disturbance mitigation techniques: if the RDT of a DRAM row is not identified accurately, these techniques can easily become insecure. We discuss and evaluate using a guardband for RDT and error-correcting codes for mitigating read disturbance bitflips in the presence of RDTs that change unpredictably over time. We conclude that a >10% guardband for the minimum observed RDT combined with SECDED or Chipkill-like SSC error-correcting codes could prevent read disturbance bitflips at the cost of large read disturbance mitigation performance overheads (e.g., 45% performance loss for an RDT guardband of 50%).
中文: 现代DRAM芯片存在可变读取干扰(VRD)现象,即读取干扰阈值随时间不可预测地变化,这使得现有缓解技术必须采用大保护带和纠错码来确保安全,但会显著牺牲性能。
English: Modern DRAM chips exhibit variable read disturbance (VRD), where the read disturbance threshold changes unpredictably over time, making current mitigation techniques insecure unless they incorporate large guardbands and error-correcting codes at significant performance cost.

Authors:Shenao Wang, Yanjie Zhao, Zhao Liu, Quanchen Zou, Haoyu Wang
Title: SoK: Understanding Vulnerabilities in the Large Language Model Supply Chain
Abstract:
Large Language Models (LLMs) transform artificial intelligence, driving advancements in natural language understanding, text generation, and autonomous systems. The increasing complexity of their development and deployment introduces significant security challenges, particularly within the LLM supply chain. However, existing research primarily focuses on content safety, such as adversarial attacks, jailbreaking, and backdoor attacks, while overlooking security vulnerabilities in the underlying software systems. To address this gap, this study systematically analyzes 529 vulnerabilities reported across 75 prominent projects spanning 13 lifecycle stages. The findings show that vulnerabilities are concentrated in the application (50.3%) and model (42.7%) layers, with improper resource control (45.7%) and improper neutralization (25.1%) identified as the leading root causes. Additionally, while 56.7% of the vulnerabilities have available fixes, 8% of these patches are ineffective, resulting in recurring vulnerabilities. This study underscores the challenges of securing the LLM ecosystem and provides actionable insights to guide future research and mitigation strategies.
中文摘要:大语言模型在其软件供应链中存在严重安全漏洞,主要问题集中在应用层和模型层,由资源控制不当和输入处理缺陷导致,凸显了加强生态系统安全的迫切需求。
English Summary: Large Language Models face significant security vulnerabilities in their software supply chain, with most issues concentrated in application and model layers due to improper resource control and neutralization, highlighting the need for improved ecosystem security.

Authors:Guangzhi Sun, Yudong Yang, Jimin Zhuang, Changli Tang, Yixuan Li, Wei Li, Zejun MA, Chao Zhang
Title: video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Abstract:
While recent advancements in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), existing efforts to improve reasoning have been limited to solving mathematical problems and focusing on visual graphical inputs, neglecting broader applications in general video understanding.This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also propose process direct preference optimization (pDPO), which leverages contrastive step selection to achieve efficient step-level reward modelling tailored for multimodal inputs. Additionally, we introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs across scenarios such as standup comedy, academic presentations, and synthetic video detection. video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks. Besides, pDPO achieves 6-8% improvements compared to the supervised fine-tuning model on RivaBench. Enhanced reasoning enables video-SALMONN-o1 zero-shot synthetic video detection capabilities.
中文: 本文提出了首个面向通用视频理解任务的开源推理增强音视频大模型video-SALMONN-o1,通过构建推理密集型数据集和提出过程直接偏好优化方法,在不同视频推理基准上实现了3-8%的准确率提升。
English: This paper introduces video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual large language model for general video understanding, which achieves significant accuracy improvements through novel methods including a reasoning-intensive dataset and process direct preference optimization.

Authors:Shaoshen Chen, Yangning Li, Zishan Xu, Yinghui Li, Xin Su, Zifei Shan, Hai-tao Zheng
Title: DAST: Context-Aware Compression in LLMs via Dynamic Allocation of Soft Tokens
Abstract:
Large Language Models (LLMs) face computational inefficiencies and redundant processing when handling long context inputs, prompting a focus on compression techniques. While existing semantic vector-based compression methods achieve promising performance, these methods fail to account for the intrinsic information density variations between context chunks, instead allocating soft tokens uniformly across context chunks. This uniform distribution inevitably diminishes allocation to information-critical regions. To address this, we propose Dynamic Allocation of Soft Tokens (DAST), a simple yet effective method that leverages the LLM's intrinsic understanding of contextual relevance to guide compression. DAST combines perplexity-based local information with attention-driven global information to dynamically allocate soft tokens to the informative-rich chunks, enabling effective, context-aware compression. Experimental results across multiple benchmarks demonstrate that DAST surpasses state-of-the-art methods.
The proposed Dynamic Allocation of Soft Tokens (DAST) method addresses LLM inefficiencies in long-context processing by dynamically allocating compression tokens to information-rich chunks, outperforming existing techniques through context-aware optimization.
English Summary:

Authors:Granite Vision Team, Leonid Karlinsky, Assaf Arbelle, Abraham Daniels, Ahmed Nassar, Amit Alfassi, Bo Wu, Eli Schwartz, Dhiraj Joshi, Jovana Kondic, Nimrod Shabtay, Pengyuan Li, Roei Herzig, Shafiq Abedin, Shaked Perek, Sivan Harary, Udi Barzelay, Adi Raz Goldfarb, Aude Oliva, Ben Wieles, Bishwaranjan Bhattacharjee, Brandon Huang, Christoph Auer, Dan Gutfreund, David Beymer, David Wood, Hilde Kuehne, Jacob Hansen, Joseph Shtok, Ken Wong, Luis Angel Bathen, Mayank Mishra, Maksym Lysak, Michele Dolfi, Mikhail Yurochkin, Nikolaos Livathinos, Nimrod Harel, Ophir Azulai, Oshri Naparstek, Rafael Teixeira de Lima, Rameswar Panda, Sivan Doveh, Shubham Gupta, Subhro Das, Syed Zawad, Yusik Kim, Zexue He, Alexander Brooks, Gabe Goodhart, Anita Govindjee, Derek Leist, Ibrahim Ibrahim, Aya Soffer, David Cox, Kate Soule, Luis Lastras, Nirmit Desai, Shila Ofek-koifman, Sriram Raghavan, Tanveer Syeda-Mahmood, Peter Staar, Tal Drory, Rogerio Feris
Title: Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence
Abstract:
We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instruction-following dataset, including document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered around visual modality alignment with a decoder-only, 2 billion parameter Granite large language model. Additionally, we introduce a dedicated safety classification approach in test-time that leverages a sparse set of attention vectors to identify potential harmful inputs. Despite its lightweight architecture, Granite Vision achieves strong results in standard benchmarks related to visual document understanding, as well as on the LiveXiv benchmark, which is designed to avoid test set contamination by using a constantly updated corpus of recently published Arxiv papers. We are releasing the model under the Apache-2 license, allowing for both research and commercial use, while offering complete visibility into the training data and other relevant details. See https://huggingface.co/ibm-granite/ for model weights.
中文摘要:Granite Vision是一款轻量级视觉语言模型,专门针对企业文档理解任务设计,具备20亿参数,在各项基准测试中表现优异,采用安全分类机制,并通过Apache-2许可证开源供商业使用。
English Summary: Granite Vision is a lightweight 2-billion-parameter vision-language model specialized for enterprise document understanding tasks, achieving strong benchmark performance while incorporating safety mechanisms and being released under Apache-2 license for commercial use.

Authors:Lirong Wu, Yunfan Liu, Haitao Lin, Yufei Huang, Guojiang Zhao, Zhifeng Gao, Stan Z. Li
Title: A Simple yet Effective DDG Predictor is An Unsupervised Antibody Optimizer and Explainer
Abstract:
The proteins that exist today have been optimized over billions of years of natural evolution, during which nature creates random mutations and selects them. The discovery of functionally promising mutations is challenged by the limited evolutionary accessible regions, i.e., only a small region on the fitness landscape is beneficial. There have been numerous priors used to constrain protein evolution to regions of landscapes with high-fitness variants, among which the change in binding free energy (DDG) of protein complexes upon mutations is one of the most commonly used priors. However, the huge mutation space poses two challenges: (1) how to improve the efficiency of DDG prediction for fast mutation screening; and (2) how to explain mutation preferences and efficiently explore accessible evolutionary regions. To address these challenges, we propose a lightweight DDG predictor (Light-DDG), which adopts a structure-aware Transformer as the backbone and enhances it by knowledge distilled from existing powerful but computationally heavy DDG predictors. Additionally, we augmented, annotated, and released a large-scale dataset containing millions of mutation data for pre-training Light-DDG. We find that such a simple yet effective Light-DDG can serve as a good unsupervised antibody optimizer and explainer. For the target antibody, we propose a novel Mutation Explainer to learn mutation preferences, which accounts for the marginal benefit of each mutation per residue. To further explore accessible evolutionary regions, we conduct preference-guided antibody optimization and evaluate antibody candidates quickly using Light-DDG to identify desirable mutations.
中文:Light-DDG作为一种轻量级预测器,通过结构感知Transformer和知识蒸馏高效筛选蛋白质突变,而突变解释器则识别有益突变以指导抗体优化。
English: Light-DDG is a lightweight predictor that efficiently screens protein mutations by using a structure-aware Transformer and knowledge distillation, while the Mutation Explainer identifies beneficial mutations to guide antibody optimization.

Authors:Qiang Zhu, Fan Zhang, Feiyu Chen, Shuyuan Zhu, David Bull, Bing Zeng
Title: FCVSR: A Frequency-aware Method for Compressed Video Super-Resolution
Abstract:
Compressed video super-resolution (SR) aims to generate high-resolution (HR) videos from the corresponding low-resolution (LR) compressed videos. Recently, some compressed video SR methods attempt to exploit the spatio-temporal information in the frequency domain, showing great promise in super-resolution performance. However, these methods do not differentiate various frequency subbands spatially or capture the temporal frequency dynamics, potentially leading to suboptimal results. In this paper, we propose a deep frequency-based compressed video SR model (FCVSR) consisting of a motion-guided adaptive alignment (MGAA) network and a multi-frequency feature refinement (MFFR) module. Additionally, a frequency-aware contrastive loss is proposed for training FCVSR, in order to reconstruct finer spatial details. The proposed model has been evaluated on three public compressed video super-resolution datasets, with results demonstrating its effectiveness when compared to existing works in terms of super-resolution performance (up to a 0.14dB gain in PSNR over the second-best model) and complexity.
中文摘要:本文提出FCVSR,一种基于深度频率的压缩视频超分辨率模型,通过运动引导对齐和多频特征优化提升重建效果,在多项测试中比现有最佳方法PSNR指标最高提升0.14分贝。
English Summary: This paper introduces FCVSR, a deep frequency-based model for compressed video super-resolution that enhances performance through motion-guided alignment and multi-frequency refinement, achieving up to 0.14dB PSNR improvement over existing methods.

Authors:Yizhang He, Kai Wang, Wenjie Zhang, Xuemin Lin, Ying Zhang
Title: Common Neighborhood Estimation over Bipartite Graphs under Local Differential Privacy
Abstract:
Bipartite graphs, formed by two vertex layers, arise as a natural fit for modeling the relationships between two groups of entities. In bipartite graphs, common neighborhood computation between two vertices on the same vertex layer is a basic operator, which is easily solvable in general settings. However, it inevitably involves releasing the neighborhood information of vertices, posing a significant privacy risk for users in real-world applications. To protect edge privacy in bipartite graphs, in this paper, we study the problem of estimating the number of common neighbors of two vertices on the same layer under edge local differential privacy (edge LDP). The problem is challenging in the context of edge LDP since each vertex on the opposite layer of the query vertices can potentially be a common neighbor. To obtain efficient and accurate estimates, we propose a multiple-round framework that significantly reduces the candidate pool of common neighbors and enables the query vertices to construct unbiased estimators locally. Furthermore, we improve data utility by incorporating the estimators built from the neighbors of both query vertices and devise privacy budget allocation optimizations. These improve the estimator's robustness and consistency, particularly against query vertices with imbalanced degrees. Extensive experiments on 15 datasets validate the effectiveness and efficiency of our proposed techniques.
中文: 本文针对二分图中边隐私保护问题,提出了一种基于边局部差分隐私的多轮框架,通过显著减少候选共同邻居并构建无偏估计器,有效提升了数据效用和估计的鲁棒性。
English: This paper addresses the privacy risks in bipartite graphs by proposing a multiple-round framework under edge local differential privacy to efficiently and accurately estimate common neighbors while incorporating optimizations for improved data utility and robustness.

Authors:Patrick Iff, Benigna Bruggmann, Maciej Besta, Luca Benini, Torsten Hoefler
Title: PlaceIT: Placement-based Inter-Chiplet Interconnect Topologies
Abstract:
2.5D integration technology is gaining traction as it copes with the exponentially growing design cost of modern integrated circuits. A crucial part of a 2.5D stacked chip is a low-latency and high-throughput inter-chiplet interconnect (ICI). Two major factors affecting the latency and throughput are the topology of links between chiplets and the chiplet placement. In this work, we present PlaceIT, a novel methodology to jointly optimize the ICI topology and the chiplet placement. While state-of-the-art methods optimize the chiplet placement for a predetermined ICI topology, or they select one topology out of a set of candidates, we generate a completely new topology for each placement. Our process of inferring placement-based ICI topologies connects chiplets that are in close proximity to each other, making it particularly attractive for chips with silicon bridges or passive silicon interposers with severely limited link lengths. We provide an open-source implementation of our method that optimizes the placement of homogeneously or heterogeneously shaped chiplets and the ICI topology connecting them for a user-defined mix of four different traffic types. We evaluate our methodology using synthetic traffic and traces, and we compare our results to a 2D mesh baseline. PlaceIT reduces the latency of synthetic L1-to-L2 and L2-to-memory traffic, the two most important types for cache coherency traffic, by up to 28% and 62%, respectively. It also achieve an average packet latency reduction of up to 18% on traffic traces. PlaceIT enables the construction of 2.5D stacked chips with low-latency ICIs.
中文: PlaceIT是一种新颖的方法,它联合优化了芯粒间互连拓扑和芯粒布局,基于邻近性生成新拓扑,显著降低了2.5D堆叠芯片的延迟并提升了性能。
English: PlaceIT is a novel methodology that jointly optimizes inter-chiplet interconnect topology and chiplet placement, generating new topologies based on proximity to significantly reduce latency and improve performance in 2.5D stacked chips.

Authors:Qi Dai, Beixiong Zheng, Qiyao Wang, Xue Xiong, Xiaodan Shao, Lipeng Zhu, Rui Zhang
Title: A Demo of Radar Sensing Aided Rotatable Antenna for Wireless Communication System
Abstract:
Rotatable antenna (RA) represents a novel antenna architecture that enhances wireless communication system performance by independently or collectively adjusting each antenna's boresight/orientation. In this demonstration, we develop a prototype of radar sensing-aided rotatable antenna that integrates radar sensing with dynamic antenna orientation to enhance wireless communication performance while maintaining low hardware costs. The proposed prototype consists of a transmitter (TX) module and a receiver (RX) module, both of which employ universal software radio peripherals (USRPs) for transmitting and receiving signals. Specifically, the TX utilizes a laser radar to detect the RX's location and conveys the angle of arrival (AoA) information to its antenna servo, which enables the RA to align its boresight direction with the identified RX. Experimental results examine the effectiveness of the proposed prototype and indicate that the RA significantly outperforms the traditional fixed-antenna system in terms of increasing received signal-to-noise ratio (SNR).
中文: 该可旋转天线原型通过集成雷达感知动态对准接收器方向,在保持低成本的同时,相比传统固定天线系统显著提升了接收信噪比。
English: The rotatable antenna prototype integrates radar sensing to dynamically align its boresight with the receiver, significantly boosting signal-to-noise ratio compared to fixed antennas while maintaining cost efficiency.

Authors:Zhaoxuan Wang, Yang Li, Jie Zhang, Xingshuo Han, Kangbo Liu, Lyu Yang, yuan Zhou, Tianwei Zhang, Quan Pan
Title: SSD: A State-based Stealthy Backdoor Attack For Navigation System in UAV Route Planning
Abstract:
Unmanned aerial vehicles (UAVs) are increasingly employed to perform high-risk tasks that require minimal human intervention. However, UAVs face escalating cybersecurity threats, particularly from GNSS spoofing attacks. While previous studies have extensively investigated the impacts of GNSS spoofing on UAVs, few have focused on its effects on specific tasks. Moreover, the influence of UAV motion states on the assessment of network security risks is often overlooked. To address these gaps, we first provide a detailed evaluation of how motion states affect the effectiveness of network attacks. We demonstrate that nonlinear motion states not only enhance the effectiveness of position spoofing in GNSS spoofing attacks but also reduce the probability of speed-related attack detection. Building upon this, we propose a state-triggered backdoor attack method (SSD) to deceive GNSS systems and assess its risk to trajectory planning tasks. Extensive validation of SSD's effectiveness and stealthiness is conducted. Experimental results show that, with appropriately tuned hyperparameters, SSD significantly increases positioning errors and the risk of task failure, while maintaining 100% stealth across three state-of-the-art detectors.
中文摘要:本研究评估了无人机运动状态对GNSS欺骗攻击效果的影响,并提出一种状态触发的后门攻击方法(SSD),能在保持完全隐蔽性的同时显著增加定位误差和任务失败风险。
English Summary: This study evaluates how UAV motion states influence GNSS spoofing attack effectiveness and introduces a state-triggered backdoor method (SSD) that significantly increases positioning errors and task failure risks while remaining undetectable by advanced detectors.

Authors:Kaishuai Xu, Tiezheng Yu, Wenjun Hou, Yi Cheng, Liangyou Li, Xin Jiang, Lifeng Shang, Qun Liu, Wenjie Li
Title: Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework
Abstract:
Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models, such as GPT-4. However, these methods are largely limited to text-based analyses under predefined general criteria, resulting in reduced adaptability for unseen instructions and demonstrating instability in evaluating adherence to quantitative and structural constraints. To address these limitations, we propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses to evaluate LLM responses. ARJudge consists of two components: a fine-tuned Analyzer that generates multi-faceted evaluation analyses and a tuning-free Refiner that combines and refines all analyses to make the final judgment. We construct a Composite Analysis Corpus that integrates tasks for evaluation criteria generation alongside text-based and code-driven analysis generation to train the Analyzer. Our results demonstrate that ARJudge outperforms existing fine-tuned evaluators in effectiveness and robustness. Furthermore, it demonstrates the importance of multi-faceted evaluation and code-driven analyses in enhancing evaluation capabilities.
中文摘要:ARJudge框架通过自适应制定评估标准并整合文本分析与代码驱动分析,解决了当前大语言模型评估器的局限性,在有效性和鲁棒性上超越了现有方法。
English Summary: The ARJudge framework addresses limitations in current LLM evaluators by adaptively formulating criteria and integrating text-based with code-driven analyses, outperforming existing methods in effectiveness and robustness.

Authors:Yuwei Yan, Yu Shang, Qingbin Zeng, Yu Li, Keyu Zhao, Zhiheng Zheng, Xuefei Ning, Tianji Wu, Shengen Yan, Yu Wang, Fengli Xu, Yong Li
Title: AgentSociety Challenge: Designing LLM Agents for User Modeling and Recommendation on Web Platforms
Abstract:
The AgentSociety Challenge is the first competition in the Web Conference that aims to explore the potential of Large Language Model (LLM) agents in modeling user behavior and enhancing recommender systems on web platforms. The Challenge consists of two tracks: the User Modeling Track and the Recommendation Track. Participants are tasked to utilize a combined dataset from Yelp, Amazon, and Goodreads, along with an interactive environment simulator, to develop innovative LLM agents. The Challenge has attracted 295 teams across the globe and received over 1,400 submissions in total over the course of 37 official competition days. The participants have achieved 21.9% and 20.3% performance improvement for Track 1 and Track 2 in the Development Phase, and 9.1% and 15.9% in the Final Phase, representing a significant accomplishment. This paper discusses the detailed designs of the Challenge, analyzes the outcomes, and highlights the most successful LLM agent designs. To support further research and development, we have open-sourced the benchmark environment at https://tsinghua-fib-lab.github.io/AgentSocietyChallenge.
中文: AgentSociety挑战赛是首个专注于利用大语言模型代理模拟用户行为并提升推荐系统的网络会议竞赛,吸引了全球团队参与并在两个赛道上取得了显著性能提升。
English: The AgentSociety Challenge is the inaugural Web Conference competition focused on leveraging LLM agents to model user behavior and improve recommender systems, attracting global participation and achieving significant performance gains across two tracks.

Authors:Minhua Lin, Hui Liu, Xianfeng Tang, Jingying Zeng, Zhenwei Dai, Chen Luo, Zheng Li, Xiang Zhang, Qi He, Suhang Wang
Title: How Far are LLMs from Real Search? A Comprehensive Study on Efficiency, Completeness, and Inherent Capabilities
Abstract:
Search plays a fundamental role in problem-solving across various domains, with most real-world decision-making problems being solvable through systematic search. Drawing inspiration from recent discussions on search and learning, we systematically explore the complementary relationship between search and Large Language Models (LLMs) from three perspectives. First, we analyze how learning can enhance search efficiency and propose Search via Learning (SeaL), a framework that leverages LLMs for effective and efficient search. Second, we further extend SeaL to SeaL-C to ensure rigorous completeness during search. Our evaluation across three real-world planning tasks demonstrates that SeaL achieves near-perfect accuracy while reducing search spaces by up to 99.1% compared to traditional approaches. Finally, we explore how far LLMs are from real search by investigating whether they can develop search capabilities independently. Our analysis reveals that while current LLMs struggle with efficient search in complex problems, incorporating systematic search strategies significantly enhances their problem-solving capabilities. These findings not only validate the effectiveness of our approach but also highlight the need for improving LLMs' search abilities for real-world applications.
中文: 本研究提出SEAL框架,利用大语言模型提升搜索效率,在规划任务中实现接近完美的准确率并将搜索空间减少高达99.1%,同时指出需要增强大语言模型在复杂问题中的独立搜索能力。
English: This study introduces the SeaL framework, which leverages Large Language Models (LLMs) to enhance search efficiency, achieving near-perfect accuracy and reducing search spaces by up to 99.1% in planning tasks, while also highlighting the need to improve LLMs' independent search capabilities for complex problems.

Authors:Yuhu Feng, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Title: Personalized Federated Learning for Egocentric Video Gaze Estimation with Comprehensive Parameter Frezzing
Abstract:
Egocentric video gaze estimation requires models to capture individual gaze patterns while adapting to diverse user data. Our approach leverages a transformer-based architecture, integrating it into a PFL framework where only the most significant parameters, those exhibiting the highest rate of change during training, are selected and frozen for personalization in client models. Through extensive experimentation on the EGTEA Gaze+ and Ego4D datasets, we demonstrate that FedCPF significantly outperforms previously reported federated learning methods, achieving superior recall, precision, and F1-score. These results confirm the effectiveness of our comprehensive parameters freezing strategy in enhancing model personalization, making FedCPF a promising approach for tasks requiring both adaptability and accuracy in federated learning settings.
中文:我们的FedCPF方法在个性化联邦学习框架中采用基于Transformer的架构,通过选择性冻结关键参数来增强个性化,在EGTEA Gaze+和Ego4D数据集上的召回率、精确率和F1分数均显著优于现有方法。
English: Our FedCPF method uses a transformer-based architecture within a personalized federated learning framework, selectively freezing key parameters to enhance personalization and significantly outperforms existing approaches on EGTEA Gaze+ and Ego4D datasets in recall, precision, and F1-score.

Authors:Lipeng Zhu, Wenyan Ma, Weidong Mei, Yong Zeng, Qingqing Wu, Boyu Ning, Zhenyu Xiao, Xiaodan Shao, Jun Zhang, Rui Zhang
Title: A Tutorial on Movable Antennas for Wireless Networks
Abstract:
Movable antenna (MA) has been recognized as a promising technology to enhance the performance of wireless communication and sensing by enabling antenna movement. Such a significant paradigm shift from conventional fixed antennas (FAs) to MAs offers tremendous new opportunities towards realizing more versatile, adaptive and efficient next-generation wireless networks such as 6G. In this paper, we provide a comprehensive tutorial on the fundamentals and advancements in the area of MA-empowered wireless networks. First, we overview the historical development and contemporary applications of MA technologies. Next, to characterize the continuous variation in wireless channels with respect to antenna position and/or orientation, we present new field-response channel models tailored for MAs, which are applicable to narrowband and wideband systems as well as far-field and near-field propagation conditions. Subsequently, we review the state-of-the-art architectures for implementing MAs and discuss their practical constraints. A general optimization framework is then formulated to fully exploit the spatial degrees of freedom (DoFs) in antenna movement for performance enhancement in wireless systems. In particular, we delve into two major design issues for MA systems. First, we address the intricate antenna movement optimization problem for various communication and/or sensing systems to maximize the performance gains achievable by MAs. Second, we deal with the challenging channel acquisition issue in MA systems for reconstructing the channel mapping between arbitrary antenna positions inside the transmitter and receiver regions. Moreover, we show existing prototypes developed for MA-aided communication/sensing and the experimental results based on them. Finally, the extension of MA design to other wireless systems and its synergy with other emerging wireless technologies are discussed.
中文: 本文全面综述了可移动天线技术,涵盖其基本原理、信道建模、优化框架及实际应用,旨在提升无线通信和感知性能,助力实现如6G等下一代网络的更高适应性与效率。
English: This paper presents a comprehensive tutorial on movable antenna (MA) technology, detailing its fundamentals, channel modeling, optimization frameworks, and practical implementations to enhance wireless communication and sensing performance for next-generation networks like 6G.

Authors:Jie Ren, Zhenwei Dai, Xianfeng Tang, Hui Liu, Jingying Zeng, Zhen Li, Rahul Goutam, Suhang Wang, Yue Xing, Qi He, Hui Liu
Title: A General Framework to Enhance Fine-tuning-based LLM Unlearning
Abstract:
Unlearning has been proposed to remove copyrighted and privacy-sensitive data from Large Language Models (LLMs). Existing approaches primarily rely on fine-tuning-based methods, which can be categorized into gradient ascent-based (GA-based) and suppression-based methods. However, they often degrade model utility (the ability to respond to normal prompts). In this work, we aim to develop a general framework that enhances the utility of fine-tuning-based unlearning methods. To achieve this goal, we first investigate the common property between GA-based and suppression-based methods. We unveil that GA-based methods unlearn by distinguishing the target data (i.e., the data to be removed) and suppressing related generations, which is essentially the same strategy employed by suppression-based methods. Inspired by this finding, we introduce Gated Representation UNlearning (GRUN) which has two components: a soft gate function for distinguishing target data and a suppression module using Representation Fine-tuning (ReFT) to adjust representations rather than model parameters. Experiments show that GRUN significantly improves the unlearning and utility. Meanwhile, it is general for fine-tuning-based methods, efficient and promising for sequential unlearning.
中文: 本文提出GRUN框架,通过区分目标数据和抑制相关生成来增强大语言模型的遗忘能力,在显著提升遗忘效果的同时保持模型正常提示的响应能力。
English: This paper introduces GRUN, a general framework that enhances unlearning in LLMs by distinguishing target data and suppressing related generations, significantly improving both unlearning effectiveness and model utility without degrading performance on normal prompts.

Authors:Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang
Title: VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing
Abstract:
Recent advancements in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing include semantic misalignment of text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. We enhance text-to-region control by amplifying each local prompt's attention to its corresponding spatial-disentangled region while minimizing interactions with irrelevant areas in cross-attention. Additionally, we improve feature separation by increasing intra-region awareness and reducing inter-region interference in self-attention. Extensive experiments demonstrate our method achieves state-of-the-art performance in real-world scenarios. Our code, data, and demos are available at https://knightyxp.github.io/VideoGrain_project_page/
中文摘要:VideoGrain提出了一种零样本方法,通过优化时空注意力机制来提升多粒度视频编辑能力,在真实场景中实现了卓越的控制效果和顶尖性能。
English Summary: VideoGrain introduces a zero-shot method that enhances multi-grained video editing by optimizing space-time attention mechanisms, achieving superior control and state-of-the-art results in real-world applications.

Authors:Liang Dai, Beixiong Zheng, Yanhua Tan, Lipeng Zhu, Fangjiong Chen, Rui Zhang
Title: Rotatable Antenna Enabled Wireless Communication System with Visual Recognition: A Prototype Implementation
Abstract:
Rotatable antenna (RA) is an emerging technology that has great potential to exploit additional spatial degrees of freedom (DoFs) by flexibly altering the three-dimensional (3D) orientation/boresight of each antenna. In this demonstration, we present a prototype of the RA-enabled wireless communication system with a visual recognition module to evaluate the performance gains provided by the RA in practical environments. In particular, a mechanically-driven RA is developed by integrating a digital servo motor, a directional antenna, and a microcontroller, which enables the dynamic adjustment of the RA orientation. Moreover, the orientation adjustment of the RA is guided by the user's direction information provided by the visual recognition module, thereby significantly enhancing system response speed and self-orientation accuracy. The experimental results demonstrate that the RA-enabled communication system achieves significant improvement in communication coverage performance compared to the conventional fixed antenna system.
中文: 可旋转天线技术通过视觉识别模块引导动态调整天线方向,相比传统固定天线系统显著提升了通信覆盖性能。
English: Rotatable antenna technology enhances wireless communication by dynamically adjusting antenna orientation using visual recognition guidance, significantly improving coverage and performance over fixed systems.

Authors:Ruikun Li, Huandong Wang, Qingmin Liao, Yong Li
Title: Predicting the Energy Landscape of Stochastic Dynamical System via Physics-informed Self-supervised Learning
Abstract:
Energy landscapes play a crucial role in shaping dynamics of many real-world complex systems. System evolution is often modeled as particles moving on a landscape under the combined effect of energy-driven drift and noise-induced diffusion, where the energy governs the long-term motion of the particles. Estimating the energy landscape of a system has been a longstanding interdisciplinary challenge, hindered by the high operational costs or the difficulty of obtaining supervisory signals. Therefore, the question of how to infer the energy landscape in the absence of true energy values is critical. In this paper, we propose a physics-informed self-supervised learning method to learn the energy landscape from the evolution trajectories of the system. It first maps the system state from the observation space to a discrete landscape space by an adaptive codebook, and then explicitly integrates energy into the graph neural Fokker-Planck equation, enabling the joint learning of energy estimation and evolution prediction. Experimental results across interdisciplinary systems demonstrate that our estimated energy has a correlation coefficient above 0.9 with the ground truth, and evolution prediction accuracy exceeds the baseline by an average of 17.65\%. The code is available at github.com/tsinghua-fib-lab/PESLA.
中文: 本文提出了一种物理信息驱动的自监督学习方法,通过自适应码本和图神经Fokker-Planck方程从系统演化轨迹中精确估计能量景观,与真实值的相关系数超过0.9,演化预测准确率比基线平均提高17.65%。
English: This paper introduces a physics-informed self-supervised learning method that accurately estimates energy landscapes from system evolution trajectories by integrating an adaptive codebook and graph neural Fokker-Planck equation, achieving over 0.9 correlation with ground truth and 17.65% higher prediction accuracy than baselines.

Authors:Won Seok Jang, Sharmin Sultana, Zonghai Yao, Hieu Tran, Zhichao Yang, Sunjae Kwon, Hong Yu
Title: Enhancing LLMs for Identifying and Prioritizing Important Medical Jargons from Electronic Health Record Notes Utilizing Data Augmentation
Abstract:
OpenNotes enables patients to access EHR notes, but medical jargon can hinder comprehension. To improve understanding, we evaluated closed- and open-source LLMs for extracting and prioritizing key medical terms using prompting, fine-tuning, and data augmentation. We assessed LLMs on 106 expert-annotated EHR notes, experimenting with (i) general vs. structured prompts, (ii) zero-shot vs. few-shot prompting, (iii) fine-tuning, and (iv) data augmentation. To enhance open-source models in low-resource settings, we used ChatGPT for data augmentation and applied ranking techniques. We incrementally increased the augmented dataset size (10 to 10,000) and conducted 5-fold cross-validation, reporting F1 score and Mean Reciprocal Rank (MRR). Our result show that fine-tuning and data augmentation improved performance over other strategies. GPT-4 Turbo achieved the highest F1 (0.433), while Mistral7B with data augmentation had the highest MRR (0.746). Open-source models, when fine-tuned or augmented, outperformed closed-source models. Notably, the best F1 and MRR scores did not always align. Few-shot prompting outperformed zero-shot in vanilla models, and structured prompts yielded different preferences across models. Fine-tuning improved zero-shot performance but sometimes degraded few-shot performance. Data augmentation performed comparably or better than other methods. Our evaluation highlights the effectiveness of prompting, fine-tuning, and data augmentation in improving model performance for medical jargon extraction in low-resource scenarios.
中文摘要:本研究证明,通过微调和数据增强能显著提升大语言模型从电子病历中提取和排序医学术语的能力,其中GPT-4 Turbo获得最高F1分数,而增强后的Mistral7B实现了最佳排序效果。
English Summary: This study demonstrates that fine-tuning and data augmentation significantly enhance large language models' ability to extract and prioritize medical terms from electronic health records, with GPT-4 Turbo achieving the highest F1 score and augmented Mistral7B attaining the best ranking performance.

Authors:Tao Fan, Guoqiang Ma, Yuanfeng Song, Lixin Fan, Kai Chen, Qiang Yang
Title: PPC-GPT: Federated Task-Specific Compression of Large Language Models via Pruning and Chain-of-Thought Distillation
Abstract:
Compressing Large Language Models (LLMs) into task-specific Small Language Models (SLMs) encounters two significant challenges: safeguarding domain-specific knowledge privacy and managing limited resources. To tackle these challenges, we propose PPC-GPT, a innovative privacy-preserving federated framework specifically designed for compressing LLMs into task-specific SLMs via pruning and Chain-of-Thought (COT) distillation. PPC-GPT works on a server-client federated architecture, where the client sends differentially private (DP) perturbed task-specific data to the server's LLM. The LLM then generates synthetic data along with their corresponding rationales. This synthetic data is subsequently used for both LLM pruning and retraining processes. Additionally, we harness COT knowledge distillation, leveraging the synthetic data to further improve the retraining of structurally-pruned SLMs. Our experimental results demonstrate the effectiveness of PPC-GPT across various text generation tasks. By compressing LLMs into task-specific SLMs, PPC-GPT not only achieves competitive performance but also prioritizes data privacy protection.
中文摘要:PPC-GPT是一种隐私保护的联邦框架,通过差分隐私、剪枝和思维链蒸馏技术将大语言模型压缩为任务专用的小模型,在保持竞争力的同时有效保护数据隐私。
English Summary: PPC-GPT is a privacy-preserving federated framework that compresses large language models into task-specific small models using differential privacy, pruning, and chain-of-thought distillation while maintaining competitive performance.

Authors:Zhuohang Long, Siyuan Wang, Shujun Liu, Yuhang Lai, Xuanjing Huang, Zhongyu Wei
Title: How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation
Abstract:
Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability. While many defense methods have been proposed, the trade-offs between safety and helpfulness, and their application to Large Vision-Language Models (LVLMs), are not well understood. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem to assess model refusal tendencies for both harmful and benign queries. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model's ability to distinguish between harmful and benign inputs. Using these mechanisms, we develop two ensemble defense strategies-inter-mechanism ensembles and intra-mechanism ensembles-to balance safety and helpfulness. Experiments on the MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these strategies effectively improve model safety or optimize the trade-off between safety and helpfulness.
中文: 本研究将越狱防御重构为二元分类任务,识别出安全偏移和危害区分两种防御机制,并开发了集成策略,有效提升大型视觉语言模型的安全性或优化安全与实用性的平衡。
English: This study reframes jailbreak defense as a binary classification task to identify safety shift and harmfulness discrimination mechanisms, developing ensemble strategies that effectively enhance model safety or optimize the safety-helpfulness balance in Large Vision-Language Models.

Authors:Leiyu Pan, Zhenpeng Su, Minxuan Lv, Yizhe Xiong, Xiangwen Zhang, Zijia Lin, Hui Chen, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Deyi Xiong
Title: Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts
Abstract:
Large language models have demonstrated exceptional performance across a wide range of tasks. However, dense models usually suffer from sparse activation, where many activation values tend towards zero (i.e., being inactivated). We argue that this could restrict the efficient exploration of model representation space. To mitigate this issue, we propose Finedeep, a deep-layered fine-grained expert architecture for dense models. Our framework partitions the feed-forward neural network layers of traditional dense models into small experts, arranges them across multiple sub-layers. A novel routing mechanism is proposed to determine each expert's contribution. We conduct extensive experiments across various model sizes, demonstrating that our approach significantly outperforms traditional dense architectures in terms of perplexity and benchmark performance while maintaining a comparable number of parameters and floating-point operations. Moreover, we find that Finedeep achieves optimal results when balancing depth and width, specifically by adjusting the number of expert sub-layers and the number of experts per sub-layer. Empirical results confirm that Finedeep effectively alleviates sparse activation and efficiently utilizes representation capacity in dense models.
中文: 提出的Finedeep架构通过将前馈神经网络分割为细粒度专家并采用新型路由机制,有效缓解了稠密模型中的稀疏激活问题,在保持计算效率的同时显著提升了模型性能。
English: The proposed Finedeep architecture addresses sparse activation in dense language models by partitioning feed-forward layers into fine-grained experts with a novel routing mechanism, significantly improving performance while maintaining computational efficiency.

Authors:Junkai Chen, Zhijie Deng, Kening Zheng, Yibo Yan, Shuliang Liu, PeiJun Wu, Peijie Jiang, Jia Liu, Xuming Hu
Title: SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning
Abstract:
As Multimodal Large Language Models (MLLMs) develop, their potential security issues have become increasingly prominent. Machine Unlearning (MU), as an effective strategy for forgetting specific knowledge in training data, has been widely used in privacy protection. However, MU for safety in MLLM has yet to be fully explored. To address this issue, we propose SAFEERASER, a safety unlearning benchmark for MLLMs, consisting of 3,000 images and 28.8K VQA pairs. We comprehensively evaluate unlearning methods from two perspectives: forget quality and model utility. Our findings show that existing MU methods struggle to maintain model performance while implementing the forget operation and often suffer from over-forgetting. Hence, we introduce Prompt Decouple (PD) Loss to alleviate over-forgetting through decouple prompt during unlearning process. To quantitatively measure over-forgetting mitigated by PD Loss, we propose a new metric called Safe Answer Refusal Rate (SARR). Experimental results demonstrate that combining PD Loss with existing unlearning methods can effectively prevent over-forgetting and achieve a decrease of 79.5% in the SARR metric of LLaVA-7B and LLaVA-13B, while maintaining forget quality and model utility. Our code and dataset will be released upon acceptance. Warning: This paper contains examples of harmful language and images, and reader discretion is recommended.
中文摘要:本文提出SAFEERASER这一多模态大语言模型安全遗忘基准,通过提示解耦损失有效缓解过度遗忘问题,在保持模型性能的同时将不安全回答率降低79.5%。
English Summary: This paper introduces SAFEERASER, a safety unlearning benchmark for Multimodal Large Language Models that addresses over-forgetting through Prompt Decouple Loss, achieving a 79.5% reduction in unsafe responses while preserving model performance.

Authors:Junkai Chen, Zhijie Deng, Kening Zheng, Yibo Yan, Shuliang Liu, PeiJun Wu, Peijie Jiang, Jia Liu, Xuming Hu
Title: SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning
Abstract:
As Multimodal Large Language Models (MLLMs) develop, their potential security issues have become increasingly prominent. Machine Unlearning (MU), as an effective strategy for forgetting specific knowledge in training data, has been widely used in privacy protection. However, MU for safety in MLLM has yet to be fully explored. To address this issue, we propose SAFEERASER, a safety unlearning benchmark for MLLMs, consisting of 3,000 images and 28.8K VQA pairs. We comprehensively evaluate unlearning methods from two perspectives: forget quality and model utility. Our findings show that existing MU methods struggle to maintain model performance while implementing the forget operation and often suffer from over-forgetting. Hence, we introduce Prompt Decouple (PD) Loss to alleviate over-forgetting through decouple prompt during unlearning process. To quantitatively measure over-forgetting mitigated by PD Loss, we propose a new metric called Safe Answer Refusal Rate (SARR). Experimental results demonstrate that combining PD Loss with existing unlearning methods can effectively prevent over-forgetting and achieve a decrease of 79.5% in the SARR metric of LLaVA-7B and LLaVA-13B, while maintaining forget quality and model utility. Our code and dataset will be released upon acceptance. Warning: This paper contains examples of harmful language and images, and reader discretion is recommended.
中文摘要:本文提出SAFEERASER这一多模态大语言模型安全遗忘基准,通过提示解耦损失有效缓解过度遗忘问题,在保持模型性能的同时将不安全回答率降低79.5%。
English Summary: This paper introduces SAFEERASER, a safety unlearning benchmark for Multimodal Large Language Models that addresses over-forgetting through Prompt Decouple Loss, achieving a 79.5% reduction in unsafe responses while preserving model performance.

Authors:Minxuan Lv, Zhenpeng Su, Leiyu Pan, Yizhe Xiong, Zijia Lin, Hui Chen, Wei Zhou, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Songlin Hu
Title: DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs
Abstract:
As large language models continue to scale, computational costs and resource consumption have emerged as significant challenges. While existing sparsification methods like pruning reduce computational overhead, they risk losing model knowledge through parameter removal. This paper proposes DSMoE (Dynamic Sparse Mixture-of-Experts), a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge based on input complexity. Additionally, we introduce a sparsity loss term to balance performance and computational efficiency. Extensive experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches across language modeling and downstream tasks, particularly excelling in generation tasks. Analysis reveals that DSMoE learns distinctive layerwise activation patterns, providing new insights for future MoE architecture design.
中文: 本文提出DSMoE方法,通过将预训练FFN层划分为计算块并采用自适应路由与稀疏性损失,在同等计算约束下实现了优于现有方法的性能,同时揭示了独特的层级激活模式。
English: This paper introduces DSMoE, a dynamic sparse mixture-of-experts method that partitions pre-trained FFN layers into computational blocks with adaptive routing and sparsity loss, achieving superior performance over existing approaches under equivalent computational constraints while revealing distinctive layerwise activation patterns.

Authors:Tao Fan, Hanlin Gu, Xuemei Cao, Chee Seng Chan, Qian Chen, Yiqiang Chen, Yihui Feng, Yang Gu, Jiaxiang Geng, Bing Luo, Shuoling Liu, Win Kent Ong, Chao Ren, Jiaqi Shao, Chuan Sun, Xiaoli Tang, Hong Xi Tae, Yongxin Tong, Shuyue Wei, Fan Wu, Wei Xi, Mingcong Xu, He Yang, Xin Yang, Jiangpeng Yan, Hao Yu, Han Yu, Teng Zhang, Yifei Zhang, Xiaojin Zhang, Zhenzhe Zheng, Lixin Fan, Qiang Yang
Title: Ten Challenging Problems in Federated Foundation Models
Abstract:
Federated Foundation Models (FedFMs) represent a distributed learning paradigm that fuses general competences of foundation models as well as privacy-preserving capabilities of federated learning. This combination allows the large foundation models and the small local domain models at the remote clients to learn from each other in a teacher-student learning setting. This paper provides a comprehensive summary of the ten challenging problems inherent in FedFMs, encompassing foundational theory, utilization of private data, continual learning, unlearning, Non-IID and graph data, bidirectional knowledge transfer, incentive mechanism design, game mechanism design, model watermarking, and efficiency. The ten challenging problems manifest in five pivotal aspects: ``Foundational Theory," which aims to establish a coherent and unifying theoretical framework for FedFMs. ``Data," addressing the difficulties in leveraging domain-specific knowledge from private data while maintaining privacy; ``Heterogeneity," examining variations in data, model, and computational resources across clients; ``Security and Privacy," focusing on defenses against malicious attacks and model theft; and ``Efficiency," highlighting the need for improvements in training, communication, and parameter efficiency. For each problem, we offer a clear mathematical definition on the objective function, analyze existing methods, and discuss the key challenges and potential solutions. This in-depth exploration aims to advance the theoretical foundations of FedFMs, guide practical implementations, and inspire future research to overcome these obstacles, thereby enabling the robust, efficient, and privacy-preserving FedFMs in various real-world applications.
中文摘要:联邦基础模型(FedFMs)融合了基础模型的通用能力与联邦学习的隐私保护优势,针对基础理论、数据利用、异构性、安全隐私及效率五大核心领域中的十大挑战展开系统研究,旨在推动分布式人工智能系统的理论发展与实际应用。
English Summary: Federated Foundation Models (FedFMs) integrate foundation models' capabilities with federated learning's privacy protection, addressing ten key challenges across foundational theory, data utilization, heterogeneity, security, and efficiency to advance robust and practical distributed AI systems.

Authors:Xin Xu, Yan Xu, Tianhao Chen, Yuchen Yan, Chengwu Liu, Zaoyu Chen, Yufei Wang, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu
Title: Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving
Abstract:
Existing approaches to mathematical reasoning with large language models (LLMs) rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated Reasoning (TIR) for precise computation. While efforts have been made to combine these methods, they primarily rely on post-selection or predefined strategies, leaving an open question: whether LLMs can autonomously adapt their reasoning strategy based on their inherent capabilities. In this work, we propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework that enables LLMs to personalize their reasoning strategy spontaneously, aligning it with their intrinsic aptitude. TATA incorporates base-LLM-aware data selection during supervised fine-tuning (SFT) to tailor training data to the model's unique abilities. This approach equips LLMs to autonomously determine and apply the appropriate reasoning strategy at test time. We evaluate TATA through extensive experiments on six mathematical reasoning benchmarks, using both general-purpose and math-specialized LLMs. Empirical results demonstrate that TATA effectively combines the complementary strengths of CoT and TIR, achieving superior or comparable performance with improved inference efficiency compared to TIR alone. Further analysis underscores the critical role of aptitude-aware data selection in enabling LLMs to make effective and adaptive reasoning decisions and align reasoning strategies with model capabilities.
中文: TATA是一种自适应框架,使大语言模型能够根据自身内在能力自主选择和应用合适的推理策略,有效结合思维链与工具集成推理,从而提升数学推理的性能与效率。
English: TATA is an adaptive framework that enables large language models to autonomously select and apply appropriate reasoning strategies based on their intrinsic capabilities, effectively combining Chain-of-Thought and Tool-Integrated Reasoning for improved mathematical reasoning performance and efficiency.

Authors:Jiamin Su, Yibo Yan, Fangteng Fu, Han Zhang, Jingheng Ye, Xiang Liu, Jiahao Huo, Huiyu Zhou, Xuming Hu
Title: EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models
Abstract:
Automated Essay Scoring (AES) plays a crucial role in educational assessment by providing scalable and consistent evaluations of writing tasks. However, traditional AES systems face three major challenges: (1) reliance on handcrafted features that limit generalizability, (2) difficulty in capturing fine-grained traits like coherence and argumentation, and (3) inability to handle multimodal contexts. In the era of Multimodal Large Language Models (MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES capabilities across lexical-, sentence-, and discourse-level traits. By leveraging MLLMs' strengths in trait-specific scoring and multimodal context understanding, EssayJudge aims to offer precise, context-rich evaluations without manual feature engineering, addressing longstanding AES limitations. Our experiments with 18 representative MLLMs reveal gaps in AES performance compared to human evaluation, particularly in discourse-level traits, highlighting the need for further advancements in MLLM-based AES research.
中文: 自动作文评分系统在泛化性、细粒度特征评估和多模态语境处理方面存在挑战,EssayJudge基准通过利用多模态大语言模型实现精确、情境丰富的评估,但实验显示在篇章层面特征上与人工评估仍存差距。
English: Automated Essay Scoring (AES) systems face challenges in generalizability, fine-grained trait evaluation, and multimodal context handling, which the proposed EssayJudge benchmark addresses by leveraging Multimodal Large Language Models (MLLMs) for precise, context-rich assessments, though experiments reveal performance gaps in discourse-level traits compared to human evaluation.

Authors:He Sun, Lipeng Zhu, Weidong Mei, Rui Zhang
Title: Power-Measurement-Based Channel Autocorrelation Estimation for IRS-Assisted Wideband Communications
Abstract:
Channel state information (CSI) is essential to the performance optimization of intelligent reflecting surface (IRS)-aided wireless communication systems. However, the passive and frequency-flat reflection of IRS, as well as the high-dimensional IRS-reflected channels, have posed practical challenges for efficient IRS channel estimation, especially in wideband communication systems with significant multi-path channel delay spread. To tackle the above challenge, we propose a novel neural network (NN)-empowered IRS channel estimation and passive reflection design framework for the wideband orthogonal frequency division multiplexing (OFDM) communication system based only on the user's reference signal received power (RSRP) measurements with time-varying random IRS training reflections. In particular, we show that the average received signal power over all OFDM subcarriers at the user terminal can be represented as the prediction of a single-layer NN composed of multiple subnetworks with the same structure, such that the autocorrelation matrix of the wideband IRS channel can be recovered as their weights via supervised learning. To exploit the potential sparsity of the channel autocorrelation matrix, a progressive training method is proposed by gradually increasing the number of subnetworks until a desired accuracy is achieved, thus reducing the training complexity. Based on the estimates of IRS channel autocorrelation matrix, the IRS passive reflection is then optimized to maximize the average channel power gain over all subcarriers. Numerical results indicate the effectiveness of the proposed IRS channel autocorrelation matrix estimation and passive reflection design under wideband channels, which can achieve significant performance improvement compared to the existing IRS reflection designs based on user power measurements.
中文: 本文提出了一种基于神经网络的框架,利用用户参考信号接收功率测量,在宽带智能反射表面辅助OFDM系统中估计信道自相关矩阵并优化被动反射,有效克服信道估计难题以提升系统性能。
English: The paper introduces a neural network-based framework for estimating the channel autocorrelation matrix and optimizing passive reflection in wideband IRS-aided OFDM systems, using only RSRP measurements to enhance performance despite channel estimation challenges.

Authors:Wenxuan Wang, Xiaoyuan Liu, Kuiyi Gao, Jen-tse Huang, Youliang Yuan, Pinjia He, Shuai Wang, Zhaopeng Tu
Title: Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs
Abstract:
Multimodal Large Language Models (MLLMs) have expanded the capabilities of traditional language models by enabling interaction through both text and images. However, ensuring the safety of these models remains a significant challenge, particularly in accurately identifying whether multimodal content is safe or unsafe-a capability we term safety awareness. In this paper, we introduce MMSafeAware, the first comprehensive multimodal safety awareness benchmark designed to evaluate MLLMs across 29 safety scenarios with 1500 carefully curated image-prompt pairs. MMSafeAware includes both unsafe and over-safety subsets to assess models abilities to correctly identify unsafe content and avoid over-sensitivity that can hinder helpfulness. Evaluating nine widely used MLLMs using MMSafeAware reveals that current models are not sufficiently safe and often overly sensitive; for example, GPT-4V misclassifies 36.1% of unsafe inputs as safe and 59.9% of benign inputs as unsafe. We further explore three methods to improve safety awareness-prompting-based approaches, visual contrastive decoding, and vision-centric reasoning fine-tuning-but find that none achieve satisfactory performance. Our findings highlight the profound challenges in developing MLLMs with robust safety awareness, underscoring the need for further research in this area. All the code and data will be publicly available to facilitate future research.
中文: 本文提出了首个综合性多模态安全感知基准MMSafeAware,用于评估多模态大模型在29种安全场景中的表现,发现现有模型在区分安全与不安全内容方面存在显著不足,同时揭示了开发有效安全改进方法面临的深刻挑战。
English: This paper introduces MMSafeAware, the first comprehensive benchmark for evaluating multimodal large language models' safety awareness across 29 scenarios, revealing significant shortcomings in current models' ability to distinguish safe from unsafe content while highlighting the challenges in developing effective safety improvements.

Authors:Shafique Ahmed, Ryandhimas E. Zezario, Hui-Guan Yuan, Amir Hussain, Hsin-Min Wang, Wei-Ho Chung, Yu Tsao
Title: NeuroAMP: A Novel End-to-end General Purpose Deep Neural Amplifier for Personalized Hearing Aids
Abstract:
The prevalence of hearing aids is increasing. However, optimizing the amplification processes of hearing aids remains challenging due to the complexity of integrating multiple modular components in traditional methods. To address this challenge, we present NeuroAMP, a novel deep neural network designed for end-to-end, personalized amplification in hearing aids. NeuroAMP leverages both spectral features and the listener's audiogram as inputs, and we investigate four architectures: Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Convolutional Recurrent Neural Network (CRNN), and Transformer. We also introduce Denoising NeuroAMP, an extension that integrates noise reduction along with amplification capabilities for improved performance in real-world scenarios. To enhance generalization, a comprehensive data augmentation strategy was employed during training on diverse speech (TIMIT and TMHINT) and music (Cadenza Challenge MUSIC) datasets. Evaluation using the Hearing Aid Speech Perception Index (HASPI), Hearing Aid Speech Quality Index (HASQI), and Hearing Aid Audio Quality Index (HAAQI) demonstrates that the Transformer architecture within NeuroAMP achieves the best performance, with SRCC scores of 0.9927 (HASQI) and 0.9905 (HASPI) on TIMIT, and 0.9738 (HAAQI) on the Cadenza Challenge MUSIC dataset. Notably, our data augmentation strategy maintains high performance on unseen datasets (e.g., VCTK, MUSDB18-HQ). Furthermore, Denoising NeuroAMP outperforms both the conventional NAL-R+WDRC approach and a two-stage baseline on the VoiceBank+DEMAND dataset, achieving a 10% improvement in both HASPI (0.90) and HASQI (0.59) scores. These results highlight the potential of NeuroAMP and Denoising NeuroAMP to deliver notable improvements in personalized hearing aid amplification.
中文摘要:NeuroAMP是一种用于助听器个性化放大的深度神经网络,其Transformer架构在主要听觉指标上表现最佳,而具备降噪功能的变体相比传统方法实现了显著性能提升。
English Summary: NeuroAMP is a deep neural network that provides personalized hearing aid amplification, with its Transformer architecture achieving top performance on key auditory indices and its Denoising variant showing significant improvements over conventional methods.

Authors:Kun Wang, Zhiqiang Yan, Junkai Fan, Jun Li, Jian Yang
Title: Learning Inverse Laplacian Pyramid for Progressive Depth Completion
Abstract:
Depth completion endeavors to reconstruct a dense depth map from sparse depth measurements, leveraging the information provided by a corresponding color image. Existing approaches mostly hinge on single-scale propagation strategies that iteratively ameliorate initial coarse depth estimates through pixel-level message passing. Despite their commendable outcomes, these techniques are frequently hampered by computational inefficiencies and a limited grasp of scene context. To circumvent these challenges, we introduce LP-Net, an innovative framework that implements a multi-scale, progressive prediction paradigm based on Laplacian Pyramid decomposition. Diverging from propagation-based approaches, LP-Net initiates with a rudimentary, low-resolution depth prediction to encapsulate the global scene context, subsequently refining this through successive upsampling and the reinstatement of high-frequency details at incremental scales. We have developed two novel modules to bolster this strategy: 1) the Multi-path Feature Pyramid module, which segregates feature maps into discrete pathways, employing multi-scale transformations to amalgamate comprehensive spatial information, and 2) the Selective Depth Filtering module, which dynamically learns to apply both smoothness and sharpness filters to judiciously mitigate noise while accentuating intricate details. By integrating these advancements, LP-Net not only secures state-of-the-art (SOTA) performance across both outdoor and indoor benchmarks such as KITTI, NYUv2, and TOFDC, but also demonstrates superior computational efficiency. At the time of submission, LP-Net ranks 1st among all peer-reviewed methods on the official KITTI leaderboard.
LP-Net introduces a multi-scale Laplacian Pyramid framework that first predicts low-resolution depth for global context, then progressively refines details through novel feature and filtering modules, achieving state-of-the-art accuracy and efficiency on major benchmarks.
English Summary:

Authors:Shengbin Yue, Ting Huang, Zheng Jia, Siyuan Wang, Shujun Liu, Yun Song, Xuanjing Huang, Zhongyu Wei
Title: Multi-Agent Simulator Drives Language Models for Legal Intensive Interaction
Abstract:
Large Language Models (LLMs) have significantly advanced legal intelligence, but the scarcity of scenario data impedes the progress toward interactive legal scenarios. This paper introduces a Multi-agent Legal Simulation Driver (MASER) to scalably generate synthetic data by simulating interactive legal scenarios. Leveraging real-legal case sources, MASER ensures the consistency of legal attributes between participants and introduces a supervisory mechanism to align participants' characters and behaviors as well as addressing distractions. A Multi-stage Interactive Legal Evaluation (MILE) benchmark is further constructed to evaluate LLMs' performance in dynamic legal scenarios. Extensive experiments confirm the effectiveness of our framework.
中文摘要:本文提出MASER多智能体系统,通过模拟交互式法律场景生成合成数据以解决数据稀缺问题,并建立MILE评估基准测试大语言模型在动态法律互动中的表现,实验证实了该框架的有效性。
English Summary: This paper presents MASER, a multi-agent system that generates synthetic legal scenario data to overcome data scarcity, and introduces the MILE benchmark for evaluating LLMs in dynamic legal interactions, with experiments validating its effectiveness.

Authors:Zhijian Duan, Yusen Huo, Tianyu Wang, Zhilin Zhang, Yeshu Li, Chuan Yu, Jian Xu, Bo Zheng, Xiaotie Deng
Title: An Adaptable Budget Planner for Enhancing Budget-Constrained Auto-Bidding in Online Advertising
Abstract:
In online advertising, advertisers commonly utilize auto-bidding services to bid for impression opportunities. A typical objective of the auto-bidder is to optimize the advertiser's cumulative value of winning impressions within specified budget constraints. However, such a problem is challenging due to the complex bidding environment faced by diverse advertisers. To address this challenge, we introduce ABPlanner, a few-shot adaptable budget planner designed to improve budget-constrained auto-bidding. ABPlanner is based on a hierarchical bidding framework that decomposes the bidding process into shorter, manageable stages. Within this framework, ABPlanner allocates the budget across all stages, allowing a low-level auto-bidder to bids based on the budget allocation plan. The adaptability of ABPlanner is achieved through a sequential decision-making approach, inspired by in-context reinforcement learning. For each advertiser, ABPlanner adjusts the budget allocation plan episode by episode, using data from previous episodes as prompt for current decisions. This enables ABPlanner to quickly adapt to different advertisers with few-shot data, providing a sample-efficient solution. Extensive simulation experiments and real-world A/B testing validate the effectiveness of ABPlanner, demonstrating its capability to enhance the cumulative value achieved by auto-bidders.
Chinese: ABPlanner是一种少样本自适应预算规划器,通过分层分配预算并利用历史数据调整方案来优化自动竞价,经模拟和实际测试验证能有效提升累积价值。
English: ABPlanner is a few-shot adaptable budget planner that enhances auto-bidding by hierarchically allocating budgets across stages and adjusting plans using past data, validated through simulations and real-world tests to boost cumulative value.

Authors:Koshi Watanabe, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Title: StarMAP: Global Neighbor Embedding for Faithful Data Visualization
Abstract:
Neighbor embedding is widely employed to visualize high-dimensional data; however, it frequently overlooks the global structure, e.g., intercluster similarities, thereby impeding accurate visualization. To address this problem, this paper presents Star-attracted Manifold Approximation and Projection (StarMAP), which incorporates the advantage of principal component analysis (PCA) in neighbor embedding. Inspired by the property of PCA embedding, which can be viewed as the largest shadow of the data, StarMAP introduces the concept of \textit{star attraction} by leveraging the PCA embedding. This approach yields faithful global structure preservation while maintaining the interpretability and computational efficiency of neighbor embedding. StarMAP was compared with existing methods in the visualization tasks of toy datasets, single-cell RNA sequencing data, and deep representation. The experimental results show that StarMAP is simple but effective in realizing faithful visualizations.
中文摘要:StarMAP是一种新颖的邻居嵌入方法,通过引入主成分分析的全局结构保持优势,在保证可解释性和计算效率的同时,显著提升了高维数据可视化的准确性。
English Summary: StarMAP is a novel neighbor embedding method that integrates PCA's global structure preservation to enhance visualization accuracy while maintaining interpretability and efficiency.

Authors:Zeyu Tang, Zhenhao Chen, Xiangchen Song, Loka Li, Yunlong Deng, Yifan Shen, Guangyi Chen, Peter Spirtes, Kun Zhang
Title: Reflection-Window Decoding: Text Generation with Selective Refinement
Abstract:
The autoregressive decoding for text generation in large language models (LLMs), while widely used, is inherently suboptimal due to the lack of a built-in mechanism to perform refinement and/or correction of the generated content. In this paper, we consider optimality in terms of the joint probability over the generated response, when jointly considering all tokens at the same time. We theoretically characterize the potential deviation of the autoregressively generated response from its globally optimal counterpart that is of the same length. Our analysis suggests that we need to be cautious when noticeable uncertainty arises during text generation, which may signal the sub-optimality of the generation history. To address the pitfall of autoregressive decoding for text generation, we propose an approach that incorporates a sliding reflection window and a pausing criterion, such that refinement and generation can be carried out interchangeably as the decoding proceeds. Our selective refinement framework strikes a balance between efficiency and optimality, and our extensive experimental results demonstrate the effectiveness of our approach.
中文: 大型语言模型中的自回归解码常因缺乏内容修正机制而产生次优结果,为此我们提出选择性优化框架,通过滑动反思窗口和暂停标准在生成过程中交替进行优化,有效平衡了效率与最优性。
English: The autoregressive decoding in large language models often produces suboptimal results due to its inability to refine generated content, prompting the proposal of a selective refinement framework that uses a sliding reflection window and pausing criterion to enhance text generation by balancing efficiency and optimality.

Authors:Junguang Jiang, Yanwen Huang, Bin Liu, Xiaoyu Kong, Xinhang Li, Ziru Xu, Han Zhu, Jian Xu, Bo Zheng
Title: Large Language Model as Universal Retriever in Industrial-Scale Recommender System
Abstract:
In real-world recommender systems, different retrieval objectives are typically addressed using task-specific datasets with carefully designed model architectures. We demonstrate that Large Language Models (LLMs) can function as universal retrievers, capable of handling multiple objectives within a generative retrieval framework. To model complex user-item relationships within generative retrieval, we propose multi-query representation. To address the challenge of extremely large candidate sets in industrial recommender systems, we introduce matrix decomposition to boost model learnability, discriminability, and transferability, and we incorporate probabilistic sampling to reduce computation costs. Finally, our Universal Retrieval Model (URM) can adaptively generate a set from tens of millions of candidates based on arbitrary given objective while keeping the latency within tens of milliseconds. Applied to industrial-scale data, URM outperforms expert models elaborately designed for different retrieval objectives on offline experiments and significantly improves the core metric of online advertising platform by $3\%$.
中文: 大型语言模型可作为推荐系统中的通用检索器,通过多查询表示和矩阵分解有效处理多目标和海量候选集,其通用检索模型在工业应用中超越专业模型,并将在线广告核心指标提升3%。
English: Large Language Models can serve as universal retrievers in recommender systems, employing multi-query representation and matrix decomposition to efficiently handle multiple objectives and large candidate sets, with the Universal Retrieval Model outperforming specialized models and boosting online advertising metrics by 3%.

Authors:Yibo Yan, Shen Wang, Jiahao Huo, Jingheng Ye, Zhendong Chu, Xuming Hu, Philip S. Yu, Carla Gomes, Bart Selman, Qingsong Wen
Title: Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
Abstract:
Scientific reasoning, the process through which humans apply logic, evidence, and critical thinking to explore and interpret scientific phenomena, is essential in advancing knowledge reasoning across diverse fields. However, despite significant progress, current scientific reasoning models still struggle with generalization across domains and often fall short of multimodal perception. Multimodal Large Language Models (MLLMs), which integrate text, images, and other modalities, present an exciting opportunity to overcome these limitations and enhance scientific reasoning. Therefore, this position paper argues that MLLMs can significantly advance scientific reasoning across disciplines such as mathematics, physics, chemistry, and biology. First, we propose a four-stage research roadmap of scientific reasoning capabilities, and highlight the current state of MLLM applications in scientific reasoning, noting their ability to integrate and reason over diverse data types. Second, we summarize the key challenges that remain obstacles to achieving MLLM's full potential. To address these challenges, we propose actionable insights and suggestions for the future. Overall, our work offers a novel perspective on MLLM integration with scientific reasoning, providing the LLM community with a valuable vision for achieving Artificial General Intelligence (AGI).
中文: 本文主张多模态大语言模型(MLLMs)能通过整合多源数据显著提升跨学科科学推理能力,提出了四阶段研究路线图并剖析现存挑战,为实现通用人工智能提供创新视角。
English: This position paper advocates for Multimodal Large Language Models (MLLMs) as a transformative solution to enhance scientific reasoning across disciplines by integrating diverse data types, proposing a research roadmap and addressing key challenges to advance toward Artificial General Intelligence.

Authors:Yifan Shen, Peiyuan Zhu, Zijian Li, Shaoan Xie, Zeyu Tang, Namrata Deka, Zongfang Liu, Guangyi Chen, Kun Zhang
Title: Controllable Video Generation with Provable Disentanglement
Abstract:
Controllable video generation remains a significant challenge, despite recent advances in generating high-quality and consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting intricate fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose Controllable Video Generative Adversarial Networks (CoVoGAN) to disentangle the video concepts, thus facilitating efficient and independent control over individual concepts. Specifically, following the minimal change principle, we first disentangle static and dynamic latent variables. We then leverage the sufficient change property to achieve component-wise identifiability of dynamic latent variables, enabling disentangled control of video generation. To establish the theoretical foundation, we provide a rigorous analysis demonstrating the identifiability of our approach. Building on these theoretical insights, we design a Temporal Transition Module to disentangle latent dynamics. To enforce the minimal change principle and sufficient change property, we minimize the dimensionality of latent dynamic variables and impose temporal conditional independence. To validate our approach, we integrate this module as a plug-in for GANs. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that our method significantly improves generation quality and controllability across diverse real-world scenarios.
中文: CoVoGAN通过解耦静态与动态潜在变量,实现了对视频生成中独立概念的精确控制,显著提升了生成质量和可控性,适用于多样化的现实场景。
English: CoVoGAN addresses the challenge of controllable video generation by disentangling static and dynamic latent variables, enabling precise control over individual concepts and significantly improving video quality and controllability across real-world scenarios.

Authors:Yifan Shen, Peiyuan Zhu, Zijian Li, Shaoan Xie, Namrata Deka, Zongfang Liu, Zeyu Tang, Guangyi Chen, Kun Zhang
Title: Controllable Video Generation with Provable Disentanglement
Abstract:
Controllable video generation remains a significant challenge, despite recent advances in generating high-quality and consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting intricate fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose Controllable Video Generative Adversarial Networks (CoVoGAN) to disentangle the video concepts, thus facilitating efficient and independent control over individual concepts. Specifically, following the minimal change principle, we first disentangle static and dynamic latent variables. We then leverage the sufficient change property to achieve component-wise identifiability of dynamic latent variables, enabling disentangled control of video generation. To establish the theoretical foundation, we provide a rigorous analysis demonstrating the identifiability of our approach. Building on these theoretical insights, we design a Temporal Transition Module to disentangle latent dynamics. To enforce the minimal change principle and sufficient change property, we minimize the dimensionality of latent dynamic variables and impose temporal conditional independence. To validate our approach, we integrate this module as a plug-in for GANs. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that our method significantly improves generation quality and controllability across diverse real-world scenarios.
中文: CoVoGAN通过解耦静态与动态潜在变量,实现了对视频生成中独立概念的精确控制,显著提升了生成质量和可控性,适用于多样化的现实场景。
English: CoVoGAN addresses the challenge of controllable video generation by disentangling static and dynamic latent variables, enabling precise control over individual concepts and significantly improving video quality and controllability across real-world scenarios.

Authors:Yufan Chen, Ruiping Liu, Junwei Zheng, Di Wen, Kunyu Peng, Jiaming Zhang, Rainer Stiefelhagen
Title: Graph-based Document Structure Analysis
Abstract:
When reading a document, glancing at the spatial layout of a document is an initial step to understand it roughly. Traditional document layout analysis (DLA) methods, however, offer only a superficial parsing of documents, focusing on basic instance detection and often failing to capture the nuanced spatial and logical relations between instances. These limitations hinder DLA-based models from achieving a gradually deeper comprehension akin to human reading. In this work, we propose a novel graph-based Document Structure Analysis (gDSA) task. This task requires that model not only detects document elements but also generates spatial and logical relations in form of a graph structure, allowing to understand documents in a holistic and intuitive manner. For this new task, we construct a relation graph-based document structure analysis dataset (GraphDoc) with 80K document images and 4.13M relation annotations, enabling training models to complete multiple tasks like reading order, hierarchical structures analysis, and complex inter-element relation inference. Furthermore, a document relation graph generator (DRGG) is proposed to address the gDSA task, which achieves performance with 57.6% at mAP$_g$@0.5 for a strong benchmark baseline on this novel task and dataset. We hope this graphical representation of document structure can mark an innovative advancement in document structure analysis and understanding. The new dataset and code will be made publicly available at https://yufanchen96.github.io/projects/GraphDoc.
中文摘要:传统文档布局分析方法仅提供浅层解析,因此提出了一种新颖的基于图的文档结构分析任务,通过生成空间与逻辑关系图来实现整体文档理解。
English Summary: Traditional document layout analysis methods provide only superficial parsing, prompting the introduction of a novel graph-based Document Structure Analysis task that generates spatial and logical relations through a graph structure for holistic document understanding.

Authors:Sergei Kholkin, Ivan Butakov, Evgeny Burnaev, Nikita Gushchin, Alexander Korotin
Title: InfoBridge: Mutual Information estimation via Bridge Matching
Abstract:
Diffusion bridge models have recently become a powerful tool in the field of generative modeling. In this work, we leverage their power to address another important problem in machine learning and information theory, the estimation of the mutual information (MI) between two random variables. We show that by using the theory of diffusion bridges, one can construct an unbiased estimator for data posing difficulties for conventional MI estimators. We showcase the performance of our estimator on two standard MI estimation benchmarks, i.e., low-dimensional and image-based, and on real-world data, i.e., protein language model embeddings.
中文: 本研究利用扩散桥模型构建了一种新颖的无偏互信息估计器,将其巧妙构建为领域迁移问题,在标准测试基准和真实世界蛋白质语言模型嵌入数据上均展现出卓越性能。
English: This study introduces a novel unbiased mutual information estimator using diffusion bridge models, effectively addressing domain transfer challenges and demonstrating superior performance on standard benchmarks and real-world protein language model embeddings.

Authors:Sergei Kholkin, Ivan Butakov, Evgeny Burnaev, Nikita Gushchin, Alexander Korotin
Title: InfoBridge: Mutual Information estimation via Bridge Matching
Abstract:
Diffusion bridge models have recently become a powerful tool in the field of generative modeling. In this work, we leverage their power to address another important problem in machine learning and information theory, the estimation of the mutual information (MI) between two random variables. Neatly framing MI estimation as a domain transfer problem, we construct an unbiased estimator for data posing difficulties for conventional MI estimators. We showcase the performance of our estimator on three standard MI estimation benchmarks, i.e., low-dimensional, image-based and high MI, and on real-world data, i.e., protein language model embeddings.
中文: 本研究利用扩散桥模型构建了一种新颖的无偏互信息估计器,将其巧妙构建为领域迁移问题,在标准测试基准和真实世界蛋白质语言模型嵌入数据上均展现出卓越性能。
English: This study introduces a novel unbiased mutual information estimator using diffusion bridge models, effectively addressing domain transfer challenges and demonstrating superior performance on standard benchmarks and real-world protein language model embeddings.

Authors:Roman Tarasov, Petr Mokrov, Milena Gazdieva, Evgeny Burnaev, Alexander Korotin
Title: A Statistical Learning Perspective on Semi-dual Adversarial Neural Optimal Transport Solvers
Abstract:
Neural network-based optimal transport (OT) is a recent and fruitful direction in the generative modeling community. It finds its applications in various fields such as domain translation, image super-resolution, computational biology and others. Among the existing OT approaches, of considerable interest are adversarial minimax solvers based on semi-dual formulations of OT problems. While promising, these methods lack theoretical investigation from a statistical learning perspective. Our work fills this gap by establishing upper bounds on the generalization error of an approximate OT map recovered by the minimax quadratic OT solver. Importantly, the bounds we derive depend solely on some standard statistical and mathematical properties of the considered functional classes (neural nets). While our analysis focuses on the quadratic OT, we believe that similar bounds could be derived for general OT case, paving the promising direction for future research.
中文: 本研究为基于神经网络的极小极大最优传输求解器建立了泛化误差界,其理论分析仅依赖于函数类的标准统计特性,并为更广泛最优传输问题的研究开辟了新方向。
English: This study establishes generalization error bounds for neural network-based optimal transport solvers using minimax approaches, providing theoretical foundations that depend on standard statistical properties of functional classes and suggesting applicability to broader OT problems.

Authors:Roman Tarasov, Petr Mokrov, Milena Gazdieva, Evgeny Burnaev, Alexander Korotin
Title: A Statistical Learning Perspective on Semi-dual Adversarial Neural Optimal Transport Solvers
Abstract:
Neural network-based optimal transport (OT) is a recent and fruitful direction in the generative modeling community. It finds its applications in various fields such as domain translation, image super-resolution, computational biology and others. Among the existing OT approaches, of considerable interest are adversarial minimax solvers based on semi-dual formulations of OT problems. While promising, these methods lack theoretical investigation from a statistical learning perspective. Our work fills this gap by establishing upper bounds on the generalization error of an approximate OT map recovered by the minimax quadratic OT solver. Importantly, the bounds we derive depend solely on some standard statistical and mathematical properties of the considered functional classes (neural nets). While our analysis focuses on the quadratic OT, we believe that similar bounds could be derived for general OT case, paving the promising direction for future research.
中文: 本研究为基于神经网络的极小极大最优传输求解器建立了泛化误差界,其理论分析仅依赖于函数类的标准统计特性,并为更广泛最优传输问题的研究开辟了新方向。
English: This study establishes generalization error bounds for neural network-based optimal transport solvers using minimax approaches, providing theoretical foundations that depend on standard statistical properties of functional classes and suggesting applicability to broader OT problems.

Authors:Yuanhuiyi Lyu, Xu Zheng, Lutao Jiang, Yibo Yan, Xin Zou, Huiyu Zhou, Linfeng Zhang, Xuming Hu
Title: RealRAG: Retrieval-augmented Realistic Image Generation via Self-reflective Contrastive Learning
Abstract:
Recent text-to-image generative models, e.g., Stable Diffusion V3 and Flux, have achieved notable progress. However, these models are strongly restricted to their limited knowledge, a.k.a., their own fixed parameters, that are trained with closed datasets. This leads to significant hallucinations or distortions when facing fine-grained and unseen novel real-world objects, e.g., the appearance of the Tesla Cybertruck. To this end, we present the first real-object-based retrieval-augmented generation framework (RealRAG), which augments fine-grained and unseen novel object generation by learning and retrieving real-world images to overcome the knowledge gaps of generative models. Specifically, to integrate missing memory for unseen novel object generation, we train a reflective retriever by self-reflective contrastive learning, which injects the generator's knowledge into the sef-reflective negatives, ensuring that the retrieved augmented images compensate for the model's missing knowledge. Furthermore, the real-object-based framework integrates fine-grained visual knowledge for the generative models, tackling the distortion problem and improving the realism for fine-grained object generation. Our Real-RAG is superior in its modular application to all types of state-of-the-art text-to-image generative models and also delivers remarkable performance boosts with all of them, such as a gain of 16.18% FID score with the auto-regressive model on the Stanford Car benchmark.
中文摘要:提出的RealRAG框架通过检索真实世界图像来弥补生成模型的知识空白,显著提升了细粒度物体生成的逼真度,并在各类先进模型中实现了性能突破。
English Summary: The proposed RealRAG framework enhances text-to-image generation by retrieving real-world images to fill knowledge gaps, significantly improving fine-grained object realism and performance across state-of-the-art models.

Authors:Tianci Liu, Ruirui Li, Zihan Dong, Hui Liu, Xianfeng Tang, Qingyu Yin, Linjun Zhang, Haoyu Wang, Jing Gao
Title: Mitigating Heterogeneous Token Overfitting in LLM Knowledge Editing
Abstract:
Large language models (LLMs) have achieved remarkable performance on various natural language tasks. However, they are trained on static corpora and their knowledge can become outdated quickly in the fast-changing world. This motivates the development of knowledge editing (KE) to update specific knowledge in LLMs without changing unrelated others or compromising their pre-trained capabilities. Previous efforts sought to update a small amount of parameters of a LLM and proved effective for making selective updates. Nonetheless, the edited LLM often exhibits degraded ability to reason about the new knowledge. In this work, we identify a key issue: heterogeneous token overfitting (HTO), where the LLM overfits different tokens in the provided knowledge at varying rates. To tackle this, we propose OVERTONE, a token-level smoothing method that mitigates HTO by adaptively refining the target distribution. Theoretically, OVERTONE offers better parameter updates with negligible computation overhead. It also induces an implicit DPO but does not require preference data pairs. Extensive experiments across four editing methods, two LLMs, and diverse scenarios demonstrate the effectiveness and versatility of our method.
中文: 大语言模型存在知识过时及编辑后推理能力下降的问题,OVERTONE通过自适应令牌级平滑缓解异质令牌过拟合,以可忽略的计算开销提升更新精度。
English: Large language models face knowledge obsolescence and reasoning degradation after editing, which OVERTONE addresses by mitigating heterogeneous token overfitting through adaptive token-level smoothing, enhancing update precision without extra computation.

Authors:Yu He, Boheng Li, Liu Liu, Zhongjie Ba, Wei Dong, Yiming Li, Zhan Qin, Kui Ren, Chun Chen
Title: Towards Label-Only Membership Inference Attack against Pre-trained Large Language Models
Abstract:
Membership Inference Attacks (MIAs) aim to predict whether a data sample belongs to the model's training set or not. Although prior research has extensively explored MIAs in Large Language Models (LLMs), they typically require accessing to complete output logits (\ie, \textit{logits-based attacks}), which are usually not available in practice. In this paper, we study the vulnerability of pre-trained LLMs to MIAs in the \textit{label-only setting}, where the adversary can only access generated tokens (text). We first reveal that existing label-only MIAs have minor effects in attacking pre-trained LLMs, although they are highly effective in inferring fine-tuning datasets used for personalized LLMs. We find that their failure stems from two main reasons, including better generalization and overly coarse perturbation. Specifically, due to the extensive pre-training corpora and exposing each sample only a few times, LLMs exhibit minimal robustness differences between members and non-members. This makes token-level perturbations too coarse to capture such differences. To alleviate these problems, we propose \textbf{PETAL}: a label-only membership inference attack based on \textbf{PE}r-\textbf{T}oken sem\textbf{A}ntic simi\textbf{L}arity. Specifically, PETAL leverages token-level semantic similarity to approximate output probabilities and subsequently calculate the perplexity. It finally exposes membership based on the common assumption that members are `better' memorized and have smaller perplexity. We conduct extensive experiments on the WikiMIA benchmark and the more challenging MIMIR benchmark. Empirically, our PETAL performs better than the extensions of existing label-only attacks against personalized LLMs and even on par with other advanced logit-based attacks across all metrics on five prevalent open-source LLMs.
中文摘要:本文提出PETAL,一种基于逐词元语义相似度的标签式成员推断攻击方法,无需完整输出逻辑值即可有效识别预训练大语言模型中的训练数据成员,在多项指标上优于现有方法。
English Summary: This paper introduces PETAL, a label-only membership inference attack that uses per-token semantic similarity to effectively determine if data samples were used in pre-training large language models, outperforming existing methods without requiring full output logits.

Authors:Xueqing Peng, Triantafillos Papadopoulos, Efstathia Soufleri, Polydoros Giannouris, Ruoyu Xiang, Yan Wang, Lingfei Qian, Jimin Huang, Qianqian Xie, Sophia Ananiadou
Title: Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance
Abstract:
Despite Greece's pivotal role in the global economy, large language models (LLMs) remain underexplored for Greek financial context due to the linguistic complexity of Greek and the scarcity of domain-specific datasets. Previous efforts in multilingual financial natural language processing (NLP) have exposed considerable performance disparities, yet no dedicated Greek financial benchmarks or Greek-specific financial LLMs have been developed until now. To bridge this gap, we introduce Plutus-ben, the first Greek Financial Evaluation Benchmark, and Plutus-8B, the pioneering Greek Financial LLM, fine-tuned with Greek domain-specific data. Plutus-ben addresses five core financial NLP tasks in Greek: numeric and textual named entity recognition, question answering, abstractive summarization, and topic classification, thereby facilitating systematic and reproducible LLM assessments. To underpin these tasks, we present three novel, high-quality Greek financial datasets, thoroughly annotated by expert native Greek speakers, augmented by two existing resources. Our comprehensive evaluation of 22 LLMs on Plutus-ben reveals that Greek financial NLP remains challenging due to linguistic complexity, domain-specific terminology, and financial reasoning gaps. These findings underscore the limitations of cross-lingual transfer, the necessity for financial expertise in Greek-trained models, and the challenges of adapting financial LLMs to Greek text. We release Plutus-ben, Plutus-8B, and all associated datasets publicly to promote reproducible research and advance Greek financial NLP, fostering broader multilingual inclusivity in finance.
Chinese: 由于希腊语的语言复杂性和领域特定数据的缺乏,希腊金融自然语言处理发展不足,为此我们推出了首个希腊金融评估基准Plutus-ben和专业希腊金融大模型Plutus-8B,以填补这一空白并推动该领域发展。
English: Due to the linguistic complexity of Greek and a lack of domain-specific data, Greek financial NLP has been underdeveloped, prompting the creation of Plutus-ben, the first Greek financial benchmark, and Plutus-8B, a specialized Greek financial LLM, to address this gap and advance the field.

Authors:Shahrzad Kiani, Nupur Kulkarni, Adam Dziedzic, Stark Draper, Franziska Boenisch
Title: Differentially Private Federated Learning With Time-Adaptive Privacy Spending
Abstract:
Federated learning (FL) with differential privacy (DP) provides a framework for collaborative machine learning, enabling clients to train a shared model while adhering to strict privacy constraints. The framework allows each client to have an individual privacy guarantee, e.g., by adding different amounts of noise to each client's model updates. One underlying assumption is that all clients spend their privacy budgets uniformly over time (learning rounds). However, it has been shown in the literature that learning in early rounds typically focuses on more coarse-grained features that can be learned at lower signal-to-noise ratios while later rounds learn fine-grained features that benefit from higher signal-to-noise ratios. Building on this intuition, we propose a time-adaptive DP-FL framework that expends the privacy budget non-uniformly across both time and clients. Our framework enables each client to save privacy budget in early rounds so as to be able to spend more in later rounds when additional accuracy is beneficial in learning more fine-grained features. We theoretically prove utility improvements in the case that clients with stricter privacy budgets spend budgets unevenly across rounds, compared to clients with more relaxed budgets, who have sufficient budgets to distribute their spend more evenly. Our practical experiments on standard benchmark datasets support our theoretical results and show that, in practice, our algorithms improve the privacy-utility trade-offs compared to baseline schemes.
中文: 提出的时间自适应差分隐私联邦学习框架在轮次和客户端间非均匀分配隐私预算,允许早期节省预算以提升后期细粒度特征学习的准确性,理论和实验均验证了其优于基准方案的隐私-效用权衡改进。
English: The proposed time-adaptive differentially private federated learning framework non-uniformly allocates privacy budgets across rounds and clients, enabling early-round savings for enhanced later-round accuracy in fine-grained feature learning, with theoretical and experimental validation showing improved privacy-utility trade-offs.

Authors:Hongru Li, Hang Zhao, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief
Title: Remote Training in Task-Oriented Communication: Supervised or Self-Supervised with Fine-Tuning?
Abstract:
Task-oriented communication focuses on extracting and transmitting only the information relevant to specific tasks, effectively minimizing communication overhead. Most existing methods prioritize reducing this overhead during inference, often assuming feasible local training or minimal training communication resources. However, in real-world wireless systems with dynamic connection topologies, training models locally for each new connection is impractical, and task-specific information is often unavailable before establishing connections. Therefore, minimizing training overhead and enabling label-free, task-agnostic pre-training before the connection establishment are essential for effective task-oriented communication. In this paper, we tackle these challenges by employing a mutual information maximization approach grounded in self-supervised learning and information-theoretic analysis. We propose an efficient strategy that pre-trains the transmitter in a task-agnostic and label-free manner, followed by joint fine-tuning of both the transmitter and receiver in a task-specific, label-aware manner. Simulation results show that our proposed method reduces training communication overhead to about half that of full-supervised methods using the SGD optimizer, demonstrating significant improvements in training efficiency.
中文: 任务导向通信通过仅传输任务相关信息来最小化开销,本文提出一种自监督预训练方法,将训练通信成本较全监督方法降低约一半。
English: Task-oriented communication minimizes overhead by transmitting only task-relevant information, and this paper proposes a self-supervised pre-training method that reduces training communication costs by half compared to fully supervised approaches.

Authors:Nurkhan Laiyk, Daniil Orel, Rituraj Joshi, Maiya Goloburda, Yuxia Wang, Preslav Nakov, Fajri Koto
Title: Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh
Abstract:
Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs' understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entity of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
中文: 本研究针对哈萨克斯坦的制度与文化知识,推出了大规模人工核验的指令数据集,证明基于该数据对大语言模型进行微调,能持续提升低资源语言任务的表现。
English: This study introduces a large-scale, manually verified instruction-following dataset for Kazakhstan's institutional and cultural knowledge, demonstrating that fine-tuning LLMs with this data consistently improves performance in low-resource language tasks.

Authors:Maiya Goloburda, Nurkhan Laiyk, Diana Turmakhan, Yuxia Wang, Mukhammed Togmanov, Jonibek Mansurov, Askhat Sametov, Nurdaulet Mukhituly, Minghan Wang, Daniil Orel, Zain Muhammad Mujahid, Fajri Koto, Timothy Baldwin, Preslav Nakov
Title: Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts
Abstract:
Large language models (LLMs) are known to have the potential to generate harmful content, posing risks to users. While significant progress has been made in developing taxonomies for LLM risks and safety evaluation prompts, most studies have focused on monolingual contexts, primarily in English. However, language- and region-specific risks in bilingual contexts are often overlooked, and core findings can diverge from those in monolingual settings. In this paper, we introduce Qorgau, a novel dataset specifically designed for safety evaluation in Kazakh and Russian, reflecting the unique bilingual context in Kazakhstan, where both Kazakh (a low-resource language) and Russian (a high-resource language) are spoken. Experiments with both multilingual and language-specific LLMs reveal notable differences in safety performance, emphasizing the need for tailored, region-specific datasets to ensure the responsible and safe deployment of LLMs in countries like Kazakhstan. Warning: this paper contains example data that may be offensive, harmful, or biased.
中文: 本文介绍了Qorgau双语数据集,用于评估哈萨克语和俄语的大语言模型安全性,揭示了性能显著差异,并强调了在哈萨克斯坦等多语国家需要针对特定区域的数据以确保人工智能的安全部署。
English: This paper introduces Qorgau, a bilingual dataset for evaluating LLM safety in Kazakh and Russian, highlighting significant performance differences and the necessity of region-specific data for responsible AI deployment in countries like Kazakhstan.

Authors:Naibin Gu, Zhenyu Zhang, Xiyu Liu, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang
Title: BeamLoRA: Beam-Constraint Low-Rank Adaptation
Abstract:
Due to the demand for efficient fine-tuning of large language models, Low-Rank Adaptation (LoRA) has been widely adopted as one of the most effective parameter-efficient fine-tuning methods. Nevertheless, while LoRA improves efficiency, there remains room for improvement in accuracy. Herein, we adopt a novel perspective to assess the characteristics of LoRA ranks. The results reveal that different ranks within the LoRA modules not only exhibit varying levels of importance but also evolve dynamically throughout the fine-tuning process, which may limit the performance of LoRA. Based on these findings, we propose BeamLoRA, which conceptualizes each LoRA module as a beam where each rank naturally corresponds to a potential sub-solution, and the fine-tuning process becomes a search for the optimal sub-solution combination. BeamLoRA dynamically eliminates underperforming sub-solutions while expanding the parameter space for promising ones, enhancing performance with a fixed rank. Extensive experiments across three base models and 12 datasets spanning math reasoning, code generation, and commonsense reasoning demonstrate that BeamLoRA consistently enhances the performance of LoRA, surpassing the other baseline methods.
中文: BeamLoRA通过动态优化LoRA秩子解决方案,在多个模型和任务中显著提升了性能表现。
English: BeamLoRA enhances LoRA's performance by dynamically optimizing rank sub-solutions during fine-tuning, achieving superior results across multiple models and tasks.

Authors:Ruidong Han, Zhou Yang, Chengyan Ma, Ye Liu, Yuqing Niu, Siqi Ma, Debin Gao, David Lo
Title: AutoTEE: Automated Migration and Protection of Programs in Trusted Execution Environments
Abstract:
Trusted Execution Environments (TEEs) isolate a special space within a device's memory that is not accessible to the normal world (also known as Untrusted Environment), even when the device is compromised. Thus, developers can utilize TEEs to provide strong security guarantees for their programs, making sensitive operations like encrypted data storage, fingerprint verification, and remote attestation protected from malicious attacks. Despite the strong protections offered by TEEs, adapting existing programs to leverage such security guarantees is non-trivial, often requiring extensive domain knowledge and manual intervention, which makes TEEs less accessible to developers. This motivates us to design AutoTEE, the first Large Language Model (LLM)-enabled approach that can automatically identify, partition, transform, and port sensitive functions into TEEs with minimal developer intervention. By manually reviewing 68 repositories, we constructed a benchmark dataset consisting of 385 sensitive functions eligible for transformation, on which AutoTEE achieves a high F1 score of 0.91. AutoTEE effectively transforms these sensitive functions into their TEE-compatible counterparts, achieving success rates of 90\% and 83\% for Java and Python, respectively. We further provide a mechanism to automatically port the transformed code to different TEE platforms, including Intel SGX and AMD SEV, demonstrating that the transformed programs run successfully and correctly on these platforms.
中文:AutoTEE作为首个基于大语言模型的自动化方案,能够以91%的F1值精准识别敏感函数并转化为可信执行环境兼容代码,成功实现Java 90%与Python 83%的转换率,同时支持跨平台部署至Intel SGX和AMD SEV环境。
English: AutoTEE is an innovative LLM-based system that automatically identifies and transforms sensitive functions into TEE-compatible code with minimal developer input, achieving high accuracy rates of 90% for Java and 83% for Python while enabling seamless portability across platforms like Intel SGX and AMD SEV.

Authors:Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan, Jonibek Mansurov, Maiya Goloburda, Akhmed Sakip, Zhuohan Xie, Yuxia Wang, Bekassyl Syzdykov, Nurkhan Laiyk, Alham Fikri Aji, Ekaterina Kochmar, Preslav Nakov, Fajri Koto
Title: KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan
Abstract:
Despite having a population of twenty million, Kazakhstan's culture and language remain underrepresented in the field of natural language processing. Although large language models (LLMs) continue to advance worldwide, progress in Kazakh language has been limited, as seen in the scarcity of dedicated models and benchmark evaluations. To address this gap, we introduce KazMMLU, the first MMLU-style dataset specifically designed for Kazakh language. KazMMLU comprises 23,000 questions that cover various educational levels, including STEM, humanities, and social sciences, sourced from authentic educational materials and manually validated by native speakers and educators. The dataset includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting Kazakhstan's bilingual education system and rich local context. Our evaluation of several state-of-the-art multilingual models (Llama-3.1, Qwen-2.5, GPT-4, and DeepSeek V3) demonstrates substantial room for improvement, as even the best-performing models struggle to achieve competitive performance in Kazakh and Russian. These findings underscore significant performance gaps compared to high-resource languages. We hope that our dataset will enable further research and development of Kazakh-centric LLMs. Data and code will be made available upon acceptance.
中文摘要:哈萨克斯坦的哈萨克语在自然语言处理领域代表性不足,为此我们开发了首个哈萨克语MMLU风格数据集KazMMLU,该数据集揭示了当前多语言模型的显著性能差距,旨在推动哈萨克语中心语言模型的进一步发展。
English Summary: Kazakhstan's Kazakh language is underrepresented in natural language processing, prompting the creation of KazMMLU, the first MMLU-style dataset for Kazakh, which reveals significant performance gaps in current multilingual models and aims to spur further development of Kazakh-centric language models.

Authors:Kai Zhang, Hengtao He, Shenghui Song, Jun Zhang, Khaled B. Letaief
Title: Distributed On-Device LLM Inference With Over-the-Air Computation
Abstract:
Large language models (LLMs) have achieved remarkable success across various artificial intelligence tasks. However, their enormous sizes and computational demands pose significant challenges for the deployment on edge devices. To address this issue, we present a distributed on-device LLM inference framework based on tensor parallelism, which partitions neural network tensors (e.g., weight matrices) of LLMs among multiple edge devices for collaborative inference. Nevertheless, tensor parallelism involves frequent all-reduce operations to aggregate intermediate layer outputs across participating devices during inference, resulting in substantial communication overhead. To mitigate this bottleneck, we propose an over-the-air computation method that leverages the analog superposition property of wireless multiple-access channels to facilitate fast all-reduce operations. To minimize the average transmission mean-squared error, we investigate joint model assignment and transceiver optimization, which can be formulated as a mixed-timescale stochastic non-convex optimization problem. Then, we develop a mixed-timescale algorithm leveraging semidefinite relaxation and stochastic successive convex approximation methods. Comprehensive simulation results will show that the proposed approach significantly reduces inference latency while improving accuracy. This makes distributed on-device LLM inference practical for resource-constrained edge devices.
中文摘要:本文提出一种基于张量并行和空中计算的分布式设备端大语言模型推理框架,通过联合优化模型分配与收发机设计,显著降低边缘设备推理延迟并提升精度。
English Summary: This paper introduces a distributed on-device LLM inference framework using tensor parallelism and over-the-air computation to reduce communication overhead, significantly lowering inference latency while improving accuracy for edge devices.

Authors:Sarthak Mittal, Yoshua Bengio, Nikolay Malkin, Guillaume Lajoie
Title: In-Context Parametric Inference: Point or Distribution Estimators?
Abstract:
Bayesian and frequentist inference are two fundamental paradigms in statistical estimation. Bayesian methods treat hypotheses as random variables, incorporating priors and updating beliefs via Bayes' theorem, whereas frequentist methods assume fixed but unknown hypotheses, relying on estimators like maximum likelihood. While extensive research has compared these approaches, the frequentist paradigm of obtaining point estimates has become predominant in deep learning, as Bayesian inference is challenging due to the computational complexity and the approximation gap of posterior estimation methods. However, a good understanding of trade-offs between the two approaches is lacking in the regime of amortized estimators, where in-context learners are trained to estimate either point values via maximum likelihood or maximum a posteriori estimation, or full posteriors using normalizing flows, score-based diffusion samplers, or diagonal Gaussian approximations, conditioned on observations. To help resolve this, we conduct a rigorous comparative analysis spanning diverse problem settings, from linear models to shallow neural networks, with a robust evaluation framework assessing both in-distribution and out-of-distribution generalization on tractable tasks. Our experiments indicate that amortized point estimators generally outperform posterior inference, though the latter remain competitive in some low-dimensional problems, and we further discuss why this might be the case.
中文摘要:贝叶斯与频率推断是统计估计的两大基本范式,频率主义点估计因贝叶斯方法计算复杂而在深度学习领域占主导,但二者在摊销估计中的权衡关系尚待系统研究,需通过多场景实验验证其性能差异。
English summary: Bayesian and frequentist inference represent two core statistical paradigms, with frequentist point estimation dominating deep learning due to Bayesian methods' computational challenges, though their comparative trade-offs in amortized estimation remain underexplored across various problem settings.

Authors:Yuxia Wang, Rui Xing, Jonibek Mansurov, Giovanni Puccetti, Zhuohan Xie, Minh Ngoc Ta, Jiahui Geng, Jinyan Su, Mervat Abassy, Saad El Dine Ahmed, Kareem Elozeiri, Nurkhan Laiyk, Maiya Goloburda, Tarek Mahmoud, Raj Vardhan Tomar, Alexander Aziz, Ryuto Koike, Masahiro Kaneko, Artem Shelmanov, Ekaterina Artemova, Vladislav Mikhailov, Akim Tsvigun, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov
Title: Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI
Abstract:
Prior studies have shown that distinguishing text generated by large language models (LLMs) from human-written one is highly challenging, and often no better than random guessing. To verify the generalizability of this finding across languages and domains, we perform an extensive case study to identify the upper bound of human detection accuracy. Across 16 datasets covering 9 languages and 9 domains, 19 annotators achieved an average detection accuracy of 87.6\%, thus challenging previous conclusions. We find that major gaps between human and machine text lie in concreteness, cultural nuances, and diversity. Prompting by explicitly explaining the distinctions in the prompts can partially bridge the gaps in over 50\% of the cases. However, we also find that humans do not always prefer human-written text, particularly when they cannot clearly identify its source.
中文摘要:本研究通过对9种语言和9个领域的广泛分析,推翻了先前关于区分机器生成与人类撰写文本仅能随机猜测的结论,证明人类识别准确率可达87.6%,主要依据具体性、文化细微差别和多样性差异,但明确提示仅能部分弥补这些差距。
English Summary: This study challenges prior claims that distinguishing LLM-generated text from human writing is no better than random, demonstrating through cross-linguistic analysis that humans can achieve 87.6% detection accuracy by identifying differences in concreteness, cultural nuances, and diversity, though explicit prompting only partially bridges these gaps.

Authors:Sam Lin, Wenyue Hua, Lingyao Li, Zhenting Wang, Yongfeng Zhang
Title: ADO: Automatic Data Optimization for Inputs in LLM Prompts
Abstract:
This study explores a novel approach to enhance the performance of Large Language Models (LLMs) through the optimization of input data within prompts. While previous research has primarily focused on refining instruction components and augmenting input data with in-context examples, our work investigates the potential benefits of optimizing the input data itself. We introduce a two-pronged strategy for input data optimization: content engineering and structural reformulation. Content engineering involves imputing missing values, removing irrelevant attributes, and enriching profiles by generating additional information inferred from existing attributes. Subsequent to content engineering, structural reformulation is applied to optimize the presentation of the modified content to LLMs, given their sensitivity to input format. Our findings suggest that these optimizations can significantly improve the performance of LLMs in various tasks, offering a promising avenue for future research in prompt engineering. The source code is available at https://anonymous.4open.science/r/ADO-6BC5/
中文摘要:本研究提出了一种新颖的双重策略——内容工程和结构重构,通过优化提示中的输入数据,显著提升大型语言模型在多种任务中的性能表现。
English Summary: This research introduces a novel two-step method—content engineering and structural reformulation—to optimize input data in prompts, significantly enhancing Large Language Models' performance across various tasks.

Authors:Guojun Xiong, Zhiyang Deng, Keyi Wang, Yupeng Cao, Haohang Li, Yangyang Yu, Xueqing Peng, Mingquan Lin, Kaleb E Smith, Xiao-Yang Liu, Jimin Huang, Sophia Ananiadou, Qianqian Xie
Title: FLAG-Trader: Fusion LLM-Agent with Gradient-based Reinforcement Learning for Financial Trading
Abstract:
Large language models (LLMs) fine-tuned on multimodal financial data have demonstrated impressive reasoning capabilities in various financial tasks. However, they often struggle with multi-step, goal-oriented scenarios in interactive financial markets, such as trading, where complex agentic approaches are required to improve decision-making. To address this, we propose \textsc{FLAG-Trader}, a unified architecture integrating linguistic processing (via LLMs) with gradient-driven reinforcement learning (RL) policy optimization, in which a partially fine-tuned LLM acts as the policy network, leveraging pre-trained knowledge while adapting to the financial domain through parameter-efficient fine-tuning. Through policy gradient optimization driven by trading rewards, our framework not only enhances LLM performance in trading but also improves results on other financial-domain tasks. We present extensive empirical evidence to validate these enhancements.
Chinese: 提出的FLAG-Trader架构通过将语言模型与强化学习相结合,利用策略梯度优化来提升金融交易及其他任务的决策能力。
English: The proposed FLAG-Trader architecture combines language models with reinforcement learning to enhance decision-making in financial trading and other tasks through policy gradient optimization.

Authors:Qingchao Li, Mohammed El-Hajjar, Chao Xu, Jiancheng An, Chau Yuen, Lajos Hanzo
Title: Stacked Intelligent Metasurface-Based Transceiver Design for Near-Field Wideband Systems
Abstract:
Intelligent metasurfaces may be harnessed for realizing efficient holographic multiple-input and multiple-output (MIMO) systems, at a low hardware-cost and high energy-efficiency. As part of this family, we propose a hybrid beamforming design for stacked intelligent metasurfaces (SIM) aided wideband wireless systems relying on the near-field channel model. Specifically, the holographic beamformer is designed based on configuring the phase shifts in each layer of the SIM for maximizing the sum of the baseband eigen-channel gains of all users. To optimize the SIM phase shifts, we propose a layer-by-layer iterative algorithm for optimizing the phase shifts in each layer alternately. Then, the minimum mean square error (MMSE) transmit precoding method is employed for the digital beamformer to support multi-user access. Furthermore, the mitigation of the SIM phase tuning error is also taken into account in the digital beamformer by exploiting its statistics. The power sharing ratio of each user is designed based on the iterative waterfilling power allocation algorithm. Additionally, our analytical results indicate that the spectral efficiency attained saturates in the high signal-to-noise ratio (SNR) region due to the phase tuning error resulting from the imperfect SIM hardware quality. The simulation results show that the SIM-aided holographic MIMO outperforms the state-of-the-art (SoA) single-layer holographic MIMO in terms of its achievable rate. We further demonstrate that the near-field channel model allows the SIM-based transceiver design to support multiple users, since the spatial resources represented both by the angle domain and the distance domain can be exploited.
中文: 智能超表面可实现低成本、高能效的全息MIMO系统,通过堆叠智能超表面的分层波束成形设计优化相位偏移,利用近场信道模型支持多用户接入,在实现速率上优于现有单层系统。
English: Intelligent metasurfaces enable cost-effective and energy-efficient holographic MIMO systems, with a proposed hybrid beamforming design using stacked intelligent metasurfaces to optimize phase shifts and support multi-user access in near-field channels, outperforming current single-layer systems in achievable rate.

Authors:Zhaoqian Xue, Guanhong Liu, Kai Wei, Chong Zhang, Qingcheng Zeng, Songhua Hu, Wenyue Hua, Lizhou Fan, Yongfeng Zhang, Lingyao Li
Title: Toward Equitable Access: Leveraging Crowdsourced Reviews to Investigate Public Perceptions of Health Resource Accessibility
Abstract:
Access to health resources is a critical determinant of public well-being and societal resilience, particularly during public health crises when demand for medical services and preventive care surges. However, disparities in accessibility persist across demographic and geographic groups, raising concerns about equity. Traditional survey methods often fall short due to limitations in coverage, cost, and timeliness. This study leverages crowdsourced data from Google Maps reviews, applying advanced natural language processing techniques, specifically ModernBERT, to extract insights on public perceptions of health resource accessibility in the United States during the COVID-19 pandemic. Additionally, we employ Partial Least Squares regression to examine the relationship between accessibility perceptions and key socioeconomic and demographic factors including political affiliation, racial composition, and educational attainment. Our findings reveal that public perceptions of health resource accessibility varied significantly across the U.S., with disparities peaking during the pandemic and slightly easing post-crisis. Political affiliation, racial demographics, and education levels emerged as key factors shaping these perceptions. These findings underscore the need for targeted interventions and policy measures to address inequities, fostering a more inclusive healthcare infrastructure that can better withstand future public health challenges.
本研究利用谷歌地图评论和自然语言处理技术分析COVID-19期间美国医疗资源可及性,发现受政治倾向、种族构成和教育水平影响的差异性,提示需要针对性政策干预。
This study uses Google Maps reviews and NLP to analyze U.S. health resource accessibility during COVID-19, revealing disparities influenced by political, racial, and educational factors that call for targeted policy interventions.

Authors:Łukasz Staniszewski, Bartosz Cywiński, Franziska Boenisch, Kamil Deja, Adam Dziedzic
Title: Precise Parameter Localization for Textual Generation in Diffusion Models
Abstract:
Novel diffusion models can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that only less than 1% of diffusion models' parameters, all contained in attention layers, influence the generation of textual content within the images. Building on this observation, we improve textual generation efficiency and performance by targeting cross and joint attention layers of diffusion models. We introduce several applications that benefit from localizing the layers responsible for textual content generation. We first show that a LoRA-based fine-tuning solely of the localized layers enhances, even more, the general text-generation capabilities of large diffusion models while preserving the quality and diversity of the diffusion models' generations. Then, we demonstrate how we can use the localized layers to edit textual content in generated images. Finally, we extend this idea to the practical use case of preventing the generation of toxic text in a cost-free manner. In contrast to prior work, our localization approach is broadly applicable across various diffusion model architectures, including U-Net (e.g., LDM and SDXL) and transformer-based (e.g., DeepFloyd IF and Stable Diffusion 3), utilizing diverse text encoders (e.g., from CLIP to the large language models like T5). Project page available at https://t2i-text-loc.github.io/.
中文摘要:新型扩散模型能生成融合高质量文本的逼真图像,研究发现仅需调整注意力层中不足1%的参数即可控制文本生成,从而实现了针对性优化及文本编辑、防毒性内容等应用,适用于多种模型架构。
English Summary: Novel diffusion models can generate realistic images with integrated text, and researchers have discovered that less than 1% of parameters in attention layers control text generation, enabling targeted improvements and applications like text editing and toxicity prevention across various model architectures.

Authors:Ang Li, Yichuan Mo, Mingjie Li, Yifei Wang, Yisen Wang
Title: Are Smarter LLMs Safer? Exploring Safety-Reasoning Trade-offs in Prompting and Fine-Tuning
Abstract:
Large Language Models (LLMs) have demonstrated remarkable success across various NLP benchmarks. However, excelling in complex tasks that require nuanced reasoning and precise decision-making demands more than raw language proficiency--LLMs must reason, i.e., think logically, draw from past experiences, and synthesize information to reach conclusions and take action. To enhance reasoning abilities, approaches such as prompting and fine-tuning have been widely explored. While these methods have led to clear improvements in reasoning, their impact on LLM safety remains less understood. In this work, we investigate the interplay between reasoning and safety in LLMs. We highlight the latent safety risks that arise as reasoning capabilities improve, shedding light on previously overlooked vulnerabilities. At the same time, we explore how reasoning itself can be leveraged to enhance safety, uncovering potential mitigation strategies. By examining both the risks and opportunities in reasoning-driven LLM safety, our study provides valuable insights for developing models that are not only more capable but also more trustworthy in real-world deployments.
Chinese: 本研究探讨了大语言模型推理能力与安全性之间的双重关系,发现增强推理可能引发潜在安全风险,同时也为提升模型可信度提供了可能的解决策略。
English: This study explores the dual relationship between reasoning capabilities and safety in Large Language Models, revealing that enhanced reasoning can introduce latent security risks while also offering potential strategies to improve model trustworthiness.

Authors:Lingfei Qian, Weipeng Zhou, Yan Wang, Xueqing Peng, Han Yi, Yilun Zhao, Jimin Huang, Qianqian Xie, Jian-yun Nie
Title: Fino1: On the Transferability of Reasoning-Enhanced LLMs and Reinforcement Learning to Finance
Abstract:
As the fundamental capability behind decision-making in finance, financial reasoning poses distinct challenges for LLMs. Although reinforcement learning (RL) have boosted generic reasoning, the progress in finance is hindered by the absence of empirical study of building effective financial chain-of-thought (CoT) corpus, a systematic comparison of different RL methods, and comprehensive benchmarks. To address these gaps, we introduce FinCoT, the first open high-fidelity CoT corpus for finance, distilled from seven QA datasets by a novel three-stage pipeline that incorporates domain supervision, iterative LLM refinement, and difficulty-aware filtering. Based on FinCoT, we develop Fin-o1, the first open financial reasoning models trained via supervised fine-tuning and GRPO-based RL. Our models outperform existing financial reasoning models and SOTA general models such as GPT-o1, DeepSeek-R1, and GPT-4.5. We also investigate the effectiveness of three different RL methods in improving domain-specific reasoning, offering the first such empirical study. We finally propose FinReason, the first financial reasoning benchmark covering multi-table analysis, long-context reasoning, and equation-based tasks, and evaluate 29 LLMs. Our extensive experiments reveal general reasoning models excel on standard benchmarks yet exhibit obvious performance degradation in financial contexts; even finance-tuned models like Dianjin-R1 and FinR1 degrade on lengthy documents. In contrast, our Fin-o1 models consistently outperform their backbones and larger GPT-o1 and DeepSeek-R1, confirming the effectiveness of our data building and model training strategy. Our study further shows that GRPO yields reliable gains whereas PPO and DPO do not, highlighting the need for targeted data and optimisation rather than scale alone.
中文: 本研究推出了首个开放式高保真金融思维链语料库FinCoT,并通过监督微调和基于GRPO的强化学习开发出超越现有金融及通用推理模型的Fin-o1模型,同时提出FinReason基准来系统评估金融推理能力。
English: This study introduces FinCoT, the first open high-fidelity chain-of-thought corpus for finance, and develops Fin-o1 models that outperform existing financial and general reasoning models through supervised fine-tuning and GRPO-based reinforcement learning, while also proposing the FinReason benchmark to evaluate financial reasoning capabilities.

Authors:Wenhao Wang, Adam Dziedzic, Grace C. Kim, Michael Backes, Franziska Boenisch
Title: Captured by Captions: On Memorization and its Mitigation in CLIP Models
Abstract:
Multi-modal models, such as CLIP, have demonstrated strong performance in aligning visual and textual representations, excelling in tasks like image retrieval and zero-shot classification. Despite this success, the mechanisms by which these models utilize training data, particularly the role of memorization, remain unclear. In uni-modal models, both supervised and self-supervised, memorization has been shown to be essential for generalization. However, it is not well understood how these findings would apply to CLIP, which incorporates elements from both supervised learning via captions that provide a supervisory signal similar to labels, and from self-supervised learning via the contrastive objective. To bridge this gap in understanding, we propose a formal definition of memorization in CLIP (CLIPMem) and use it to quantify memorization in CLIP models. Our results indicate that CLIP's memorization behavior falls between the supervised and self-supervised paradigms, with "mis-captioned" samples exhibiting highest levels of memorization. Additionally, we find that the text encoder contributes more to memorization than the image encoder, suggesting that mitigation strategies should focus on the text domain. Building on these insights, we propose multiple strategies to reduce memorization while at the same time improving utility--something that had not been shown before for traditional learning paradigms where reducing memorization typically results in utility decrease.
中文: 本研究定义了CLIP模型中的记忆机制(CLIPMem),发现其记忆行为介于监督与自监督学习之间,其中文本编码器对记忆贡献更大,并提出可在不降低模型效用的情况下减少记忆的策略。
English: This study defines memorization in CLIP models (CLIPMem) and finds that their memorization behavior lies between supervised and self-supervised learning, with text encoders contributing more to memorization, while proposing strategies that reduce memorization without sacrificing utility.

Authors:Jiajun Shi, Chaoren Wei, Liqun Yang, Zekun Moore Wang, Chenghao Yang, Ge Zhang, Stephen Huang, Tao Peng, Jian Yang, Zhoufutu Wen
Title: CryptoX : Compositional Reasoning Evaluation of Large Language Models
Abstract:
The compositional reasoning capacity has long been regarded as critical to the generalization and intelligence emergence of large language models LLMs. However, despite numerous reasoning-related benchmarks, the compositional reasoning capacity of LLMs is rarely studied or quantified in the existing benchmarks. In this paper, we introduce CryptoX, an evaluation framework that, for the first time, combines existing benchmarks and cryptographic, to quantify the compositional reasoning capacity of LLMs. Building upon CryptoX, we construct CryptoBench, which integrates these principles into several benchmarks for systematic evaluation. We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a huge gap between open-source and closed-source LLMs. We further conduct thorough mechanical interpretability experiments to reveal the inner mechanism of LLMs' compositional reasoning, involving subproblem decomposition, subproblem inference, and summarizing subproblem conclusions. Through analysis based on CryptoBench, we highlight the value of independently studying compositional reasoning and emphasize the need to enhance the compositional reasoning capabilities of LLMs.
Chinese: 本文提出CryptoX评估框架,首次通过结合密码学原理构建CryptoBench来量化大语言模型的组合推理能力,实验揭示了开源与闭源模型间的显著差距,并强调提升组合推理能力的必要性。
English: This paper introduces CryptoX, a novel evaluation framework that quantifies the compositional reasoning capacity of large language models (LLMs) through CryptoBench, revealing a significant performance gap between open-source and closed-source models and emphasizing the need for enhanced reasoning capabilities.

Authors:Yuyang Wu, Yifei Wang, Ziyu Ye, Tianqi Du, Stefanie Jegelka, Yisen Wang
Title: When More is Less: Understanding Chain-of-Thought Length in LLMs
Abstract:
Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper challenges that notion, arguing that longer is not always better. Drawing on combined evidence from real-world observations, controlled experiments, and theoretical analysis, we demonstrate that task accuracy typically follows an inverted U-shaped curve with CoT length, where performance initially improves but eventually decreases as the number of CoT steps increases. With controlled experiments, we further uncover the scaling behaviors of the optimal CoT length: it increases with task difficulty but decreases with model capability, exposing an inherent simplicity bias where more capable models favor shorter, more efficient CoT reasoning. This bias is also evident in Reinforcement Learning (RL) training, where models gravitate towards shorter CoTs as their accuracy improves. To have a deep understanding of these dynamics, we establish a simple theoretical model that formally proves these phenomena, including the optimal length's scaling laws and the emergence of simplicity bias during RL. Guided by this framework, we demonstrate significant practical benefits from training with optimally-lengthed CoTs and employing length-aware filtering at inference. These findings offer both a principled understanding of the "overthinking" phenomenon and multiple practical guidelines for CoT calibration, enabling LLMs to achieve optimal reasoning performance with adaptive CoTs tailored to task complexity and model capability.
中文: 本研究推翻思维链越长越有效的假设,揭示任务表现与推理长度呈倒U型关系,最优长度随任务难度增加而增加、随模型能力增强而减少,并提出了适配不同场景的实用校准方案。
English: This study refutes the assumption that longer Chain-of-Thought reasoning always improves performance, revealing an inverted U-shaped relationship where optimal length depends on task difficulty and model capability, with practical applications for adaptive reasoning calibration.

Authors:Mengxi Xiao, Zihao Jiang, Lingfei Qian, Zhengyu Chen, Yueru He, Yijing Xu, Yuecheng Jiang, Dong Li, Ruey-Ling Weng, Min Peng, Jimin Huang, Sophia Ananiadou, Qianqian Xie
Title: Retrieval-augmented Large Language Models for Financial Time Series Forecasting
Abstract:
Accurately forecasting stock price movements is critical for informed financial decision-making, supporting applications ranging from algorithmic trading to risk management. However, this task remains challenging due to the difficulty of retrieving subtle yet high-impact patterns from noisy financial time-series data, where conventional retrieval methods, whether based on generic language models or simplistic numeric similarity, often fail to capture the intricate temporal dependencies and context-specific signals essential for precise market prediction. To bridge this gap, we introduce FinSrag, the first retrieval-augmented generation (RAG) framework with a novel domain-specific retriever FinSeer for financial time-series forecasting. FinSeer leverages a candidate selection mechanism refined by LLM feedback and a similarity-driven training objective to align queries with historically influential sequences while filtering out financial noise. Such training enables FinSeer to identify the most relevant time-series data segments for downstream forecasting tasks, unlike embedding or distance-based retrieval methods used in existing RAG frameworks. The retrieved patterns are then fed into StockLLM, a 1B-parameter LLM fine-tuned for stock movement prediction, which serves as the generative backbone. Beyond the retrieval method, we enrich the retrieval corpus by curating new datasets that integrate a broader set of financial indicators, capturing previously overlooked market dynamics. Experiments demonstrate that FinSeer outperforms existing textual retrievers and traditional distance-based retrieval approaches in enhancing the prediction accuracy of StockLLM, underscoring the importance of domain-specific retrieval frameworks in handling the complexity of financial time-series data.
中文: FinSrag提出了首个针对金融时间序列预测的检索增强生成框架,其FinSeer检索器通过筛选噪声和识别关键时序模式,结合微调的StockLLM模型显著提升了股价预测的准确性。
English: FinSrag introduces a domain-specific retrieval framework with FinSeer to enhance stock forecasting by filtering financial noise and identifying relevant time-series patterns, improving prediction accuracy when integrated with the fine-tuned StockLLM model.

Authors:Aditya Kumar, Tom Blanchard, Adam Dziedzic, Franziska Boenisch
Title: Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images
Abstract:
State-of-the-art Diffusion Models (DMs) produce highly realistic images. While prior work has successfully mitigated Not Safe For Work (NSFW) content in the visual domain, we identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms, posing significant risks to users. We show that all state-of-the-art DMs (e.g., SD3, SDXL, Flux, DeepFloyd IF) are vulnerable to this issue. Through extensive experiments, we demonstrate that existing mitigation techniques, effective for visual content, fail to prevent harmful text generation while substantially degrading benign text generation. As an initial step toward addressing this threat, we introduce a novel fine-tuning strategy that targets only the text-generation layers in DMs. Therefore, we construct a safety fine-tuning dataset by pairing each NSFW prompt with two images: one with the NSFW term, and another where that term is replaced with a carefully crafted benign alternative while leaving the image unchanged otherwise. By training on this dataset, the model learns to avoid generating harmful text while preserving benign content and overall image quality. Finally, to advance research in the area, we release ToxicBench, an open-source benchmark for evaluating NSFW text generation in images. It includes our curated fine-tuning dataset, a set of harmful prompts, new evaluation metrics, and a pipeline that assesses both NSFW-ness and text and image quality. Our benchmark aims to guide future efforts in mitigating NSFW text generation in text-to-image models, thereby contributing to their safe deployment. The benchmark is available online for download.
中文摘要:现有先进扩散模型存在在图像中生成有害文本的安全隐患,本文提出一种针对文本生成层的微调方法以消除风险同时保持图像质量,并发布了开源基准ToxicBench以推动相关研究。
English Summary: State-of-the-art diffusion models are vulnerable to generating harmful text embedded in images, and a new fine-tuning method is introduced to mitigate this while preserving image quality, alongside the release of an open-source benchmark called ToxicBench.

Authors:Baiyang Liu, Kin-Fai Tong, Kai-Kit Wong, Chan-Byoung Chae, Hang Wong
Title: Be Water, My Antennas: Riding on Radio Wave Fluctuation in Nature for Spatial Multiplexing using Programmable Meta-Fluid Antenna
Abstract:
Interference and scattering, often deemed undesirable, are inevitable in wireless communications, especially when the current mobile networks and upcoming sixth generation (6G) have turned into ultra-dense networks. Current approaches relying on multiple-input multiple-output (MIMO) combined with artificial-intelligence-aided (AI) signal processing have drawbacks of being power-hungry and requiring wide bandwidth that raise scalability concerns. In this article, we take a radical approach and utilize the channel fading phenomenon to our advantage. Specifically, we propose a novel meta-fluid antenna architecture, referred to as the `fluid' antenna system (FAS), that can freely surf on radio wave fluctuations, like `fluid' figuratively speaking, with fine resolution in space to opportunistically avoid interference, eliminating the need for expensive signal processing. Our experimental results demonstrate that under rich scattering conditions, the proposed meta-fluidic architecture is able to exploit the natural ups and downs of radio waves in space for spatial multiplexing. These breakthrough results show that scattering can be desirable not harmful and interference can be dodged not suppressed, fundamentally changing our perception of fading and our understanding on how interference should be managed in wireless communications networks.
The proposed fluid antenna system (FAS) leverages channel fading to opportunistically avoid interference through spatial surfing of radio wave fluctuations, fundamentally transforming interference management by making scattering beneficial rather than detrimental in wireless networks.
English Summary:

Authors:Ziwei Wang, Jie Zhou, Qin Chen, Min Zhang, Bo Jiang, Aimin Zhou, Qinchun Bai, Liang He
Title: LLM-KT: Aligning Large Language Models with Knowledge Tracing using a Plug-and-Play Instruction
Abstract:
The knowledge tracing (KT) problem is an extremely important topic in personalized education, which aims to predict whether students can correctly answer the next question based on their past question-answer records. Prior work on this task mainly focused on learning the sequence of behaviors based on the IDs or textual information. However, these studies usually fail to capture students' sufficient behavioral patterns without reasoning with rich world knowledge about questions. In this paper, we propose a large language models (LLMs)-based framework for KT, named \texttt{\textbf{LLM-KT}}, to integrate the strengths of LLMs and traditional sequence interaction models. For task-level alignment, we design Plug-and-Play instruction to align LLMs with KT, leveraging LLMs' rich knowledge and powerful reasoning capacity. For modality-level alignment, we design the plug-in context and sequence to integrate multiple modalities learned by traditional methods. To capture the long context of history records, we present a plug-in context to flexibly insert the compressed context embedding into LLMs using question-specific and concept-specific tokens. Furthermore, we introduce a plug-in sequence to enhance LLMs with sequence interaction behavior representation learned by traditional sequence models using a sequence adapter. Extensive experiments show that \texttt{\textbf{LLM-KT}} obtains state-of-the-art performance on four typical datasets by comparing it with approximately 20 strong baselines.
中文摘要:本文提出LLM-KT框架,通过整合大语言模型与传统序列模型,利用世界知识和多模态对齐来改进知识追踪任务。
English Summary: The paper introduces LLM-KT, a framework that combines large language models with traditional sequence models to enhance knowledge tracing by leveraging world knowledge and multimodal data alignment.

Authors:Chenlu Ding, Jiancan Wu, Yancheng Yuan, Cunchun Li, Xiang Wang, Dingxian Wang, Frank Yang, Andrew Rabinovich
Title: Delayed Feedback Modeling with Influence Functions
Abstract:
In online advertising under the cost-per-conversion (CPA) model, accurate conversion rate (CVR) prediction is crucial. A major challenge is delayed feedback, where conversions may occur long after user interactions, leading to incomplete recent data and biased model training. Existing solutions partially mitigate this issue but often rely on auxiliary models, making them computationally inefficient and less adaptive to user interest shifts. We propose IF-DFM, an \underline{I}nfluence \underline{F}unction-empowered for \underline{D}elayed \underline{F}eedback \underline{M}odeling which estimates the impact of newly arrived and delayed conversions on model parameters, enabling efficient updates without full retraining. By reformulating the inverse Hessian-vector product as an optimization problem, IF-DFM achieves a favorable trade-off between scalability and effectiveness. Experiments on benchmark datasets show that IF-DFM outperforms prior methods in both accuracy and adaptability.
中文: IF-DFM是一种利用影响函数估计延迟转化对模型参数影响的新方法,无需完整重训练即可高效更新转化率预测模型,在准确性和适应性上均优于现有方法。
English: IF-DFM is a novel method that uses influence functions to efficiently update conversion rate prediction models by estimating the impact of delayed conversions, achieving superior accuracy and adaptability without full retraining.

Authors:Yaxuan Kong, Yiyuan Yang, Shiyu Wang, Chenghao Liu, Yuxuan Liang, Ming Jin, Stefan Zohren, Dan Pei, Yan Liu, Qingsong Wen
Title: Position: Empowering Time Series Reasoning with Multimodal LLMs
Abstract:
Understanding time series data is crucial for multiple real-world applications. While large language models (LLMs) show promise in time series tasks, current approaches often rely on numerical data alone, overlooking the multimodal nature of time-dependent information, such as textual descriptions, visual data, and audio signals. Moreover, these methods underutilize LLMs' reasoning capabilities, limiting the analysis to surface-level interpretations instead of deeper temporal and multimodal reasoning. In this position paper, we argue that multimodal LLMs (MLLMs) can enable more powerful and flexible reasoning for time series analysis, enhancing decision-making and real-world applications. We call on researchers and practitioners to leverage this potential by developing strategies that prioritize trust, interpretability, and robust reasoning in MLLMs. Lastly, we highlight key research directions, including novel reasoning paradigms, architectural innovations, and domain-specific applications, to advance time series reasoning with MLLMs.
Chinese: 本立场文件主张利用多模态大语言模型整合多样化数据和高级推理以改进时间序列分析,并呼吁研究关注可信性、可解释性及创新应用方向。
English: This position paper advocates for using multimodal large language models (MLLMs) to enhance time series analysis by integrating diverse data types and advanced reasoning, while urging research focus on trust, interpretability, and innovative applications.

Authors:Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, Chelsea Finn
Title: Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
Abstract:
Generalist robots that can perform a range of different tasks in open-world settings must be able to not only reason about the steps needed to accomplish their goals, but also process complex instructions, prompts, and even feedback during task execution. Intricate instructions (e.g., "Could you make me a vegetarian sandwich?" or "I don't like that one") require not just the ability to physically perform the individual steps, but the ability to situate complex commands and feedback in the physical world. In this work, we describe a system that uses vision-language models in a hierarchical structure, first reasoning over complex prompts and user feedback to deduce the most appropriate next step to fulfill the task, and then performing that step with low-level actions. In contrast to direct instruction following methods that can fulfill simple commands ("pick up the cup"), our system can reason through complex prompts and incorporate situated feedback during task execution ("that's not trash"). We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping. Videos are available at https://www.pi.website/research/hirobot
中文: 本研究提出一种采用视觉语言模型的分层系统,使通用机器人能够解析复杂指令与实时反馈以执行任务,并在清洁、备餐等多项任务中通过多机器人平台验证了其有效性。
English: This research presents a hierarchical system using vision-language models that enables generalist robots to interpret complex instructions and feedback for task execution, demonstrating effectiveness across multiple robotic platforms in tasks like cleaning and food preparation.

Authors:Haoyang Li, Li Bai, Qingqing Ye, Haibo Hu, Yaxin Xiao, Huadi Zheng, Jianliang Xu
Title: A Sample-Level Evaluation and Generative Framework for Model Inversion Attacks
Abstract:
Model Inversion (MI) attacks, which reconstruct the training dataset of neural networks, pose significant privacy concerns in machine learning. Recent MI attacks have managed to reconstruct realistic label-level private data, such as the general appearance of a target person from all training images labeled on him. Beyond label-level privacy, in this paper we show sample-level privacy, the private information of a single target sample, is also important but under-explored in the MI literature due to the limitations of existing evaluation metrics. To address this gap, this study introduces a novel metric tailored for training-sample analysis, namely, the Diversity and Distance Composite Score (DDCS), which evaluates the reconstruction fidelity of each training sample by encompassing various MI attack attributes. This, in turn, enhances the precision of sample-level privacy assessments. Leveraging DDCS as a new evaluative lens, we observe that many training samples remain resilient against even the most advanced MI attack. As such, we further propose a transfer learning framework that augments the generative capabilities of MI attackers through the integration of entropy loss and natural gradient descent. Extensive experiments verify the effectiveness of our framework on improving state-of-the-art MI attacks over various metrics including DDCS, coverage and FID. Finally, we demonstrate that DDCS can also be useful for MI defense, by identifying samples susceptible to MI attacks in an unsupervised manner.
中文摘要:本文提出了一种新的评估指标DDCS,用于衡量模型反演攻击中的样本级隐私风险,并通过引入迁移学习框架提升攻击效果,同时展示了DDCS在无监督防御中的实用价值。
English Summary: This paper introduces a novel evaluation metric called DDCS to assess sample-level privacy risks in model inversion attacks and proposes a transfer learning framework to enhance attack effectiveness, while also demonstrating DDCS's utility for unsupervised defense.

Authors:Weilin Chen, Ruichu Cai, Yuguang Yan, Zhifeng Hao, José Miguel Hernández-Lobato
Title: Long-term Causal Inference via Modeling Sequential Latent Confounding
Abstract:
Long-term causal inference is an important but challenging problem across various scientific domains. To solve the latent confounding problem in long-term observational studies, existing methods leverage short-term experimental data. Ghassami et al. propose an approach based on the Conditional Additive Equi-Confounding Bias (CAECB) assumption, which asserts that the confounding bias in the short-term outcome is equal to that in the long-term outcome, so that the long-term confounding bias and the causal effects can be identified. While effective in certain cases, this assumption is limited to scenarios where there is only one short-term outcome with the same scale as the long-term outcome. In this paper, we introduce a novel assumption that extends the CAECB assumption to accommodate temporal short-term outcomes. Our proposed assumption states a functional relationship between sequential confounding biases across temporal short-term outcomes, under which we theoretically establish the identification of long-term causal effects. Based on the identification result, we develop an estimator and conduct a theoretical analysis of its asymptotic properties. Extensive experiments validate our theoretical results and demonstrate the effectiveness of the proposed method.
中文摘要:本文提出了一种新假设,扩展了条件可加等混杂偏倚以处理时序短期结果,从而在理论保证和实验验证的基础上实现了长期因果效应的识别与估计。
English Summary: This paper introduces a novel assumption that extends the Conditional Additive Equi-Confounding Bias to handle temporal short-term outcomes, enabling the identification and estimation of long-term causal effects with theoretical guarantees and experimental validation.

Authors:Valeria Pantè, David Axelrod, Alessandro Flammini, Filippo Menczer, Emilio Ferrara, Luca Luceri
Title: Beyond Interaction Patterns: Assessing Claims of Coordinated Inter-State Information Operations on Twitter/X
Abstract:
Social media platforms have become key tools for coordinated influence operations, enabling state actors to manipulate public opinion through strategic, collective actions. While previous research has suggested collaboration between states, such research failed to leverage state-of-the-art coordination indicators or control datasets. In this study, we investigate inter-state coordination by analyzing multiple online behavioral traces and using sophisticated coordination detection models. By incorporating a control dataset to differentiate organic user activity from coordinated efforts, our findings reveal no evidence of inter-state coordination. These results challenge earlier claims and underscore the importance of robust methodologies and control datasets in accurately detecting online coordination.
中文: 本研究通过采用先进的协调检测模型和控制数据集,反驳了先前关于社交媒体影响行动中存在国家间协调的说法,未发现合作证据,并强调了可靠方法的重要性。
English: This study refutes prior claims of inter-state coordination in social media influence operations by employing advanced coordination detection models and control datasets, revealing no collaborative evidence and emphasizing the necessity of robust methodologies.

Authors:Yedong Shen, Xinran Zhang, Yifan Duan, Shiqi Zhang, Heng Li, Yilong Wu, Jianmin Ji, Yanyong Zhang
Title: OG-Gaussian: Occupancy Based Street Gaussians for Autonomous Driving
Abstract:
Accurate and realistic 3D scene reconstruction enables the lifelike creation of autonomous driving simulation environments. With advancements in 3D Gaussian Splatting (3DGS), previous studies have applied it to reconstruct complex dynamic driving scenes. These methods typically require expensive LiDAR sensors and pre-annotated datasets of dynamic objects. To address these challenges, we propose OG-Gaussian, a novel approach that replaces LiDAR point clouds with Occupancy Grids (OGs) generated from surround-view camera images using Occupancy Prediction Network (ONet). Our method leverages the semantic information in OGs to separate dynamic vehicles from static street background, converting these grids into two distinct sets of initial point clouds for reconstructing both static and dynamic objects. Additionally, we estimate the trajectories and poses of dynamic objects through a learning-based approach, eliminating the need for complex manual annotations. Experiments on Waymo Open dataset demonstrate that OG-Gaussian is on par with the current state-of-the-art in terms of reconstruction quality and rendering speed, achieving an average PSNR of 35.13 and a rendering speed of 143 FPS, while significantly reducing computational costs and economic overhead.
中文: OG-Gaussian提出了一种创新方法,通过使用摄像头图像生成的占据栅格替代激光雷达,无需昂贵传感器或标注即可有效分离静态与动态元素,在保持顶尖重建质量和渲染速度的同时显著降低成本。
English: OG-Gaussian introduces a novel method for reconstructing dynamic driving scenes by replacing LiDAR with occupancy grids from camera images, enabling efficient separation of static and dynamic elements without costly sensors or annotations while matching state-of-the-art performance.

Authors:Ruichu Cai, Haiqin Huang, Zhifang Jiang, Zijian Li, Changze Zhou, Yuequn Liu, Yuming Liu, Zhifeng Hao
Title: Disentangling Long-Short Term State Under Unknown Interventions for Online Time Series Forecasting
Abstract:
Current methods for time series forecasting struggle in the online scenario, since it is difficult to preserve long-term dependency while adapting short-term changes when data are arriving sequentially. Although some recent methods solve this problem by controlling the updates of latent states, they cannot disentangle the long/short-term states, leading to the inability to effectively adapt to nonstationary. To tackle this challenge, we propose a general framework to disentangle long/short-term states for online time series forecasting. Our idea is inspired by the observations where short-term changes can be led by unknown interventions like abrupt policies in the stock market. Based on this insight, we formalize a data generation process with unknown interventions on short-term states. Under mild assumptions, we further leverage the independence of short-term states led by unknown interventions to establish the identification theory to achieve the disentanglement of long/short-term states. Built on this theory, we develop a long short-term disentanglement model (LSTD) to extract the long/short-term states with long/short-term encoders, respectively. Furthermore, the LSTD model incorporates a smooth constraint to preserve the long-term dependencies and an interrupted dependency constraint to enforce the forgetting of short-term dependencies, together boosting the disentanglement of long/short-term states. Experimental results on several benchmark datasets show that our \textbf{LSTD} model outperforms existing methods for online time series forecasting, validating its efficacy in real-world applications.
中文摘要:本文提出的LSTD框架通过建模未知干预,有效分离了在线时间序列预测中的长短期状态,在保持长期依赖的同时增强了对非平稳数据的适应能力。
English Summary: The proposed LSTD framework effectively disentangles long-term and short-term states in online time series forecasting by modeling unknown interventions, enabling better adaptation to nonstationary data while preserving long-term dependencies.

Authors:Zhiyi Chen, Jinyi Ye, Beverlyn Tsai, Emilio Ferrara, Luca Luceri
Title: Synthetic Politics: Prevalence, Spreaders, and Emotional Reception of AI-Generated Political Images on X
Abstract:
Despite widespread concerns about the risks of AI-generated content (AIGC) to the integrity of social media discourse, little is known about its scale and scope, the actors responsible for its dissemination online, and the user responses it elicits. In this work, we measure and characterize the prevalence, spreaders, and emotional reception of AI-generated political images. Analyzing a large-scale dataset from Twitter/X related to the 2024 U.S. Presidential Election, we find that approximately 12% of shared images are detected as AI-generated, and around 10% of users are responsible for sharing 80% of AI-generated images. AIGC superspreaders--defined as the users who not only share a high volume of AI-generated images but also receive substantial engagement through retweets--are more likely to be X Premium subscribers, have a right-leaning orientation, and exhibit automated behavior. Their profiles contain a higher proportion of AI-generated images than non-superspreaders, and some engage in extreme levels of AIGC sharing. Moreover, superspreaders' AI image tweets elicit more positive and less toxic responses than their non-AI image tweets. This study serves as one of the first steps toward understanding the role generative AI plays in shaping online socio-political environments and offers implications for platform governance.
中文摘要:研究发现2024年美国总统大选期间约12%的分享图片为AI生成,主要由右倾的"超级传播者"集中扩散,这些AI图片获得更积极的用户反馈,揭示了生成式AI对网络政治环境的潜在影响。
English Summary: This study reveals that AI-generated political images constitute about 12% of shared content during the 2024 U.S. election, primarily disseminated by a small group of right-leaning superspreaders who receive more positive engagement, highlighting AI's growing influence on online political discourse.

Authors:Qingwen Lin, Boyan Xu, Guimin Hu, Zijian Li, Zhifeng Hao, Keli Zhang, Ruichu Cai
Title: CMCTS: A Constrained Monte Carlo Tree Search Framework for Mathematical Reasoning in Large Language Model
Abstract:
This paper introduces the Constrained Monte Carlo Tree Search (CMCTS) framework to enhance the mathematical reasoning capabilities of Large Language Models (LLM). By incorporating a constrained action space, Process Reward Model (PRM), and partial order rules, CMCTS effectively addresses the limitations of existing MCTS methods in terms of state space diversity and action selection rationality. Specifically, during the expansion phase, CMCTS restricts action sampling to a predefined constrained action set to increase candidate state diversity. In the simulation phase, it introduces partial order rules and PRM to optimize action selection and prevent unreasonable state transitions. Experimental results show that CMCTS performs outstandingly across multiple mathematical reasoning benchmarks. Under a zero-shot setting, a 7B-parameter model achieves an average accuracy of 83.4\%, surpassing the 72B baseline model by 4.8\%. Ablation studies demonstrate that each component of the framework is crucial for performance improvement, and their combined use fully leverages their respective strengths. Overall, the CMCTS framework provides an effective approach to enhancing LLM mathematical reasoning capabilities, supported by theoretical analysis, and offers novel insights for future reasoning tasks.
CMCTS框架通过引入约束行动空间、过程奖励模型和偏序规则,有效提升大语言模型的数学推理能力,实验表明7B参数模型在零样本设定下以83.4%的平均准确率超越72B基线模型4.8%。
The CMCTS framework enhances LLM mathematical reasoning by integrating constrained action spaces, process reward models, and partial order rules, achieving superior performance with an 83.4% accuracy using a 7B model that surpasses larger baselines.

Authors:Jinouwen Zhang, Junjie Ren, Aobo Yang, Yan Lu, Lu Chen, Hairun Xie, Jing Wang, Miao Zhang, Wanli Ouyang, Shixiang Tang
Title: FuncGenFoil: Airfoil Generation and Editing Model in Function Space
Abstract:
Aircraft manufacturing is the jewel in the crown of industry, in which generating high-fidelity airfoil geometries with controllable and editable representations remains a fundamental challenge. Existing deep learning methods, which typically rely on predefined parametric representations (e.g., Bézier) or discrete point sets, face an inherent trade-off between expressive power and resolution adaptability. To tackle this challenge, we introduce FuncGenFoil, a novel function-space generative model that directly reconstructs airfoil geometries as function curves. Our method inherits the advantages of arbitrary-resolution sampling and smoothness from parametric functions, as well as the strong expressiveness of discrete point-based representations. Empirical evaluations demonstrate that FuncGenFoil improves upon state-of-the-art methods in airfoil generation, achieving a relative 74.4% reduction in label error and a 23.2% increase in diversity on the AF-200K dataset. Our results highlight the advantages of function-space modeling for aerodynamic shape optimization, offering a powerful and flexible framework for high-fidelity airfoil design.
中文: FuncGenFoil提出了一种函数空间生成模型,通过融合参数函数的平滑性、分辨率适应性与离散表示的强大表达能力,克服了现有方法的局限,在翼型设计中显著提升了精度和多样性。
English: FuncGenFoil introduces a function-space generative model that overcomes the limitations of existing methods by combining the smoothness and resolution adaptability of parametric functions with the expressiveness of discrete representations, achieving significant improvements in accuracy and diversity for airfoil design.

Authors:Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan
Title: MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Abstract:
Despite notable advancements in Multimodal Large Language Models (MLLMs), most state-of-the-art models have not undergone thorough alignment with human preferences. This gap exists because current alignment research has primarily achieved progress in specific areas (e.g., hallucination reduction), while the broader question of whether aligning models with human preferences can systematically enhance MLLM capability remains largely unexplored. To this end, we introduce MM-RLHF, a dataset containing $\mathbf{120k}$ fine-grained, human-annotated preference comparison pairs. This dataset represents a substantial advancement over existing resources, offering superior size, diversity, annotation granularity, and quality. Leveraging this dataset, we propose several key innovations to improve both the quality of reward models and the efficiency of alignment algorithms. Notably, we introduce a Critique-Based Reward Model, which generates critiques of model outputs before assigning scores, offering enhanced interpretability and more informative feedback compared to traditional scalar reward mechanisms. Additionally, we propose Dynamic Reward Scaling, a method that adjusts the loss weight of each sample according to the reward signal, thereby optimizing the use of high-quality comparison pairs. Our approach is rigorously evaluated across $\mathbf{10}$ distinct dimensions and $\mathbf{27}$ benchmarks, with results demonstrating significant and consistent improvements in model performance. Specifically, fine-tuning LLaVA-ov-7B with MM-RLHF and our alignment algorithm leads to a $\mathbf{19.5}$% increase in conversational abilities and a $\mathbf{60}$% improvement in safety. We have open-sourced the preference dataset, reward model, training and evaluation code, as well as reward modeling and safety benchmarks. For more details, please visit our project page: https://mm-rlhf.github.io.
中文: 当前多模态大语言模型缺乏全面的人类偏好对齐,因此我们推出包含12万高质量偏好数据的MM-RLHF数据集及创新对齐方法,显著提升了模型性能与安全性。
English: Current multimodal large language models lack comprehensive human preference alignment, so we introduce MM-RLHF—a 120k high-quality preference dataset with innovative alignment techniques that significantly boost model performance and safety.

Authors:Ang Li, Yin Zhou, Vethavikashini Chithrra Raghuram, Tom Goldstein, Micah Goldblum
Title: Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks
Abstract:
A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs). These attacks may extract private information or coerce the model into producing harmful outputs. In real-world deployments, LLMs are often part of a larger agentic pipeline including memory systems, retrieval, web access, and API calling. Such additional components introduce vulnerabilities that make these LLM-powered agents much easier to attack than isolated LLMs, yet relatively little work focuses on the security of LLM agents. In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents. We first provide a taxonomy of attacks categorized by threat actors, objectives, entry points, attacker observability, attack strategies, and inherent vulnerabilities of agent pipelines. We then conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities. Notably, our attacks are trivial to implement and require no understanding of machine learning.
中文摘要:最新研究表明,集成记忆系统和API调用等组件的LLM智能体比独立模型更易遭受安全与隐私攻击,实验证明这些攻击无需机器学习知识即可轻易实施。
English Summary: Recent research highlights that LLM-powered agents, integrated with components like memory and API calls, are more vulnerable to security and privacy attacks than standalone models, with demonstrated attacks being simple to execute without ML expertise.

Authors:Shixiang Tang, Yizhou Wang, Lu Chen, Yuan Wang, Sida Peng, Dan Xu, Wanli Ouyang
Title: Human-Centric Foundation Models: Perception, Generation and Agentic Modeling
Abstract:
Human understanding and generation are critical for modeling digital humans and humanoid embodiments. Recently, Human-centric Foundation Models (HcFMs) inspired by the success of generalist models, such as large language and vision models, have emerged to unify diverse human-centric tasks into a single framework, surpassing traditional task-specific approaches. In this survey, we present a comprehensive overview of HcFMs by proposing a taxonomy that categorizes current approaches into four groups: (1) Human-centric Perception Foundation Models that capture fine-grained features for multi-modal 2D and 3D understanding. (2) Human-centric AIGC Foundation Models that generate high-fidelity, diverse human-related content. (3) Unified Perception and Generation Models that integrate these capabilities to enhance both human understanding and synthesis. (4) Human-centric Agentic Foundation Models that extend beyond perception and generation to learn human-like intelligence and interactive behaviors for humanoid embodied tasks. We review state-of-the-art techniques, discuss emerging challenges and future research directions. This survey aims to serve as a roadmap for researchers and practitioners working towards more robust, versatile, and intelligent digital human and embodiments modeling.
中文摘要:以人为中心的基础模型(HcFMs)通过整合感知、生成与智能体能力,将多样化人本任务统一至单一框架,为构建更强大的数字人建模提供了超越传统方法的新范式。
English Summary: Human-centric Foundation Models (HcFMs) unify diverse human-centric tasks into a single framework, surpassing traditional approaches by integrating perception, generation, and agentic capabilities for robust digital human modeling.

Authors:Qianrui Teng, Xing Cui, Xuannan Liu, Peipei Li, Zekun Li, Huaibo Huang, Ran He
Title: ID-Cloak: Crafting Identity-Specific Cloaks Against Personalized Text-to-Image Generation
Abstract:
Personalized text-to-image models allow users to generate images of new concepts from several reference photos, thereby leading to critical concerns regarding civil privacy. Although several anti-personalization techniques have been developed, these methods typically assume that defenders can afford to design a privacy cloak corresponding to each specific image. However, due to extensive personal images shared online, image-specific methods are limited by real-world practical applications. To address this issue, we are the first to investigate the creation of identity-specific cloaks (ID-Cloak) that safeguard all images belong to a specific identity. Specifically, we first model an identity subspace that preserves personal commonalities and learns diverse contexts to capture the image distribution to be protected. Then, we craft identity-specific cloaks with the proposed novel objective that encourages the cloak to guide the model away from its normal output within the subspace. Extensive experiments show that the generated universal cloak can effectively protect the images. We believe our method, along with the proposed identity-specific cloak setting, marks a notable advance in realistic privacy protection.
中文: 本研究首次提出身份特定隐私保护方法ID-Cloak,通过构建身份子空间并引导模型偏离正常输出,能有效防止个性化文生图模型生成特定个人的可识别图像。
English: This study introduces ID-Cloak, the first identity-specific privacy protection method that creates universal cloaks to prevent personalized text-to-image models from generating recognizable images of an individual by learning their identity subspace and redirecting model outputs.

Authors:Song Wang, Zhen Tan, Yaochen Zhu, Chuxu Zhang, Jundong Li
Title: Generative Risk Minimization for Out-of-Distribution Generalization on Graphs
Abstract:
Out-of-distribution (OOD) generalization on graphs aims at dealing with scenarios where the test graph distribution differs from the training graph distributions. Compared to i.i.d. data like images, the OOD generalization problem on graph-structured data remains challenging due to the non-i.i.d. property and complex structural information on graphs. Recently, several works on graph OOD generalization have explored extracting invariant subgraphs that share crucial classification information across different distributions. Nevertheless, such a strategy could be suboptimal for entirely capturing the invariant information, as the extraction of discrete structures could potentially lead to the loss of invariant information or the involvement of spurious information. In this paper, we propose an innovative framework, named Generative Risk Minimization (GRM), designed to generate an invariant subgraph for each input graph to be classified, instead of extraction. To address the challenge of optimization in the absence of optimal invariant subgraphs (i.e., ground truths), we derive a tractable form of the proposed GRM objective by introducing a latent causal variable, and its effectiveness is validated by our theoretical analysis. We further conduct extensive experiments across a variety of real-world graph datasets for both node-level and graph-level OOD generalization, and the results demonstrate the superiority of our framework GRM.
中文: 本文提出生成风险最小化(GRM)框架,通过生成而非提取不变子图来提升图数据的分布外泛化能力,有效解决了现有方法的信息丢失问题,并通过理论分析和实验验证了其优越性。
English: This paper introduces Generative Risk Minimization (GRM), a novel framework that generates invariant subgraphs for graph classification to enhance out-of-distribution generalization, overcoming limitations of extraction-based methods and demonstrating effectiveness through theoretical analysis and experiments.

Authors:Sean McLeish, John Kirchenbauer, David Yu Miller, Siddharth Singh, Abhinav Bhatele, Micah Goldblum, Ashwinee Panda, Tom Goldstein
Title: Gemstones: A Model Suite for Multi-Faceted Scaling Laws
Abstract:
Scaling laws are typically fit using a family of models with a narrow range of frozen hyper-parameter choices. In this work we study scaling laws using multiple architectural shapes and hyperparameter choices, highlighting their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: an open-source scaling law dataset, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters and diverse architectural shapes; including ablations over learning rate and cooldown. Our checkpoints enable more complex studies of scaling, such as analyzing the relationship between width and depth. By examining our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting.
中文: 本研究通过纳入多样化的架构形状和超参数扩展了规模法则分析,揭示了它们对模型方案的显著影响,并发布了包含4000多个Transformer检查点的Gemstones数据集以供深入研究。
English: This research expands scaling law analysis by incorporating diverse architectural shapes and hyperparameters, revealing their significant impact on model prescriptions and releasing the Gemstones dataset with over 4000 transformer checkpoints for further study.

Authors:Sean McLeish, John Kirchenbauer, David Yu Miller, Siddharth Singh, Abhinav Bhatele, Micah Goldblum, Ashwinee Panda, Tom Goldstein
Title: Gemstones: A Model Suite for Multi-Faceted Scaling Laws
Abstract:
Scaling laws are typically fit using a family of models with a narrow range of frozen hyperparameter choices. In this work we study scaling laws using multiple architectural shapes and hyperparameter choices, highlighting their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: an open-source scaling law dataset, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters and diverse architectural shapes; including ablations over learning rate and cooldown. Our checkpoints enable more complex studies of scaling, such as analyzing the relationship between width and depth. By examining our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting.
中文: 本研究通过纳入多样化的架构形状和超参数扩展了规模法则分析,揭示了它们对模型方案的显著影响,并发布了包含4000多个Transformer检查点的Gemstones数据集以供深入研究。
English: This research expands scaling law analysis by incorporating diverse architectural shapes and hyperparameters, revealing their significant impact on model prescriptions and releasing the Gemstones dataset with over 4000 transformer checkpoints for further study.

Authors:Houcheng Jiang, Junfeng Fang, Ningyu Zhang, Guojun Ma, Mingyang Wan, Xiang Wang, Xiangnan He, Tat-seng Chua
Title: AnyEdit: Edit Any Knowledge Encoded in Language Models
Abstract:
Large language models (LLMs) often produce incorrect or outdated information, necessitating efficient and precise knowledge updates. Current model editing methods, however, struggle with long-form knowledge in diverse formats, such as poetry, code snippets, and mathematical derivations. These limitations arise from their reliance on editing a single token's hidden state, a limitation we term "efficacy barrier". To solve this, we propose AnyEdit, a new autoregressive editing paradigm. It decomposes long-form knowledge into sequential chunks and iteratively edits the key token in each chunk, ensuring consistent and accurate outputs. Theoretically, we ground AnyEdit in the Chain Rule of Mutual Information, showing its ability to update any knowledge within LLMs. Empirically, it outperforms strong baselines by 21.5% on benchmarks including UnKEBench, AKEW, and our new EditEverything dataset for long-form diverse-formatted knowledge. Additionally, AnyEdit serves as a plug-and-play framework, enabling current editing methods to update knowledge with arbitrary length and format, significantly advancing the scope and practicality of LLM knowledge editing.
中文: 针对现有模型编辑方法在处理长文本和多样化格式知识方面的局限,我们提出AnyEdit这一自回归编辑范式,通过将知识分解为顺序块并迭代编辑关键标记,显著提升了性能,并作为即插即用框架实现了广泛适用性。
English: To address the limitations of current model editing methods in handling long-form and diverse-format knowledge, we propose AnyEdit, an autoregressive editing paradigm that decomposes knowledge into sequential chunks and iteratively edits key tokens, achieving significant performance improvements and broad applicability as a plug-and-play framework.

Authors:Yueying Zou, Peipei Li, Zekun Li, Huaibo Huang, Xing Cui, Xuannan Liu, Chenghanyu Zhang, Ran He
Title: Survey on AI-Generated Media Detection: From Non-MLLM to MLLM
Abstract:
The proliferation of AI-generated media poses significant challenges to information authenticity and social trust, making reliable detection methods highly demanded. Methods for detecting AI-generated media have evolved rapidly, paralleling the advancement of Multimodal Large Language Models (MLLMs). Current detection approaches can be categorized into two main groups: Non-MLLM-based and MLLM-based methods. The former employs high-precision, domain-specific detectors powered by deep learning techniques, while the latter utilizes general-purpose detectors based on MLLMs that integrate authenticity verification, explainability, and localization capabilities. Despite significant progress in this field, there remains a gap in literature regarding a comprehensive survey that examines the transition from domain-specific to general-purpose detection methods. This paper addresses this gap by providing a systematic review of both approaches, analyzing them from single-modal and multi-modal perspectives. We present a detailed comparative analysis of these categories, examining their methodological similarities and differences. Through this analysis, we explore potential hybrid approaches and identify key challenges in forgery detection, providing direction for future research. Additionally, as MLLMs become increasingly prevalent in detection tasks, ethical and security considerations have emerged as critical global concerns. We examine the regulatory landscape surrounding Generative AI (GenAI) across various jurisdictions, offering valuable insights for researchers and practitioners in this field.
中文: 本文系统综述了AI生成媒体的检测方法,对比专业深度学习检测器与通用多模态语言模型方法,同时探讨伦理问题及未来研究方向。
English: This paper provides a systematic review of AI-generated media detection methods, comparing specialized deep learning detectors with general-purpose multimodal language model approaches while addressing ethical concerns and future research directions.

Authors:Zhouliang Yu, Yuhuan Yuan, Tim Z. Xiao, Fuxiang Frank Xia, Jie Fu, Ge Zhang, Ge Lin, Weiyang Liu
Title: Generating Symbolic World Models via Test-time Scaling of Large Language Models
Abstract:
Solving complex planning problems requires Large Language Models (LLMs) to explicitly model the state transition to avoid rule violations, comply with constraints, and ensure optimality-a task hindered by the inherent ambiguity of natural language. To overcome such ambiguity, Planning Domain Definition Language (PDDL) is leveraged as a planning abstraction that enables precise and formal state descriptions. With PDDL, we can generate a symbolic world model where classic searching algorithms, such as A*, can be seamlessly applied to find optimal plans. However, directly generating PDDL domains with current LLMs remains an open challenge due to the lack of PDDL training data. To address this challenge, we propose to scale up the test-time computation of LLMs to enhance their PDDL reasoning capabilities, thereby enabling the generation of high-quality PDDL domains. Specifically, we introduce a simple yet effective algorithm, which first employs a Best-of-N sampling approach to improve the quality of the initial solution and then refines the solution in a fine-grained manner with verbalized machine learning. Our method outperforms o1-mini by a considerable margin in the generation of PDDL domains, achieving over 50\% success rate on two tasks (i.e., generating PDDL domains from natural language description or PDDL problems). This is done without requiring additional training. By taking advantage of PDDL as state abstraction, our method is able to outperform current state-of-the-art methods on almost all competition-level planning tasks.
中文摘要:针对大语言模型因缺乏PDDL训练数据而难以生成精确规划领域定义语言的挑战,本文提出一种无需额外训练、结合最佳N采样和语言化机器学习的方法,通过增强测试时计算能力显著提升了PDDL领域生成质量与规划任务性能。
English Summary: Large Language Models struggle with generating precise Planning Domain Definition Language (PDDL) domains due to training data scarcity, but a new test-time computation method using Best-of-N sampling and verbalized machine learning significantly enhances PDDL reasoning capabilities without additional training.

Authors:Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister
Title: Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems
Abstract:
We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by jointly optimizing model roles and weights. We represent multi-LLM systems as directed acyclic graphs (DAGs) of LLMs with topological message passing for collaborative generation. Given a pool of LLM experts and a utility function, Heterogeneous Swarms employs two iterative steps: role-step and weight-step. For role-step, we interpret model roles as learning a DAG that specifies the flow of inputs and outputs between LLMs. Starting from a swarm of random continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs in topological order, evaluate on the utility function (e.g. accuracy on a task), and optimize the adjacency matrices with particle swarm optimization based on the utility score. For weight-step, we assess the contribution of individual LLMs in the multi-LLM systems and optimize model weights with swarm intelligence. We propose JFK-score to quantify the individual contribution of each LLM in the best-found DAG of the role-step, then optimize model weights with particle swarm optimization based on the JFK-score. Experiments demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based baselines by 18.5% on average across 12 tasks. Further analysis reveals that Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles and substantial collaborative gains, and benefits from the diversity of language models.
Chinese: 异构群算法通过迭代的角色与权重步骤联合优化多LLM系统中的模型角色和权重,在12项任务中平均超越基线18.5%,利用模型多样性实现显著协同增益。
English: Heterogeneous Swarms is an algorithm that designs multi-LLM systems by jointly optimizing model roles and weights through iterative role and weight steps, outperforming baselines by 18.5% across 12 tasks and leveraging model diversity for collaborative gains.

Authors:Shangbin Feng, Wenxuan Ding, Alisa Liu, Zifeng Wang, Weijia Shi, Yike Wang, Zejiang Shen, Xiaochuang Han, Hunter Lang, Chen-Yu Lee, Tomas Pfister, Yejin Choi, Yulia Tsvetkov
Title: When One LLM Drools, Multi-LLM Collaboration Rules
Abstract:
This position paper argues that in many realistic (i.e., complex, contextualized, subjective) scenarios, one LLM is not enough to produce a reliable output. We challenge the status quo of relying solely on a single general-purpose LLM and argue for multi-LLM collaboration to better represent the extensive diversity of data, skills, and people. We first posit that a single LLM underrepresents real-world data distributions, heterogeneous skills, and pluralistic populations, and that such representation gaps cannot be trivially patched by further training a single LLM. We then organize existing multi-LLM collaboration methods into a hierarchy, based on the level of access and information exchange, ranging from API-level, text-level, logit-level, to weight-level collaboration. Based on these methods, we highlight how multi-LLM collaboration addresses challenges that a single LLM struggles with, such as reliability, democratization, and pluralism. Finally, we identify the limitations of existing multi-LLM methods and motivate future work. We envision multi-LLM collaboration as an essential path toward compositional intelligence and collaborative AI development.
中文摘要:本文主张采用多大型语言模型协作替代单一模型,以更有效地应对复杂现实场景,通过弥补数据和技能的表征差距来提升可靠性、民主化与多元性。
English Summary: This paper advocates for multi-LLM collaboration over single-model approaches to better handle complex scenarios, arguing it improves reliability, democratization, and pluralism by bridging representation gaps in data and skills.

Authors:Junhao Song, Yichao Zhang, Ziqian Bi, Tianyang Wang, Keyu Chen, Ming Li, Qian Niu, Junyu Liu, Benji Peng, Sen Zhang, Ming Liu, Jiawei Xu, Xuanhe Pan, Jinlang Wang, Pohsun Feng, Yizhu Wen, Lawrence K. Q. Yan, Hong-Ming Tseng, Xinyuan Song, Jintao Ren, Silin Chen, Yunze Wang, Weiche Hsieh, Bowen Jing, Junjie Yang, Jun Zhou, Zheyu Yao, Chia Xin Liang
Title: Generative Adversarial Networks Bridging Art and Machine Intelligence
Abstract:
Generative Adversarial Networks (GAN) have greatly influenced the development of computer vision and artificial intelligence in the past decade and also connected art and machine intelligence together. This book begins with a detailed introduction to the fundamental principles and historical development of GANs, contrasting them with traditional generative models and elucidating the core adversarial mechanisms through illustrative Python examples. The text systematically addresses the mathematical and theoretical underpinnings including probability theory, statistics, and game theory providing a solid framework for understanding the objectives, loss functions, and optimisation challenges inherent to GAN training. Subsequent chapters review classic variants such as Conditional GANs, DCGANs, InfoGAN, and LAPGAN before progressing to advanced training methodologies like Wasserstein GANs, GANs with gradient penalty, least squares GANs, and spectral normalisation techniques. The book further examines architectural enhancements and task-specific adaptations in generators and discriminators, showcasing practical implementations in high resolution image generation, artistic style transfer, video synthesis, text to image generation and other multimedia applications. The concluding sections offer insights into emerging research trends, including self-attention mechanisms, transformer-based generative models, and a comparative analysis with diffusion models, thus charting promising directions for future developments in both academic and applied settings.
中文: 本书系统阐述了GAN的基本原理、数学理论、经典变体与先进训练方法,并探讨了实际应用场景,最后展望了包括自注意力机制和扩散模型在内的未来研究方向。
English: This book comprehensively covers GAN fundamentals, mathematical theories, classic variants, advanced training methods, and practical applications while concluding with emerging research trends and future directions in the field.

Authors:Tianyang Wang, Silin Chen, Yunze Wang, Yichao Zhang, Xinyuan Song, Ziqian Bi, Ming Liu, Qian Niu, Junyu Liu, Pohsun Feng, Xintian Sun, Benji Peng, Charles Zhang, Keyu Chen, Ming Li, Cheng Fei, Lawrence KQ Yan
Title: From In Silico to In Vitro: A Comprehensive Guide to Validating Bioinformatics Findings
Abstract:
The integration of bioinformatics predictions and experimental validation plays a pivotal role in advancing biological research, from understanding molecular mechanisms to developing therapeutic strategies. Bioinformatics tools and methods offer powerful means for predicting gene functions, protein interactions, and regulatory networks, but these predictions must be validated through experimental approaches to ensure their biological relevance. This review explores the various methods and technologies used for experimental validation, including gene expression analysis, protein-protein interaction verification, and pathway validation. We also discuss the challenges involved in translating computational predictions to experimental settings and highlight the importance of collaboration between bioinformatics and experimental research. Finally, emerging technologies, such as CRISPR gene editing, next-generation sequencing, and artificial intelligence, are shaping the future of bioinformatics validation and driving more accurate and efficient biological discoveries.
中文: 生物信息学预测与实验验证的协同作用对生物学研究至关重要,而CRISPR和人工智能等新兴技术正提高发现的准确性和效率。
English: The synergy of bioinformatics predictions and experimental validation is crucial for biological research, with emerging technologies like CRISPR and AI enhancing the accuracy and efficiency of discoveries.

Authors:Fan Lyu, Hanyu Zhao, Ziqi Shi, Ye Liu, Fuyuan Hu, Zhang Zhang, Liang Wang
Title: Conformal Uncertainty Indicator for Continual Test-Time Adaptation
Abstract:
Continual Test-Time Adaptation (CTTA) aims to adapt models to sequentially changing domains during testing, relying on pseudo-labels for self-adaptation. However, incorrect pseudo-labels can accumulate, leading to performance degradation. To address this, we propose a Conformal Uncertainty Indicator (CUI) for CTTA, leveraging Conformal Prediction (CP) to generate prediction sets that include the true label with a specified coverage probability. Since domain shifts can lower the coverage than expected, making CP unreliable, we dynamically compensate for the coverage by measuring both domain and data differences. Reliable pseudo-labels from CP are then selectively utilized to enhance adaptation. Experiments confirm that CUI effectively estimates uncertainty and improves adaptation performance across various existing CTTA methods.
中文: 提出的保形不确定性指标通过动态补偿保形预测的覆盖概率来生成可靠的伪标签,有效减少错误累积并提升持续测试时自适应方法的性能。
English: The proposed Conformal Uncertainty Indicator (CUI) dynamically compensates for coverage probability in Conformal Prediction to generate reliable pseudo-labels, effectively mitigating error accumulation and improving adaptation performance across Continual Test-Time Adaptation methods.

Authors:Jixun Yao, Hexin Liu, Chen Chen, Yuchen Hu, EngSiong Chng, Lei Xie
Title: GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling
Abstract:
Semantic information refers to the meaning conveyed through words, phrases, and contextual relationships within a given linguistic structure. Humans can leverage semantic information, such as familiar linguistic patterns and contextual cues, to reconstruct incomplete or masked speech signals in noisy environments. However, existing speech enhancement (SE) approaches often overlook the rich semantic information embedded in speech, which is crucial for improving intelligibility, speaker consistency, and overall quality of enhanced speech signals. To enrich the SE model with semantic information, we employ language models as an efficient semantic learner and propose a comprehensive framework tailored for language model-based speech enhancement, called \textit{GenSE}. Specifically, we approach SE as a conditional language modeling task rather than a continuous signal regression problem defined in existing works. This is achieved by tokenizing speech signals into semantic tokens using a pre-trained self-supervised model and into acoustic tokens using a custom-designed single-quantizer neural codec model. To improve the stability of language model predictions, we propose a hierarchical modeling method that decouples the generation of clean semantic tokens and clean acoustic tokens into two distinct stages. Moreover, we introduce a token chain prompting mechanism during the acoustic token generation stage to ensure timbre consistency throughout the speech enhancement process. Experimental results on benchmark datasets demonstrate that our proposed approach outperforms state-of-the-art SE systems in terms of speech quality and generalization capability.
中文: 提出的GenSE框架将语音增强视为条件语言建模任务,通过语义和声学标记的分层生成及提示机制,在语音质量和泛化能力上超越了现有先进系统。
English: The proposed GenSE framework leverages language models to treat speech enhancement as a conditional language modeling task, utilizing semantic and acoustic tokens to improve speech quality and generalization beyond traditional methods.

Authors:Jinda Lu, Junkang Wu, Jinghan Li, Xiaojun Jia, Shuo Wang, YiFan Zhang, Junfeng Fang, Xiang Wang, Xiangnan He
Title: DAMA: Data- and Model-aware Alignment of Multi-modal LLMs
Abstract:
Direct Preference Optimization (DPO) has shown effectiveness in aligning multi-modal large language models (MLLM) with human preferences. However, existing methods exhibit an imbalanced responsiveness to the data of varying hardness, tending to overfit on the easy-to-distinguish data while underfitting on the hard-to-distinguish data. In this paper, we propose Data- and Model-aware DPO (DAMA) to dynamically adjust the optimization process from two key aspects: (1) a data-aware strategy that incorporates data hardness, and (2) a model-aware strategy that integrates real-time model responses. By combining the two strategies, DAMA enables the model to effectively adapt to data with varying levels of hardness. Extensive experiments on five benchmarks demonstrate that DAMA not only significantly enhances the trustworthiness, but also improves the effectiveness over general tasks. For instance, on the Object-HalBench, our DAMA-7B reduces response-level and mentioned-level hallucination by 90.0% and 95.3%, respectively, surpassing the performance of GPT-4V.
中文: DPO在将多模态大语言模型与人类偏好对齐方面有效,但难以均衡处理不同难度的数据,因此提出DAMA方法,通过数据感知和模型感知策略动态优化,显著提升任务可信度和效果。
English: DPO effectively aligns multi-modal large language models with human preferences but struggles with data of varying hardness, leading to the proposed DAMA method that dynamically adjusts optimization using data-aware and model-aware strategies to enhance trustworthiness and effectiveness across tasks.

Authors:Angelo Garofalo, Alessandro Ottaviano, Matteo Perotti, Thomas Benz, Yvan Tortorella, Robert Balas, Michael Rogenmoser, Chi Zhang, Luca Bertaccini, Nils Wistoff, Maicol Ciani, Cyril Koenig, Mattia Sinigaglia, Luca Valente, Paul Scheffler, Manuel Eggimann, Matheus Cavalcante, Francesco Restuccia, Alessandro Biondi, Francesco Conti, Frank K. Gurkaynak, Davide Rossi, Luca Benini
Title: A Reliable, Time-Predictable Heterogeneous SoC for AI-Enhanced Mixed-Criticality Edge Applications
Abstract:
Next-generation mixed-criticality Systems-on-chip (SoCs) for robotics, automotive, and space must execute mixed-criticality AI-enhanced sensor processing and control workloads, ensuring reliable and time-predictable execution of critical tasks sharing resources with non-critical tasks, while also fitting within a sub-2W power envelope. To tackle these multi-dimensional challenges, in this brief, we present a 16nm, reliable, time-predictable heterogeneous SoC with multiple programmable accelerators. Within a 1.2W power envelope, the SoC integrates software-configurable hardware IPs to ensure predictable access to shared resources, such as the on-chip interconnect and memory system, leading to tight upper bounds on execution times of critical applications. To accelerate mixed-precision mission-critical AI, the SoC integrates a reliable multi-core accelerator achieving 304.9 GOPS peak performance at 1.6 TOPS/W energy efficiency. Non-critical, compute-intensive, floating-point workloads are accelerated by a dual-core vector cluster, achieving 121.8 GFLOPS at 1.1 TFLOPS/W and 106.8 GFLOPS/mm2.
中文: 本文介绍了一款16纳米异构SoC,在1.2瓦功耗下通过可编程加速器实现关键AI任务(304.9 GOPS)与浮点运算(121.8 GFLOPS)的可靠可预测执行,专为机器人、汽车和航天领域的混合关键性系统设计。
English: This brief introduces a 16nm heterogeneous SoC that ensures reliable, time-predictable execution of mixed-criticality AI workloads within a 1.2W power envelope, featuring programmable accelerators for both mixed-precision AI (304.9 GOPS) and floating-point computations (121.8 GFLOPS).

Authors:Jacob Fein-Ashley, Neelesh Gupta, Rajgopal Kannan, Viktor Prasanna
Title: SPECTRE: An FFT-Based Efficient Drop-In Replacement to Self-Attention for Long Contexts
Abstract:
Long-context transformers face significant efficiency challenges due to the quadratic cost of self-attention. However, many modern applications-from multi-turn dialogue to high-resolution vision-require contexts spanning tens of thousands of tokens. We introduce SPECTRE, a method that replaces each attention head with a fast real FFT, a content-adaptive spectral gate, and an inverse FFT, reducing per-layer complexity from $\mathcal{O}(L^{2})$ to $O(L\log L)$ while preserving the surrounding architecture. We extend this efficiency to autoregressive generation through our Prefix-FFT cache and enhance local feature representation with an optional wavelet module that adds negligible computational overhead. Our experiments demonstrate that SPECTRE operates up to 7$\times$ faster than FlashAttention-2 on 128k-token contexts while matching or exceeding baseline performance on PG-19 language modeling and ImageNet-1k classification tasks. SPECTRE achieves these improvements by adding fewer than 6\% parameters to the base model, making hundred-kilotoken context processing feasible on commodity GPUs without specialized hardware.
Chinese: SPECTRE 提出了一种创新方法,通过快速傅里叶变换和自适应频谱门控将自注意力机制的二次复杂度降至近似线性,在仅增加不到6%参数的情况下,实现了对长上下文处理速度提升高达7倍且性能相当。
English: SPECTRE introduces a novel method using fast Fourier transforms and spectral gating to reduce the quadratic complexity of self-attention in transformers to nearly linear, enabling up to 7x faster processing of long contexts with minimal parameter increase while maintaining performance.

Authors:Linzhuang Sun, Hao Liang, Jingxuan Wei, Bihui Yu, Tianpeng Li, Fan Yang, Zenan Zhou, Wentao Zhang
Title: MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification
Abstract:
According to the Test-Time Scaling, the integration of External Slow-Thinking with the Verify mechanism has been demonstrated to enhance multi-round reasoning in large language models (LLMs). However, in the multimodal (MM) domain, there is still a lack of a strong MM-Verifier. In this paper, we introduce MM-Verifier and MM-Reasoner to enhance multimodal reasoning through longer inference and more robust verification. First, we propose a two-step MM verification data synthesis method, which combines a simulation-based tree search with verification and uses rejection sampling to generate high-quality Chain-of-Thought (COT) data. This data is then used to fine-tune the verification model, MM-Verifier. Additionally, we present a more efficient method for synthesizing MMCOT data, bridging the gap between text-based and multimodal reasoning. The synthesized data is used to fine-tune MM-Reasoner. Our MM-Verifier outperforms all larger models on the MathCheck, MathVista, and MathVerse benchmarks. Moreover, MM-Reasoner demonstrates strong effectiveness and scalability, with performance improving as data size increases. Finally, our approach achieves strong performance when combining MM-Reasoner and MM-Verifier, reaching an accuracy of 65.3 on MathVista, surpassing GPT-4o (63.8) with 12 rollouts.
中文: 本文提出MM-Verifier和MM-Reasoner,通过增强验证和推理来提升多模态推理能力,在MathVista等基准测试中达到了领先性能。
English: This paper introduces MM-Verifier and MM-Reasoner to enhance multimodal reasoning through improved verification and inference, achieving state-of-the-art performance on benchmarks like MathVista.

Authors:Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao
Title: Magma: A Foundation Model for Multimodal AI Agents
Abstract:
We present Magma, a foundation model that serves multimodal AI agentic tasks in both the digital and physical worlds. Magma is a significant extension of vision-language (VL) models in that it not only retains the VL understanding ability (verbal intelligence) of the latter, but is also equipped with the ability to plan and act in the visual-spatial world (spatial-temporal intelligence) and complete agentic tasks ranging from UI navigation to robot manipulation. To endow the agentic capabilities, Magma is pretrained on large amounts of heterogeneous datasets spanning from images, videos to robotics data, where the actionable visual objects (e.g., clickable buttons in GUI) in images are labeled by Set-of-Mark (SoM) for action grounding, and the object movements (e.g., the trace of human hands or robotic arms) in videos are labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show that SoM and ToM reach great synergy and facilitate the acquisition of spatial-temporal intelligence for our Magma model, which is fundamental to a wide range of tasks as shown in Fig.1. In particular, Magma creates new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are specifically tailored to these tasks. On image and video-related multimodal tasks, Magma also compares favorably to popular large multimodal models that are trained on much larger datasets. We make our model and code public for reproducibility at https://microsoft.github.io/Magma.
中文: Magma是一个多模态基础模型,不仅具备视觉语言理解能力,还通过空间-时间智能在数字和物理世界中规划执行任务,在界面导航和机器人操控等任务中创造了最新最优性能。
English: Magma is a multimodal foundation model that extends vision-language capabilities by integrating spatial-temporal intelligence for planning and executing agentic tasks in both digital and physical environments, achieving state-of-the-art performance in UI navigation and robotic manipulation.

Authors:Lin-Han Jia, Si-Yu Han, Lan-Zhe Guo, Zhi Zhou, Zhao-Long Li, Yu-Feng Li, Zhi-Hua Zhou
Title: A Smooth Transition Between Induction and Deduction: Fast Abductive Learning Based on Probabilistic Symbol Perception
Abstract:
Abductive learning (ABL) that integrates strengths of machine learning and logical reasoning to improve the learning generalization, has been recently shown effective. However, its efficiency is affected by the transition between numerical induction and symbolical deduction, leading to high computational costs in the worst-case scenario. Efforts on this issue remain to be limited. In this paper, we identified three reasons why previous optimization algorithms for ABL were not effective: insufficient utilization of prediction, symbol relationships, and accumulated experience in successful abductive processes, resulting in redundant calculations to the knowledge base. To address these challenges, we introduce an optimization algorithm named as Probabilistic Symbol Perception (PSP), which makes a smooth transition between induction and deduction and keeps the correctness of ABL unchanged. We leverage probability as a bridge and present an efficient data structure, achieving the transfer from a continuous probability sequence to discrete Boolean sequences with low computational complexity. Experiments demonstrate the promising results.
Chinese: 本文提出概率符号感知(PSP)优化算法,通过概率作为桥梁实现归纳与演绎间的平滑过渡,在保持溯因学习正确性的同时有效提升了其计算效率。
English: This paper introduces Probabilistic Symbol Perception (PSP), an optimization algorithm that enhances the efficiency of abductive learning by facilitating a smooth transition between numerical induction and logical deduction while maintaining correctness.

Authors:Elena Stringli, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou
Title: Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models
Abstract:
Inverse tasks can uncover potential reasoning gaps as Large Language Models (LLMs) scale up. In this work, we explore the redefinition task, in which we assign alternative values to well-known physical constants and units of measure, prompting LLMs to respond accordingly. Our findings show that not only does model performance degrade with scale, but its false confidence also rises. Moreover, while factors such as prompting strategies or response formatting are influential, they do not preclude LLMs from anchoring to memorized values.
中文摘要:随着大语言模型规模的扩大,在重新定义物理常数等逆向任务中,其错误置信度上升且性能下降,尽管提示策略和响应格式有所影响,模型仍固守记忆数值。
English Summary: As LLMs scale, they exhibit increased false confidence and performance decline in inverse tasks like redefining physical constants, despite being influenced by prompting strategies and formatting.

Authors:Yunhao Gou, Hansi Yang, Zhili Liu, Kai Chen, Yihan Zeng, Lanqing Hong, Zhenguo Li, Qun Liu, Bo Han, James T. Kwok, Yu Zhang
Title: Corrupted but Not Broken: Understanding and Mitigating the Negative Impacts of Corrupted Data in Visual Instruction Tuning
Abstract:
Visual Instruction Tuning (VIT) aims to enhance Multimodal Large Language Models (MLLMs), yet its effectiveness is often compromised by corrupted datasets with issues such as hallucinated content, incorrect responses, and poor OCR quality. Previous approaches to address these challenges have focused on refining datasets through high-quality data collection or rule-based filtering that can be costly or limited in scope. In this paper, we conduct a systematic investigation into the impact of corrupted data on MLLMs and discover that, although corrupted data degrade model performance, such adverse effects are largely reversible, and MLLMs are {\bf corrupted but not broken}. Specifically, we find that disabling a small subset of parameters can almost fully restore performance. Moreover, corrupted MLLMs inherently possess the capability to differentiate between clean and corrupted samples, facilitating dataset cleaning without external intervention. Building on these insights, we introduce a corruption-robust training paradigm that significantly surpasses existing strategies for mitigating the effects of corrupted data.
中文摘要:视觉指令调优常因数据污染而效果受限,但本研究发现多模态大语言模型虽受污染数据影响性能下降,这种损害可通过禁用少量参数完全恢复,且模型天生具备区分清洁与污染样本的能力,据此提出了超越现有方案的抗污染训练新范式。
English Summary: Visual Instruction Tuning's effectiveness is often hindered by corrupted datasets, but this study reveals that MLLMs' performance degradation from such data is reversible through targeted parameter adjustments and their inherent ability to distinguish clean from corrupted samples, leading to a new robust training method.

Authors:Shoukang Hu, Takuya Narihira, Kazumi Fukuda, Ryosuke Sawata, Takashi Shibuya, Yuki Mitsufuji
Title: HumanGif: Single-View Human Diffusion with Generative Prior
Abstract:
Previous 3D human creation methods have made significant progress in synthesizing view-consistent and temporally aligned results from sparse-view images or monocular videos. However, it remains challenging to produce perpetually realistic, view-consistent, and temporally coherent human avatars from a single image, as limited information is available in the single-view input setting. Motivated by the success of 2D character animation, we propose HumanGif, a single-view human diffusion model with generative prior. Specifically, we formulate the single-view-based 3D human novel view and pose synthesis as a single-view-conditioned human diffusion process, utilizing generative priors from foundational diffusion models to complement the missing information. To ensure fine-grained and consistent novel view and pose synthesis, we introduce a Human NeRF module in HumanGif to learn spatially aligned features from the input image, implicitly capturing the relative camera and human pose transformation. Furthermore, we introduce an image-level loss during optimization to bridge the gap between latent and image spaces in diffusion models. Extensive experiments on RenderPeople, DNA-Rendering, THuman 2.1, and TikTok datasets demonstrate that HumanGif achieves the best perceptual performance, with better generalizability for novel view and pose synthesis.
中文摘要:先前的人体三维生成方法难以从单张图像创建真实、视角一致且时序连贯的虚拟形象,为此提出的HumanGif模型通过扩散生成先验与神经辐射场模块,在新型视角与姿态合成中实现了最优的感知效果。
English Summary: Previous 3D human creation methods struggle with generating realistic, view-consistent, and temporally coherent avatars from a single image, leading to the development of HumanGif, which uses a diffusion model with generative priors and a Human NeRF module to achieve superior perceptual performance in novel view and pose synthesis.

Authors:Zhongyi Qiu, Hanjia Lyu, Wei Xiong, Jiebo Luo
Title: Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation
Abstract:
Social media enables dynamic user engagement with trending topics, and recent research has explored the potential of large language models (LLMs) for response generation. While some studies investigate LLMs as agents for simulating user behavior on social media, their focus remains on practical viability and scalability rather than a deeper understanding of how well LLM aligns with human behavior. This paper analyzes LLMs' ability to simulate social media engagement through action guided response generation, where a model first predicts a user's most likely engagement action-retweet, quote, or rewrite-towards a trending post before generating a personalized response conditioned on the predicted action. We benchmark GPT-4o-mini, O1-mini, and DeepSeek-R1 in social media engagement simulation regarding a major societal event discussed on X. Our findings reveal that zero-shot LLMs underperform BERT in action prediction, while few-shot prompting initially degrades the prediction accuracy of LLMs with limited examples. However, in response generation, few-shot LLMs achieve stronger semantic alignment with ground truth posts.
中文摘要:本文通过预测用户行为并生成回复来评估大语言模型模拟社交媒体参与的能力,发现尽管在行为预测上不如BERT模型,但少量样本学习能显著提升其生成回复与真实帖子的语义匹配度。
English Summary: This paper evaluates large language models' capability to simulate social media engagement by predicting user actions and generating responses, finding that while they underperform BERT in action prediction, few-shot learning improves their semantic alignment with human posts in response generation.

Authors:Ziyang Wu, Jingyuan Zhang, Druv Pai, XuDong Wang, Chandan Singh, Jianwei Yang, Jianfeng Gao, Yi Ma
Title: Simplifying DINO via Coding Rate Regularization
Abstract:
DINO and DINOv2 are two model families being widely used to learn representations from unlabeled imagery data at large scales. Their learned representations often enable state-of-the-art performance for downstream tasks, such as image classification and segmentation. However, they employ many empirically motivated design choices and their training pipelines are highly complex and unstable -- many hyperparameters need to be carefully tuned to ensure that the representations do not collapse -- which poses considerable difficulty to improving them or adapting them to new domains. In this work, we posit that we can remove most such-motivated idiosyncrasies in the pre-training pipelines, and only need to add an explicit coding rate term in the loss function to avoid collapse of the representations. As a result, we obtain highly simplified variants of the DINO and DINOv2 which we call SimDINO and SimDINOv2, respectively. Remarkably, these simplified models are more robust to different design choices, such as network architecture and hyperparameters, and they learn even higher-quality representations, measured by performance on downstream tasks, offering a Pareto improvement over the corresponding DINO and DINOv2 models. This work highlights the potential of using simplifying design principles to improve the empirical practice of deep learning.
中文: SimDINO和SimDINOv2是DINO和DINOv2的简化版本,通过引入编码率项防止表征崩溃,不仅更稳健且在下游任务中表现更优,实现了帕累托改进。
English: SimDINO and SimDINOv2 are simplified, more robust versions of DINO and DINOv2 that achieve higher-quality representations by adding a coding rate term to prevent collapse, offering a Pareto improvement in downstream tasks.

Authors:Ruichen Zhang, Mufan Qiu, Zhen Tan, Mohan Zhang, Vincent Lu, Jie Peng, Kaidi Xu, Leandro Z. Agudelo, Peter Qian, Tianlong Chen
Title: Symbiotic Cooperation for Web Agents: Harnessing Complementary Strengths of Large and Small LLMs
Abstract:
Web browsing agents powered by large language models (LLMs) have shown tremendous potential in automating complex web-based tasks. Existing approaches typically rely on large LLMs (e.g., GPT-4o) to explore web environments and generate trajectory data, which is then used either for demonstration retrieval (for large LLMs) or to distill small LLMs (e.g., Llama3) in a process that remains decoupled from the exploration. In this paper, we propose AgentSymbiotic, an iterative framework that couples data synthesis with task-performance, yielding a "symbiotic improvement" for both large and small LLMs. Our study uncovers a complementary dynamic between LLM types: while large LLMs excel at generating high-quality trajectories for distillation, the distilled small LLMs-owing to their distinct reasoning capabilities-often choose actions that diverge from those of their larger counterparts. This divergence drives the exploration of novel trajectories, thereby enriching the synthesized data. However, we also observe that the performance of small LLMs becomes a bottleneck in this iterative enhancement process. To address this, we propose two innovations in LLM distillation: a speculative data synthesis strategy that mitigates off-policy bias, and a multi-task learning approach designed to boost the reasoning capabilities of the student LLM. Furthermore, we introduce a Hybrid Mode for Privacy Preservation to address user privacy concerns. Evaluated on the WEBARENA benchmark, AgentSymbiotic achieves SOTA performance with both LLM types. Our best Large LLM agent reaches 52%, surpassing the previous best of 45%, while our 8B distilled model demonstrates a competitive 49%, exceeding the prior best of 28%. Code will be released upon acceptance.
Chinese: AgentSymbiotic提出了一种迭代框架,将数据合成与任务性能相结合,通过推测性数据合成和多任务学习实现大小语言模型的共生优化,在WEBARENA基准测试中取得了最先进的性能表现。
English: AgentSymbiotic is an iterative framework that synergizes data synthesis with task performance, enabling symbiotic improvement between large and small LLMs through speculative data synthesis and multi-task learning, achieving state-of-the-art results on the WEBARENA benchmark.

Authors:Qian Shao, Bang Du, Zepeng Li, Qiyuan Chen, Hongxia Xu, Jimeng Sun, Jian Wu, Jintai Chen
Title: Generation of Drug-Induced Cardiac Reactions towards Virtual Clinical Trials
Abstract:
Clinical trials remain critical in cardiac drug development but face high failure rates due to efficacy limitations and safety risks, incurring substantial costs. In-silico trial methodologies, particularly generative models simulating drug-induced electrocardiogram (ECG) alterations, offer a potential solution to mitigate these challenges. While existing models show progress in ECG synthesis, their constrained fidelity and inability to characterize individual-specific pharmacological response patterns fundamentally limit clinical translatability. To address these issues, we propose a novel Drug-Aware Diffusion Model (DADM). Specifically, we construct a set of ordinary differential equations to provide external physical knowledge (EPK) of the realistic ECG morphology. The EPK is used to adaptively constrain the morphology of the generated ECGs through a dynamic cross-attention (DCA) mechanism. Furthermore, we propose an extension of ControlNet to incorporate demographic and drug data, simulating individual drug reactions. Compared to the other eight state-of-the-art (SOTA) ECG generative models: 1) Quantitative and expert evaluation demonstrate that DADM generates ECGs with superior fidelity; 2) Comparative results on two real-world databases covering 8 types of drug regimens verify that DADM can more accurately simulate drug-induced changes in ECGs, improving the accuracy by at least 5.79% and recall by 8%. In addition, the ECGs generated by DADM can also enhance model performance in downstream drug-effect classification tasks.
中文摘要:本文提出的药物感知扩散模型通过整合物理知识和药物数据,显著提升了心电图生成的保真度,能更准确地模拟药物引起的心电图变化,在多个评估指标上优于现有先进方法。
English Summary: The proposed Drug-Aware Diffusion Model (DADM) enhances ECG simulation by incorporating physical knowledge and drug data, achieving superior fidelity and more accurate prediction of drug-induced ECG changes compared to existing methods.

Authors:Xiao-Wen Yang, Xuan-Yi Zhu, Wen-Da Wei, Ding-Chu Zhang, Jie-Jing Shao, Zhi Zhou, Lan-Zhe Guo, Yu-Feng Li
Title: Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models
Abstract:
The integration of slow-thinking mechanisms into large language models (LLMs) offers a promising way toward achieving Level 2 AGI Reasoners, as exemplified by systems like OpenAI's o1. However, several significant challenges remain, including inefficient overthinking and an overreliance on auxiliary reward models. We point out that these limitations stem from LLMs' inability to internalize the search process, a key component of effective reasoning. A critical step toward addressing this issue is enabling LLMs to autonomously determine when and where to backtrack, a fundamental operation in traditional search algorithms. To this end, we propose a self-backtracking mechanism that equips LLMs with the ability to backtrack during both training and inference. This mechanism not only enhances reasoning ability but also efficiency by transforming slow-thinking processes into fast-thinking through self-improvement. Empirical evaluations demonstrate that our proposal significantly enhances the reasoning capabilities of LLMs, achieving a performance gain of over 40 percent compared to the optimal-path supervised fine-tuning method. We believe this study introduces a novel and promising pathway for developing more advanced and robust Reasoners.
中文摘要:将慢思考机制融入大型语言模型有望推进通用人工智能推理,但存在低效过度思考等挑战,而提出的自回溯机制通过自主回溯能力显著提升了推理性能与效率。
English Summary: Integrating slow-thinking mechanisms into LLMs shows promise for advancing AGI reasoning, but challenges like inefficient overthinking persist, which a proposed self-backtracking mechanism addresses by enabling autonomous backtracking to boost both reasoning ability and efficiency.

Authors:Boyu Mi, Hanqing Wang, Tai Wang, Yilun Chen, Jiangmiao Pang
Title: Language-to-Space Programming for Training-Free 3D Visual Grounding
Abstract:
3D visual grounding (3DVG) is challenging due to the need to understand 3D spatial relations. While supervised approaches have achieved superior performance, they are constrained by the scarcity and high annotation costs of 3D vision-language datasets. Training-free approaches based on LLMs/VLMs eliminate the need for large-scale training data, but they either incur prohibitive grounding time and token costs or have unsatisfactory accuracy. To address the challenges, we introduce a novel method for training-free 3D visual grounding, namely Language-to-Space Programming (LaSP). LaSP introduces LLM-generated codes to analyze 3D spatial relations among objects, along with a pipeline that evaluates and optimizes the codes automatically. Experimental results demonstrate that LaSP achieves 52.9% accuracy on the Nr3D benchmark, ranking among the best training-free methods. Moreover, it substantially reduces the grounding time and token costs, offering a balanced trade-off between performance and efficiency.
中文: 提出的语言到空间编程(LaSP)方法在Nr3D基准测试中达到52.9%准确率,提供了一种无需训练的3D视觉定位方案,在性能与降低的时间和令牌成本之间实现平衡。
English: The proposed Language-to-Space Programming (LaSP) method achieves 52.9% accuracy on the Nr3D benchmark, offering a training-free 3D visual grounding solution that balances performance with reduced time and token costs.

Authors:Giorgos Filandrianos, Angeliki Dimitriou, Maria Lymperaiou, Konstantinos Thomas, Giorgos Stamou
Title: Bias Beware: The Impact of Cognitive Biases on LLM-Driven Product Recommendations
Abstract:
The advent of Large Language Models (LLMs) has revolutionized product recommenders, yet their susceptibility to adversarial manipulation poses critical challenges, particularly in real-world commercial applications. Our approach is the first one to tap into human psychological principles, seamlessly modifying product descriptions, making such manipulations hard to detect. In this work, we investigate cognitive biases as black-box adversarial strategies, drawing parallels between their effects on LLMs and human purchasing behavior. Through extensive evaluation across models of varying scale, we find that certain biases, such as social proof, consistently boost product recommendation rate and ranking, while others, like scarcity and exclusivity, surprisingly reduce visibility. Our results demonstrate that cognitive biases are deeply embedded in state-of-the-art LLMs, leading to highly unpredictable behavior in product recommendations and posing significant challenges for effective mitigation.
中文摘要:本研究探讨了如何将认知偏见作为对抗性策略操纵大型语言模型的产品推荐,发现社会认同等偏见能提升产品可见度,而稀缺性却意外降低排名,导致缓解措施面临重大挑战。
English Summary: This study explores how cognitive biases can be manipulated as adversarial strategies in large language models to influence product recommendations, revealing that biases like social proof enhance visibility while scarcity unexpectedly reduces it, making mitigation challenging.

Authors:Shangjin Zhai, Nan Wang, Xiaomeng Wang, Danpeng Chen, Weijian Xie, Hujun Bao, Guofeng Zhang
Title: XR-VIO: High-precision Visual Inertial Odometry with Fast Initialization for XR Applications
Abstract:
This paper presents a novel approach to Visual Inertial Odometry (VIO), focusing on the initialization and feature matching modules. Existing methods for initialization often suffer from either poor stability in visual Structure from Motion (SfM) or fragility in solving a huge number of parameters simultaneously. To address these challenges, we propose a new pipeline for visual inertial initialization that robustly handles various complex scenarios. By tightly coupling gyroscope measurements, we enhance the robustness and accuracy of visual SfM. Our method demonstrates stable performance even with only four image frames, yielding competitive results. In terms of feature matching, we introduce a hybrid method that combines optical flow and descriptor-based matching. By leveraging the robustness of continuous optical flow tracking and the accuracy of descriptor matching, our approach achieves efficient, accurate, and robust tracking results. Through evaluation on multiple benchmarks, our method demonstrates state-of-the-art performance in terms of accuracy and success rate. Additionally, a video demonstration on mobile devices showcases the practical applicability of our approach in the field of Augmented Reality/Virtual Reality (AR/VR).
中文摘要:本文提出一种新颖的视觉惯性里程计方法,通过融合陀螺仪数据增强视觉SfM鲁棒性,并结合光流与描述符的混合特征匹配技术,在移动端AR/VR应用中实现了最优性能。
English Summary: This paper introduces a robust VIO initialization method that integrates gyroscope data to enhance visual SfM stability and a hybrid feature matching technique combining optical flow with descriptor matching, achieving state-of-the-art accuracy and mobile AR/VR applicability.

Authors:Zhi Zhou, Tan Yuhao, Zenan Li, Yuan Yao, Lan-Zhe Guo, Xiaoxing Ma, Yu-Feng Li
Title: Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning
Abstract:
Recent advancements in large language models (LLMs) have demonstrated remarkable reasoning capabilities. However, single-shot inference often yields unreliable results for complex reasoning tasks, leading researchers to explore multiple reasoning paths through methods such as perplexity and self-consistency. In this paper, we present the first theoretical error decomposition analysis of these techniques, breaking down their error into estimation error and model error. Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function, while self-consistency exhibits high estimation error due to a slow error convergence rate. To overcome these limitations, we propose Reasoning-Pruning Perplexity Consistency (RPC). This approach combines Perplexity Consistency, which seamlessly integrates LLM perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths to effectively prevent the degeneration of estimation error reduction. Theoretical analysis demonstrates that RPC not only accelerates the convergence rate of estimation error to an exponential level but also holds strong potential for further reducing model error. Extensive empirical evaluations on seven benchmark datasets confirm that RPC can significantly improve reasoning performance, sample efficiency, and confidence reliability.
中文摘要:本文提出RPC方法,通过结合困惑度一致性与推理剪枝技术,有效解决现有大语言模型推理方法的误差收敛慢和模型误差高的问题,在多个基准测试中显著提升了推理性能与置信度可靠性。
English Summary: This paper introduces RPC, a novel method that combines perplexity consistency with reasoning pruning to address the limitations of existing LLM reasoning techniques by accelerating error convergence and reducing model error, as validated across multiple benchmarks.

Authors:James Begin, Namit Agrawal, Eshan Singh, Yicheng Fu, Sean O'Brien, Vasu Sharma, Kevin Zhu
Title: Pause-Tuning for Long-Context Comprehension: A Lightweight Approach to LLM Attention Recalibration
Abstract:
LLMs have demonstrated remarkable proficiency in understanding tasks but continue to struggle with long-context comprehension, particularly with content located in the middle of extensive inputs. This limitation, known as the Lost-in-the-Middle (LITM) problem, hinders models from fully processing and utilizing information across lengthy contexts. To address this issue, we introduce pause-tuning, a technique that redistributes attention to enhance comprehension of long-context inputs. Our approach involves fine-tuning language models on datasets with artificially inserted pause tokens, which serve to segment the input into smaller, more manageable parts. We evaluate pause-tuning against alternative approaches using the Needle-in-a-Haystack benchmark, where models must retrieve information embedded within contexts of up to 128K tokens. Experimental results demonstrate significant performance gains, with the LLaMA 3.2 3B Instruct model and the LLaMA 3.1 8B Instruct model improving by 10.61% and 3.57% respectively on average, suggesting that pause-tuning successfully enhances attention redistribution and improves long-context retention. The code and data are available at https://anonymous.4open.science/r/LITM-PauseTokens-7357.
中文: 为解决大语言模型在长文本理解中的“迷失在中间”问题,本文提出暂停调优技术,通过人工暂停令牌重新分配注意力,在针在草堆基准测试中显著提升了模型的长上下文理解能力。
English: Pause-tuning is introduced to address the Lost-in-the-Middle problem in LLMs by redistributing attention through artificial pause tokens, significantly improving long-context comprehension as demonstrated by performance gains on the Needle-in-a-Haystack benchmark.

Authors:Jiaming Zhou, Yujie Guo, Shiwan Zhao, Haoqin Sun, Hui Wang, Jiabei He, Aobo Kong, Shiyao Wang, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin
Title: CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition
Abstract:
Code-switching (CS), the alternation between two or more languages within a single conversation, presents significant challenges for automatic speech recognition (ASR) systems. Existing Mandarin-English code-switching datasets often suffer from limitations in size, spontaneity, and the lack of full-length dialogue recordings with transcriptions, hindering the development of robust ASR models for real-world conversational scenarios. This paper introduces CS-Dialogue, a novel large-scale Mandarin-English code-switching speech dataset comprising 104 hours of spontaneous conversations from 200 speakers. Unlike previous datasets, CS-Dialogue provides full-length dialogue recordings with complete transcriptions, capturing naturalistic code-switching patterns in continuous speech. We describe the data collection and annotation processes, present detailed statistics of the dataset, and establish benchmark ASR performance using state-of-the-art models. Our experiments, using Transformer, Conformer, and Branchformer, demonstrate the challenges of code-switching ASR, and show that existing pre-trained models such as Whisper still have the space to improve. The CS-Dialogue dataset will be made freely available for all academic purposes.
中文:本文推出CS-Dialogue大规模中英语码转换数据集,包含104小时自然对话及完整转录,弥补现有数据缺陷;基准测试表明即使Whisper等先进模型在语码转换识别方面仍有提升空间。
English: This paper introduces CS-Dialogue, a large-scale Mandarin-English code-switching dataset featuring 104 hours of spontaneous dialogues with full transcriptions to address limitations in existing datasets and advance ASR research, with benchmark tests revealing performance gaps even in state-of-the-art models like Whisper.

Authors:Ranjan Sapkota, Shaina Raza, Manoj Karkee
Title: Comprehensive Analysis of Transparency and Accessibility of ChatGPT, DeepSeek, And other SoTA Large Language Models
Abstract:
Despite increasing discussions on open-source Artificial Intelligence (AI), existing research lacks a discussion on the transparency and accessibility of state-of-the-art (SoTA) Large Language Models (LLMs). The Open Source Initiative (OSI) has recently released its first formal definition of open-source software. This definition, when combined with standard dictionary definitions and the sparse published literature, provide an initial framework to support broader accessibility to AI models such as LLMs, but more work is essential to capture the unique dynamics of openness in AI. In addition, concerns about open-washing, where models claim openness but lack full transparency, has been raised, which limits the reproducibility, bias mitigation, and domain adaptation of these models. In this context, our study critically analyzes SoTA LLMs from the last five years, including ChatGPT, DeepSeek, LLaMA, and others, to assess their adherence to transparency standards and the implications of partial openness. Specifically, we examine transparency and accessibility from two perspectives: open-source vs. open-weight models. Our findings reveal that while some models are labeled as open-source, this does not necessarily mean they are fully open-sourced. Even in the best cases, open-source models often do not report model training data, and code as well as key metrics, such as weight accessibility, and carbon emissions. To the best of our knowledge, this is the first study that systematically examines the transparency and accessibility of over 100 different SoTA LLMs through the dual lens of open-source and open-weight models. The findings open avenues for further research and call for responsible and sustainable AI practices to ensure greater transparency, accountability, and ethical deployment of these models.(DeepSeek transparency, ChatGPT accessibility, open source, DeepSeek open source)
中文: 本研究批判性分析了过去五年中的100多个先进大语言模型,揭示许多所谓“开源”模型在训练数据、代码和关键指标方面缺乏完全透明度,同时指出了开放洗白的担忧,并呼吁建立更负责任的AI实践。
English: This study critically analyzes over 100 state-of-the-art large language models from the past five years, revealing that many so-called "open-source" models lack full transparency in training data, code, and key metrics, while highlighting concerns about open-washing and calling for more responsible AI practices.

Authors:Hantao Lou, Changye Li, Jiaming Ji, Yaodong Yang
Title: SAE-V: Interpreting Multimodal Models for Enhanced Alignment
Abstract:
With the integration of image modality, the semantic space of multimodal large language models (MLLMs) is more complex than text-only models, making their interpretability more challenging and their alignment less stable, particularly susceptible to low-quality data, which can lead to inconsistencies between modalities, hallucinations, and biased outputs. As a result, developing interpretability methods for MLLMs is crucial for improving alignment quality and efficiency. In text-only LLMs, Sparse Autoencoders (SAEs) have gained attention for their ability to interpret latent representations. However, extending SAEs to multimodal settings presents new challenges due to modality fusion and the difficulty of isolating cross-modal representations. To address these challenges, we introduce SAE-V, a mechanistic interpretability framework that extends the SAE paradigm to MLLMs. By identifying and analyzing interpretable features along with their corresponding data, SAE-V enables fine-grained interpretation of both model behavior and data quality, facilitating a deeper understanding of cross-modal interactions and alignment dynamics. Moreover, by utilizing cross-modal feature weighting, SAE-V provides an intrinsic data filtering mechanism to enhance model alignment without requiring additional models. Specifically, when applied to the alignment process of MLLMs, SAE-V-based data filtering methods could achieve more than 110% performance with less than 50% data. Our results highlight SAE-V's ability to enhance interpretability and alignment in MLLMs, providing insights into their internal mechanisms.
中文: SAE-V框架将稀疏自编码器扩展至多模态大语言模型,通过细粒度跨模态交互解析和内置数据过滤机制,仅用不到50%的数据即可实现110%以上的对齐性能提升。
English: The SAE-V framework extends sparse autoencoders to multimodal large language models, enabling fine-grained interpretation of cross-modal interactions and an intrinsic data filtering mechanism that boosts alignment performance by over 110% using less than half the data.

Authors:Yuxuan Liu, Hongda Sun, Wei Liu, Jian Luan, Bo Du, Rui Yan
Title: MobileSteward: Integrating Multiple App-Oriented Agents with Self-Evolution to Automate Cross-App Instructions
Abstract:
Mobile phone agents can assist people in automating daily tasks on their phones, which have emerged as a pivotal research spotlight. However, existing procedure-oriented agents struggle with cross-app instructions, due to the following challenges: (1) complex task relationships, (2) diverse app environment, and (3) error propagation and information loss in multi-step execution. Drawing inspiration from object-oriented programming principles, we recognize that object-oriented solutions is more suitable for cross-app instruction. To address these challenges, we propose a self-evolving multi-agent framework named MobileSteward, which integrates multiple app-oriented StaffAgents coordinated by a centralized StewardAgent. We design three specialized modules in MobileSteward: (1) Dynamic Recruitment generates a scheduling graph guided by information flow to explicitly associate tasks among apps. (2) Assigned Execution assigns the task to app-oriented StaffAgents, each equipped with app-specialized expertise to address the diversity between apps. (3) Adjusted Evaluation conducts evaluation to provide reflection tips or deliver key information, which alleviates error propagation and information loss during multi-step execution. To continuously improve the performance of MobileSteward, we develop a Memory-based Self-evolution mechanism, which summarizes the experience from successful execution, to improve the performance of MobileSteward. We establish the first English Cross-APP Benchmark (CAPBench) in the real-world environment to evaluate the agents' capabilities of solving complex cross-app instructions. Experimental results demonstrate that MobileSteward achieves the best performance compared to both single-agent and multi-agent frameworks, highlighting the superiority of MobileSteward in better handling user instructions with diverse complexity.
中文: MobileSteward是一个自我演进的多智能体框架,通过动态任务调度、分配执行和自适应评估来协调专业智能体,有效解决了跨应用自动化中的复杂任务关系、环境差异和错误传播等挑战,在处理复杂指令方面展现出卓越性能。
English: MobileSteward is a self-evolving multi-agent framework that addresses cross-app automation challenges by coordinating specialized agents through dynamic task scheduling, assigned execution, and adaptive evaluation, demonstrating superior performance in handling complex instructions.

Authors:Pengxiang Ding, Jianfei Ma, Xinyang Tong, Binghong Zou, Xinxin Luo, Yiguo Fan, Ting Wang, Hongchao Lu, Panzhong Mo, Jinxin Liu, Yuefan Wang, Huaicheng Zhou, Wenshuo Feng, Jiacheng Liu, Siteng Huang, Donglin Wang
Title: Humanoid-VLA: Towards Universal Humanoid Control with Visual Integration
Abstract:
This paper addresses the limitations of current humanoid robot control frameworks, which primarily rely on reactive mechanisms and lack autonomous interaction capabilities due to data scarcity. We propose Humanoid-VLA, a novel framework that integrates language understanding, egocentric scene perception, and motion control, enabling universal humanoid control. Humanoid-VLA begins with language-motion pre-alignment using non-egocentric human motion datasets paired with textual descriptions, allowing the model to learn universal motion patterns and action semantics. We then incorporate egocentric visual context through a parameter efficient video-conditioned fine-tuning, enabling context-aware motion generation. Furthermore, we introduce a self-supervised data augmentation strategy that automatically generates pseudoannotations directly derived from motion data. This process converts raw motion sequences into informative question-answer pairs, facilitating the effective use of large-scale unlabeled video data. Built upon whole-body control architectures, extensive experiments show that Humanoid-VLA achieves object interaction and environment exploration tasks with enhanced contextual awareness, demonstrating a more human-like capacity for adaptive and intelligent engagement.
中文: 本文提出Humanoid-VLA框架,通过整合语言理解、自我中心视觉和运动控制,结合高效数据学习和自监督增强策略,克服了现有人形机器人反应式控制的局限,实现了自主交互和类人适应能力。
English: This paper introduces Humanoid-VLA, a novel framework that integrates language understanding, egocentric vision, and motion control to overcome the limitations of reactive humanoid robots, enabling autonomous interaction through data-efficient learning and self-supervised data augmentation.

Authors:Ilias Diakonikolas, Giannis Iakovidis, Daniel M. Kane, Thanasis Pittas
Title: Efficient Multivariate Robust Mean Estimation Under Mean-Shift Contamination
Abstract:
We study the algorithmic problem of robust mean estimation of an identity covariance Gaussian in the presence of mean-shift contamination. In this contamination model, we are given a set of points in $\mathbb{R}^d$ generated i.i.d. via the following process. For a parameter $α<1/2$, the $i$-th sample $x_i$ is obtained as follows: with probability $1-α$, $x_i$ is drawn from $\mathcal{N}(μ, I)$, where $μ\in \mathbb{R}^d$ is the target mean; and with probability $α$, $x_i$ is drawn from $\mathcal{N}(z_i, I)$, where $z_i$ is unknown and potentially arbitrary. Prior work characterized the information-theoretic limits of this task. Specifically, it was shown that, in contrast to Huber contamination, in the presence of mean-shift contamination consistent estimation is possible. On the other hand, all known robust estimators in the mean-shift model have running times exponential in the dimension. Here we give the first computationally efficient algorithm for high-dimensional robust mean estimation with mean-shift contamination that can tolerate a constant fraction of outliers. In particular, our algorithm has near-optimal sample complexity, runs in sample-polynomial time, and approximates the target mean to any desired accuracy. Conceptually, our result contributes to a growing body of work that studies inference with respect to natural noise models lying in between fully adversarial and random settings.
中文: 本文首次提出了在均值漂移污染模型下进行鲁棒均值估计的计算高效算法,该算法具有接近最优的样本复杂度、多项式运行时间,并能以任意精度逼近目标均值。
English: This paper presents the first computationally efficient algorithm for robust mean estimation under mean-shift contamination, achieving near-optimal sample complexity and polynomial runtime while approximating the target mean to arbitrary accuracy.

Authors:Moustapha Awwalou Diouf, Samuel Ouya, Jacques Klein, Tegawendé F. Bissyandé
Title: Software Security in Software-Defined Networking: A Systematic Literature Review
Abstract:
Software-defined networking (SDN) has shifted network management by decoupling the data and control planes. This enables programmatic control via software applications using open APIs. SDN's programmability has fueled its popularity but may have opened issues extending the attack surface by introducing vulnerable software. Therefore, the research community needs to have a deep and broad understanding of the risks posed by SDN to propose mitigating measures. The literature, however, lacks a comprehensive review of the current state of research in this direction. This paper addresses this gap by providing a comprehensive overview of the state-of-the-art research in SDN security focusing on the software (i.e., the controller, APIs, applications) part. We systematically reviewed 58 relevant publications to analyze trends, identify key testing and analysis methodologies, and categorize studied vulnerabilities. We further explore areas where the research community can make significant contributions. This work offers the most extensive and in-depth analysis of SDN software security to date.
中文: 本文通过系统分析58篇相关文献,对SDN软件安全研究现状进行了全面评述,识别了漏洞类型并指出未来研究方向,是目前最深入的领域分析。
English: This paper provides a comprehensive review of SDN software security by analyzing 58 publications to identify vulnerabilities and research gaps, offering the most extensive analysis to date.

Authors:Wei Zhao, Pengxiang Ding, Min Zhang, Zhefei Gong, Shuanghao Bai, Han Zhao, Donglin Wang
Title: VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation
Abstract:
Vision-language-action models (VLAs) have become increasingly popular in robot manipulation for their end-to-end design and remarkable performance. However, existing VLAs rely heavily on vision-language models (VLMs) that only support text-based instructions, neglecting the more natural speech modality for human-robot interaction. Traditional speech integration methods usually involves a separate speech recognition system, which complicates the model and introduces error propagation. Moreover, the transcription procedure would lose non-semantic information in the raw speech, such as voiceprint, which may be crucial for robots to successfully complete customized tasks. To overcome above challenges, we propose VLAS, a novel end-to-end VLA that integrates speech recognition directly into the robot policy model. VLAS allows the robot to understand spoken commands through inner speech-text alignment and produces corresponding actions to fulfill the task. We also present two new datasets, SQA and CSI, to support a three-stage tuning process for speech instructions, which empowers VLAS with the ability of multimodal interaction across text, image, speech, and robot actions. Taking a step further, a voice retrieval-augmented generation (RAG) paradigm is designed to enable our model to effectively handle tasks that require individual-specific knowledge. Our extensive experiments show that VLAS can effectively accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience.
中文摘要:VLAS是一种新型端到端视觉-语言-动作模型,将语音识别直接集成到机器人策略中,通过语音检索增强生成技术实现自然语音操控,同时保留非语义语音信息以完成个性化任务。
English Summary: VLAS is a novel end-to-end vision-language-action model that integrates speech recognition directly into robot policy, enabling natural voice-controlled manipulation while preserving non-semantic speech information through voice retrieval-augmented generation.

Authors:Huaying Yuan, Jian Ni, Zheng Liu, Yueze Wang, Junjie Zhou, Zhengyang Liang, Bo Zhao, Zhao Cao, Zhicheng Dou, Ji-Rong Wen
Title: MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval
Abstract:
Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in terms of video length and task diversity, or they focus solely on the end-to-end LVU performance, making them inappropriate for evaluating whether key moments can be accurately accessed. To address this challenge, we propose MomentSeeker, a novel benchmark for long-video moment retrieval (LMVR), distinguished by the following features. First, it is created based on long and diverse videos, averaging over 1200 seconds in duration and collected from various domains, e.g., movie, anomaly, egocentric, and sports. Second, it covers a variety of real-world scenarios in three levels: global-level, event-level, object-level, covering common tasks like action recognition, object localization, and causal reasoning, etc. Third, it incorporates rich forms of queries, including text-only queries, image-conditioned queries, and video-conditioned queries. On top of MomentSeeker, we conduct comprehensive experiments for both generation-based approaches (directly using MLLMs) and retrieval-based approaches (leveraging video retrievers). Our results reveal the significant challenges in long-video moment retrieval in terms of accuracy and efficiency, despite improvements from the latest long-video MLLMs and task-specific fine-tuning. We have publicly released MomentSeeker(https://yhy-2000.github.io/MomentSeeker/) to facilitate future research in this area.
中文: MomentSeeker是一个新颖的长视频片段检索基准,通过采用多样化的长视频和多种查询形式,解决了现有基准在评估关键时刻准确定位方面的不足。
English: MomentSeeker is a new benchmark for long-video moment retrieval that uses diverse, lengthy videos and various query types to address the limitations of existing benchmarks in evaluating accurate key moment localization.

Authors:Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han
Title: GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs
Abstract:
The rapid development of Multimodal Large Language Models (MLLMs) has enabled the integration of multiple modalities, including texts and images, within the large language model (LLM) framework. However, texts and images are usually interconnected, forming a multimodal attributed graph (MMAG). It is underexplored how MLLMs can incorporate the relational information (\textit{i.e.}, graph structure) and semantic information (\textit{i.e.,} texts and images) on such graphs for multimodal comprehension and generation. In this paper, we propose GraphGPT-o, which supports omni-multimodal understanding and creation on MMAGs. We first comprehensively study linearization variants to transform semantic and structural information as input for MLLMs. Then, we propose a hierarchical aligner that enables deep graph encoding, bridging the gap between MMAGs and MLLMs. Finally, we explore the inference choices, adapting MLLM to interleaved text and image generation in graph scenarios. Extensive experiments on three datasets from different domains demonstrate the effectiveness of our proposed method. Datasets and codes will be open-sourced upon acceptance.
Chinese: 本文提出GraphGPT-o方法,通过线性化技术和分层对齐器,将多模态属性图的结构与语义信息整合到多模态大语言模型中,实现了对图数据的全面理解与生成能力。
English: This paper introduces GraphGPT-o, a novel method that enhances Multimodal Large Language Models to effectively understand and generate content from multimodal attributed graphs by integrating both structural and semantic information through linearization techniques and a hierarchical aligner.

Authors:SeongKu Kang, Bowen Jin, Wonbin Kweon, Yu Zhang, Dongha Lee, Jiawei Han, Hwanjo Yu
Title: Improving Scientific Document Retrieval with Concept Coverage-based Query Set Generation
Abstract:
In specialized fields like the scientific domain, constructing large-scale human-annotated datasets poses a significant challenge due to the need for domain expertise. Recent methods have employed large language models to generate synthetic queries, which serve as proxies for actual user queries. However, they lack control over the content generated, often resulting in incomplete coverage of academic concepts in documents. We introduce Concept Coverage-based Query set Generation (CCQGen) framework, designed to generate a set of queries with comprehensive coverage of the document's concepts. A key distinction of CCQGen is that it adaptively adjusts the generation process based on the previously generated queries. We identify concepts not sufficiently covered by previous queries, and leverage them as conditions for subsequent query generation. This approach guides each new query to complement the previous ones, aiding in a thorough understanding of the document. Extensive experiments demonstrate that CCQGen significantly enhances query quality and retrieval performance.
中文:CCQGen框架通过自适应调整查询以全面覆盖文档概念,解决了合成查询生成中概念覆盖不足的问题,显著提升了检索性能。
English: The CCQGen framework addresses the challenge of incomplete concept coverage in synthetic query generation by adaptively adjusting queries to ensure comprehensive document understanding, significantly improving retrieval performance.

Authors:Hui Wang, Shujie Liu, Lingwei Meng, Jinyu Li, Yifan Yang, Shiwan Zhao, Haiyang Sun, Yanqing Liu, Haoqin Sun, Jiaming Zhou, Yan Lu, Yong Qin
Title: FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
Abstract:
To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in https://aka.ms/felle.
中文: FELLE是一种自回归模型,融合语言建模与逐令牌流匹配技术,通过改进先验分布和采用由粗到细的分层生成机制,有效预测连续值梅尔频谱图令牌,显著提升了时序连贯性和语音合成质量。
English: FELLE is an autoregressive model that combines language modeling with token-wise flow matching to predict continuous-valued mel-spectrogram tokens, enhancing temporal coherence and synthesis quality through a modified prior distribution and coarse-to-fine hierarchical generation.

Authors:Qin Liu, Fei Wang, Chaowei Xiao, Muhao Chen
Title: VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap
Abstract:
The emergence of vision language models (VLMs) comes with increased safety concerns, as the incorporation of multiple modalities heightens vulnerability to attacks. Although VLMs can be built upon LLMs that have textual safety alignment, it is easily undermined when the vision modality is integrated. We attribute this safety challenge to the modality gap, a separation of image and text in the shared representation space, which blurs the distinction between harmful and harmless queries that is evident in LLMs but weakened in VLMs. To avoid safety decay and fulfill the safety alignment gap, we propose VLM-Guard, an inference-time intervention strategy that leverages the LLM component of a VLM as supervision for the safety alignment of the VLM. VLM-Guard projects the representations of VLM into the subspace that is orthogonal to the safety steering direction that is extracted from the safety-aligned LLM. Experimental results on three malicious instruction settings show the effectiveness of VLM-Guard in safeguarding VLM and fulfilling the safety alignment gap between VLM and its LLM component.
中文摘要:视觉语言模型(VLM)因图像与文本间的模态差异面临更大安全风险,削弱了从大型语言模型继承的安全对齐效果,而VLM-Guard通过在推理时利用LLM组件进行安全监督,成功填补了这一安全对齐缺口。
English Summary: Vision language models (VLMs) face heightened safety risks due to the modality gap between images and texts, which weakens the safety alignment inherited from LLMs, but VLM-Guard effectively addresses this by realigning VLM representations using LLM-based safety supervision during inference.

Authors:Harsh Poonia, Felix Divo, Kristian Kersting, Devendra Singh Dhami
Title: Exploring Neural Granger Causality with xLSTMs: Unveiling Temporal Dependencies in Complex Data
Abstract:
Causality in time series can be difficult to determine, especially in the presence of non-linear dependencies. The concept of Granger causality helps analyze potential relationships between variables, thereby offering a method to determine whether one time series can predict - Granger cause - future values of another. Although successful, Granger causal methods still struggle with capturing long-range relations between variables. To this end, we leverage the recently successful Extended Long Short-Term Memory (xLSTM) architecture and propose Granger causal xLSTMs (GC-xLSTM). It first enforces sparsity between the time series components by using a novel dynamic loss penalty on the initial projection. Specifically, we adaptively improve the model and identify sparsity candidates. Our joint optimization procedure then ensures that the Granger causal relations are recovered robustly. Our experimental evaluation on six diverse datasets demonstrates the overall efficacy of our proposed GC-xLSTM model.
中文: 提出的GC-xLSTM模型通过引入动态损失惩罚来增强变量间稀疏性,改进了格兰杰因果关系分析,能稳健捕捉时间序列中的长程依赖关系,并在多个数据集上验证了其优越性能。
English: The proposed GC-xLSTM model enhances Granger causality analysis by incorporating a novel dynamic loss penalty to enforce sparsity and robustly capture long-range dependencies in time series, demonstrating superior performance across diverse datasets.

Authors:Xulu Zhang, Xiaoyong Wei, Jinlin Wu, Jiaxin Wu, Zhaoxiang Zhang, Zhen Lei, Qing Li
Title: Generating on Generated: An Approach Towards Self-Evolving Diffusion Models
Abstract:
Recursive Self-Improvement (RSI) enables intelligence systems to autonomously refine their capabilities. This paper explores the application of RSI in text-to-image diffusion models, addressing the challenge of training collapse caused by synthetic data. We identify two key factors contributing to this collapse: the lack of perceptual alignment and the accumulation of generative hallucinations. To mitigate these issues, we propose three strategies: (1) a prompt construction and filtering pipeline designed to facilitate the generation of perceptual aligned data, (2) a preference sampling method to identify human-preferred samples and filter out generative hallucinations, and (3) a distribution-based weighting scheme to penalize selected samples with hallucinatory errors. Our extensive experiments validate the effectiveness of these approaches.
中文: 本文针对文本到图像扩散模型中的训练崩溃问题,提出了三种策略来缓解递归自我改进过程中的感知错位和生成幻觉积累。
English: This paper addresses training collapse in text-to-image diffusion models by proposing three strategies to mitigate perceptual misalignment and generative hallucinations during recursive self-improvement.

Authors:Ilias Diakonikolas, Giannis Iakovidis, Daniel M. Kane, Nikos Zarifis
Title: Robust Learning of Multi-index Models via Iterative Subspace Approximation
Abstract:
We study the task of learning Multi-Index Models (MIMs) with label noise under the Gaussian distribution. A $K$-MIM is any function $f$ that only depends on a $K$-dimensional subspace. We focus on well-behaved MIMs with finite ranges that satisfy certain regularity properties. Our main contribution is a general robust learner that is qualitatively optimal in the Statistical Query (SQ) model. Our algorithm iteratively constructs better approximations to the defining subspace by computing low-degree moments conditional on the projection to the subspace computed thus far, and adding directions with relatively large empirical moments. This procedure efficiently finds a subspace $V$ so that $f(\mathbf{x})$ is close to a function of the projection of $\mathbf{x}$ onto $V$. Conversely, for functions for which these conditional moments do not help, we prove an SQ lower bound suggesting that no efficient learner exists. As applications, we provide faster robust learners for the following concept classes: * {\bf Multiclass Linear Classifiers} We give a constant-factor approximate agnostic learner with sample complexity $N = O(d) 2^{\mathrm{poly}(K/ε)}$ and computational complexity $\mathrm{poly}(N ,d)$. This is the first constant-factor agnostic learner for this class whose complexity is a fixed-degree polynomial in $d$. * {\bf Intersections of Halfspaces} We give an approximate agnostic learner for this class achieving 0-1 error $K \tilde{O}(\mathrm{OPT}) + ε$ with sample complexity $N=O(d^2) 2^{\mathrm{poly}(K/ε)}$ and computational complexity $\mathrm{poly}(N ,d)$. This is the first agnostic learner for this class with near-linear error dependence and complexity a fixed-degree polynomial in $d$. Furthermore, we show that in the presence of random classification noise, the complexity of our algorithm scales polynomially with $1/ε$.
中文: 本文针对带标签噪声的多指标模型提出了统计最优的鲁棒学习算法,通过迭代计算矩来高效逼近低维子空间,并为多类线性分类器和半空间交集实现了计算复杂度为固定多项式阶的突破性改进。
English: This paper presents a statistically optimal robust learning algorithm for multi-index models with label noise, which efficiently approximates low-dimensional subspaces through iterative moment calculations and achieves polynomial complexity improvements for multiclass linear classifiers and intersections of halfspaces.

Authors:Hongyin Zhang, Pengxiang Ding, Shangke Lyu, Ying Peng, Donglin Wang
Title: GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation
Abstract:
With the rapid development of embodied artificial intelligence, significant progress has been made in vision-language-action (VLA) models for general robot decision-making. However, the majority of existing VLAs fail to account for the inevitable external perturbations encountered during deployment. These perturbations introduce unforeseen state information to the VLA, resulting in inaccurate actions and consequently, a significant decline in generalization performance. The classic internal model control (IMC) principle demonstrates that a closed-loop system with an internal model that includes external input signals can accurately track the reference input and effectively offset the disturbance. We propose a novel closed-loop VLA method GEVRM that integrates the IMC principle to enhance the robustness of robot visual manipulation. The text-guided video generation model in GEVRM can generate highly expressive future visual planning goals. Simultaneously, we evaluate perturbations by simulating responses, which are called internal embeddings and optimized through prototype contrastive learning. This allows the model to implicitly infer and distinguish perturbations from the external environment. The proposed GEVRM achieves state-of-the-art performance on both standard and perturbed CALVIN benchmarks and shows significant improvements in realistic robot tasks.
中文: 提出的GEVRM方法融合内模控制原理,通过生成未来视觉目标和原型对比学习隐式推断外部扰动,增强了机器人视觉操作的鲁棒性,在基准测试中达到最优性能。
English: The proposed GEVRM method integrates the internal model control principle to enhance robot visual manipulation robustness by generating future visual goals and implicitly inferring external perturbations through prototype contrastive learning, achieving state-of-the-art performance on benchmarks.

Authors:Abhishek Srivastava, Koushik Biswas, Gorkem Durak, Gulsah Ozden, Mustafa Adli, Ulas Bagci
Title: Is Long Range Sequential Modeling Necessary For Colorectal Tumor Segmentation?
Abstract:
Segmentation of colorectal cancer (CRC) tumors in 3D medical imaging is both complex and clinically critical, providing vital support for effective radiation therapy planning and survival outcome assessment. Recently, 3D volumetric segmentation architectures incorporating long-range sequence modeling mechanisms, such as Transformers and Mamba, have gained attention for their capacity to achieve high accuracy in 3D medical image segmentation. In this work, we evaluate the effectiveness of these global token modeling techniques by pitting them against our proposed MambaOutUNet within the context of our newly introduced colorectal tumor segmentation dataset (CTS-204). Our findings suggest that robust local token interactions can outperform long-range modeling techniques in cases where the region of interest is small and anatomically complex, proposing a potential shift in 3D tumor segmentation research.
中文摘要:本研究通过对比采用Transformer和Mamba等长程建模技术的三维分割模型与提出的MambaOutUNet在新结直肠肿瘤数据集上的表现,发现针对小而复杂的肿瘤区域,强化的局部特征交互可能优于全局建模方法。
English Summary: Recent 3D segmentation models using long-range modeling techniques like Transformers and Mamba are compared against the proposed MambaOutUNet on a new colorectal tumor dataset, revealing that robust local interactions can outperform global modeling for small, complex tumor regions.

Authors:Bo Gao, Yuan Wang, Qingsong Wei, Yong Liu, Rick Siow Mong Goh, David Lo
Title: AiRacleX: Automated Detection of Price Oracle Manipulations via LLM-Driven Knowledge Mining and Prompt Generation
Abstract:
Decentralized finance (DeFi) applications depend on accurate price oracles to ensure secure transactions, yet these oracles are highly vulnerable to manipulation, enabling attackers to exploit smart contract vulnerabilities for unfair asset valuation and financial gain. Detecting such manipulations traditionally relies on the manual effort of experienced experts, presenting significant challenges. In this paper, we propose a novel LLM-driven framework that automates the detection of price oracle manipulations by leveraging the complementary strengths of different LLM models (LLMs). Our approach begins with domain-specific knowledge extraction, where an LLM model synthesizes precise insights about price oracle vulnerabilities from top-tier academic papers, eliminating the need for profound expertise from developers or auditors. This knowledge forms the foundation for a second LLM model to generate structured, context-aware chain of thought prompts, which guide a third LLM model in accurately identifying manipulation patterns in smart contracts. We validate the effectiveness of framework through experiments on 60 known vulnerabilities from 46 real-world DeFi attacks or projects spanning 2021 to 2023. The best performing combination of LLMs (Haiku-Haiku-4o-mini) identified by AiRacleX demonstrate a 2.58-times improvement in recall (0.667 vs 0.259) compared to the state-of-the-art tool GPTScan, while maintaining comparable precision. Furthermore, our framework demonstrates the feasibility of replacing commercial models with open-source alternatives, enhancing privacy and security for developers.
中文摘要:本文提出了一种基于大语言模型的框架,通过协同多个模型进行知识提取和模式分析,实现了去中心化金融中价格预言机操纵的自动检测,在显著提升召回率的同时支持开源模型以增强隐私保护。
English Summary: This paper introduces an LLM-driven framework that automates the detection of price oracle manipulations in DeFi by leveraging multiple LLMs for knowledge extraction and analysis, achieving significantly higher recall than existing tools while enabling the use of open-source models for enhanced privacy.

Authors:Yan Weng, Fengbin Zhu, Tong Ye, Haoyan Liu, Fuli Feng, Tat-Seng Chua
Title: Optimizing Knowledge Integration in Retrieval-Augmented Generation with Self-Selection
Abstract:
Retrieval-Augmented Generation (RAG), which integrates external knowledge into Large Language Models (LLMs), has proven effective in enabling LLMs to produce more accurate and reliable responses. However, it remains a significant challenge how to effectively integrate external retrieved knowledge with internal parametric knowledge in LLMs. In this work, we propose a novel Self-Selection RAG framework, where the LLM is made to select from pairwise responses generated with internal parametric knowledge solely and with external retrieved knowledge together to achieve enhanced accuracy. To this end, we devise a Self-Selection-RGP method to enhance the capabilities of the LLM in both generating and selecting the correct answer, by training the LLM with Direct Preference Optimization (DPO) over a curated Retrieval Generation Preference (RGP) dataset. Experimental results with two open-source LLMs (i.e., Llama2-13B-Chat and Mistral-7B) well demonstrate the superiority of our approach over other baseline methods on Natural Questions (NQ) and TrivialQA datasets.
中文:自选RAG框架通过让大语言模型在仅用内部参数知识与结合外部检索知识生成的成对回答中进行选择,从而提升准确性,该方法经DPO训练验证,在基准数据集上表现优于现有基线。
English: The Self-Selection RAG framework enhances LLM accuracy by enabling the model to choose between responses generated with internal knowledge alone or combined with external retrieved knowledge, validated through DPO training on a curated dataset and superior performance on benchmark tests.

Authors:Bingjie Wu, Zitong Yu, Yiping Xie, Wei Liu, Chaoqi Luo, Yong Liu, Rick Siow Mong Goh
Title: Semi-rPPG: Semi-Supervised Remote Physiological Measurement with Curriculum Pseudo-Labeling
Abstract:
Remote Photoplethysmography (rPPG) is a promising technique to monitor physiological signals such as heart rate from facial videos. However, the labeled facial videos in this research are challenging to collect. Current rPPG research is mainly based on several small public datasets collected in simple environments, which limits the generalization and scale of the AI models. Semi-supervised methods that leverage a small amount of labeled data and abundant unlabeled data can fill this gap for rPPG learning. In this study, a novel semi-supervised learning method named Semi-rPPG that combines curriculum pseudo-labeling and consistency regularization is proposed to extract intrinsic physiological features from unlabelled data without impairing the model from noises. Specifically, a curriculum pseudo-labeling strategy with signal-to-noise ratio (SNR) criteria is proposed to annotate the unlabelled data while adaptively filtering out the low-quality unlabelled data. Besides, a novel consistency regularization term for quasi-periodic signals is proposed through weak and strong augmented clips. To benefit the research on semi-supervised rPPG measurement, we establish a novel semi-supervised benchmark for rPPG learning through intra-dataset and cross-dataset evaluation on four public datasets. The proposed Semi-rPPG method achieves the best results compared with three classical semi-supervised methods under different protocols. Ablation studies are conducted to prove the effectiveness of the proposed methods.
中文: 本研究提出名为Semi-rPPG的半监督学习方法,通过课程伪标签策略和一致性正则化技术,有效利用未标记面部视频提取生理特征,解决了远程光电容积描记技术中标注数据稀缺的问题,并在基准测试中展现出最优性能。
English: This study introduces Semi-rPPG, a novel semi-supervised learning method that employs curriculum pseudo-labeling and consistency regularization to effectively extract physiological features from unlabeled facial videos, addressing the scarcity of labeled data in remote photoplethysmography research and demonstrating superior performance on benchmark evaluations.

Authors:Bowen Jin, Jinsung Yoon, Zhen Qin, Ziqi Wang, Wei Xiong, Yu Meng, Jiawei Han, Sercan O. Arik
Title: LLM Alignment as Retriever Optimization: An Information Retrieval Perspective
Abstract:
Large Language Models (LLMs) have revolutionized artificial intelligence with capabilities in reasoning, coding, and communication, driving innovation across industries. Their true potential depends on effective alignment to ensure correct, trustworthy and ethical behavior, addressing challenges like misinformation, hallucinations, bias and misuse. While existing Reinforcement Learning (RL)-based alignment methods are notoriously complex, direct optimization approaches offer a simpler alternative. In this work, we introduce a novel direct optimization approach for LLM alignment by drawing on established Information Retrieval (IR) principles. We present a systematic framework that bridges LLM alignment and IR methodologies, mapping LLM generation and reward models to IR's retriever-reranker paradigm. Building on this foundation, we propose LLM Alignment as Retriever Preference Optimization (LarPO), a new alignment method that enhances overall alignment quality. Extensive experiments validate LarPO's effectiveness with 38.9 % and 13.7 % averaged improvement on AlpacaEval2 and MixEval-Hard respectively. Our work opens new avenues for advancing LLM alignment by integrating IR foundations, offering a promising direction for future research.
中文: 本文提出LarPO这一新颖的直接优化方法,通过融合信息检索原理实现大语言模型对齐,在评估中展现出显著性能提升,为替代复杂的强化学习方法提供了更简洁的解决方案。
English: This paper introduces LarPO, a novel direct optimization method for Large Language Model (LLM) alignment that integrates Information Retrieval principles, demonstrating significant performance improvements in evaluations while offering a simpler alternative to complex reinforcement learning approaches.

Authors:Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiaqi Wang, Mengkang Hu, Zhi Chen, Wanxiang Che, Ting Liu
Title: ECM: A Unified Electronic Circuit Model for Explaining the Emergence of In-Context Learning and Chain-of-Thought in Large Language Model
Abstract:
Recent advancements in large language models (LLMs) have led to significant successes across various applications, where the most noticeable is to a series of emerging capabilities, particularly in the areas of In-Context Learning (ICL) and Chain-of-Thought (CoT). To better understand and control model performance, many studies have begun investigating the underlying causes of these phenomena and their impact on task outcomes. However, existing explanatory frameworks predominantly focus on isolating and explaining ICL and CoT independently, leading to an incomplete understanding of their combined influence on model performance. To address this gap, we propose the Electronic Circuit Model (ECM), which provides a foundation for developing scalable, learnable policies and improving the management of AI-generated content. Specifically, ECM conceptualizes model behavior as an electronic circuit: ICL is represented as semantic magnetic field to providing an additional voltage following Faraday's Law, while CoT is modeled as series resistors to constrain the model output performance following Ohm's Law. Experimental results demonstrate that the ECM effectively predicts and explains LLM performance across a variety of prompting strategies. Furthermore, we apply ECM to advanced reasoning strategy optimization on a series of tasks, such as the International Olympiad in Informatics (IOI) and the International Mathematical Olympiad (IMO), achieving competitive performance that surpasses nearly 80% of top human competitors.
Chinese: 本文提出电子电路模型(ECM),将上下文学习比作语义磁场、思维链比作串联电阻,为预测和优化大语言模型性能提供了统一框架,在国际信息学奥林匹克等任务中取得了超越近80%顶尖选手的竞争力表现。
English: This paper introduces the Electronic Circuit Model (ECM), which analogizes In-Context Learning to a semantic magnetic field and Chain-of-Thought to series resistors, providing a unified framework to predict and optimize LLM performance, achieving competitive results in tasks like IOI and IMO.

Authors:Qinzhuo Wu, Wei Liu, Jian Luan, Bin Wang
Title: ReachAgent: Enhancing Mobile Agent via Page Reaching and Operation
Abstract:
Recently, mobile AI agents have gained increasing attention. Given a task, mobile AI agents can interact with mobile devices in multiple steps and finally form a GUI flow that solves the task. However, existing agents tend to focus on most task-relevant elements at each step, leading to local optimal solutions and ignoring the overall GUI flow. To address this issue, we constructed a training dataset called MobileReach, which breaks the task into page reaching and operation subtasks. Furthermore, we propose ReachAgent, a two-stage framework that focuses on improving its task-completion abilities. It utilizes the page reaching and page operation subtasks, along with reward-based preference GUI flows, to further enhance the agent. Experimental results show that ReachAgent significantly improves the IoU Acc and Text Acc by 7.12% and 7.69% on the step-level and 4.72% and 4.63% on the task-level compared to the SOTA agent. Our data and code will be released upon acceptance.
Chinese: 为解决现有移动AI代理仅关注局部最优解的问题,我们提出了ReachAgent这一两阶段框架,通过页面到达与操作子任务及基于奖励的偏好GUI流程,显著提升了在步骤级和任务级上的准确率表现,优于当前最优代理。
English: To overcome the limitation of existing mobile AI agents that focus on local optimal solutions, we introduce ReachAgent, a two-stage framework utilizing page reaching and operation subtasks with reward-based preference GUI flows, which significantly outperforms the state-of-the-art agent in both step-level and task-level accuracy metrics.

Authors:Shuanghao Bai, Wanqi Zhou, Pengxiang Ding, Wei Zhao, Donglin Wang, Badong Chen
Title: Rethinking Latent Redundancy in Behavior Cloning: An Information Bottleneck Approach for Robot Manipulation
Abstract:
Behavior Cloning (BC) is a widely adopted visual imitation learning method in robot manipulation. Current BC approaches often enhance generalization by leveraging large datasets and incorporating additional visual and textual modalities to capture more diverse information. However, these methods overlook whether the learned representations contain redundant information and lack a solid theoretical foundation to guide the learning process. To address these limitations, we adopt an information-theoretic perspective and introduce mutual information to quantify and mitigate redundancy in latent representations. Building on this, we incorporate the Information Bottleneck (IB) principle into BC, which extends the idea of reducing redundancy by providing a structured framework for compressing irrelevant information while preserving task-relevant features. This work presents the first comprehensive study on redundancy in latent representations across various methods, backbones, and experimental settings, while extending the generalizability of the IB to BC. Extensive experiments and analyses on the CortexBench and LIBERO benchmarks demonstrate significant performance improvements with IB, underscoring the importance of reducing input data redundancy and highlighting its practical value for more practical applications. Project Page: https://baishuanghao.github.io/BC-IB.github.io.
中文摘要:本研究采用信息论视角,通过互信息和信息瓶颈原理减少行为克隆中潜在表征的冗余,在机器人操作基准测试中实现了显著性能提升。
English Summary: This study introduces an information-theoretic approach using mutual information and the Information Bottleneck principle to reduce redundancy in latent representations for Behavior Cloning, achieving significant performance improvements on robotics benchmarks.

Authors:Menglong Cui, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang
Title: Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study
Abstract:
Large language models (LLMs) have shown continuously improving multilingual capabilities, and even small-scale open-source models have demonstrated rapid performance enhancement. In this paper, we systematically explore the abilities of open LLMs with less than ten billion parameters to handle multilingual machine translation (MT) tasks. We conduct comprehensive evaluations on six popular LLMs and find that models like Gemma2-9B exhibit impressive multilingual translation capabilities. We then introduce the Parallel-First Monolingual-Second (PFMS) data mixing strategy in the continual pretraining stage to further enhance the MT performance and present GemmaX2-28, a 9B model achieving top-tier multilingual translation performance across 28 languages. Specifically, GemmaX2-28 consistently outperforms the state-of-the-art (SOTA) models such as TowerInstruct and XALMA and achieves competitive performance with Google Translate and GPT-4-turbo.
中文: 不足百亿参数的开源大语言模型展现出卓越的多语言翻译能力,其中GemmaX2-28通过创新的并行优先数据混合策略,在28种语言翻译中实现了顶尖性能。
English: Small-scale open-source large language models under 10B parameters demonstrate strong multilingual translation capabilities, with GemmaX2-28 achieving state-of-the-art performance across 28 languages through a novel data mixing strategy.

Authors:Sheng Zhang, Qianchu Liu, Guanghui Qin, Tristan Naumann, Hoifung Poon
Title: Med-RLVR: Emerging Medical Reasoning from a 3B base model via reinforcement Learning
Abstract:
Reinforcement learning from verifiable rewards (RLVR) has recently gained attention for its ability to elicit self-evolved reasoning capabilitie from base language models without explicit reasoning supervisions, as demonstrated by DeepSeek-R1. While prior work on RLVR has primarily focused on mathematical and coding domains, its applicability to other tasks and domains remains unexplored. In this work, we investigate whether medical reasoning can emerge from RLVR. We introduce Med-RLVR as an initial study of RLVR in the medical domain leveraging medical multiple-choice question answering (MCQA) data as verifiable labels. Our results demonstrate that RLVR is not only effective for math and coding but also extends successfully to medical question answering. Notably, Med-RLVR achieves performance comparable to traditional supervised fine-tuning (SFT) on in-distribution tasks while significantly improving out-of-distribution generalization, with an 8-point accuracy gain. Further analysis of training dynamics reveals that, with no explicit reasoning supervision, reasoning emerges from the 3B-parameter base model. These findings underscore the potential of RLVR in domains beyond math and coding, opening new avenues for its application in knowledge-intensive fields such as medicine.
中文: RLVR成功扩展到医学推理领域,使30亿参数的基础模型在无需显式监督的情况下,实现了与监督微调相当的分布内性能,并在分布外泛化上获得8个百分点的准确率提升。
English: RLVR effectively extends to medical reasoning, enabling a 3B-parameter model to achieve comparable in-distribution performance and superior out-of-distribution generalization with an 8-point accuracy gain without explicit supervision.

Authors:Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Boyang Zhang, Zhenhui Ye, Chen Zhang, Bai Jionghao, Xiaoda Yang, Jialong Zuo, Yu Zhang, Rui Liu, Xiang Yin, Zhou Zhao
Title: MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
Abstract:
While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces \textit{MegaTTS 3}, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.
Chinese: MegaTTS 3采用创新的稀疏对齐算法,在提升零样本文本转语音的语音文本对齐效果和自然度的同时,实现了灵活的语音强度控制和高效生成。
English: MegaTTS 3 introduces a sparse alignment algorithm that enhances zero-shot text-to-speech performance by improving speech-text alignment and naturalness while enabling flexible accent control and efficient generation.

Authors:Hongzhan Lin, Yang Deng, Yuxuan Gu, Wenxuan Zhang, Jing Ma, See-Kiong Ng, Tat-Seng Chua
Title: FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models
Abstract:
Large Language Models (LLMs) have significantly advanced the fact-checking studies. However, existing automated fact-checking evaluation methods rely on static datasets and classification metrics, which fail to automatically evaluate the justification production and uncover the nuanced limitations of LLMs in fact-checking. In this work, we introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs' fact-checking capabilities. Leveraging importance sampling principles and multi-agent collaboration, FACT-AUDIT generates adaptive and scalable datasets, performs iterative model-centric evaluations, and updates assessments based on model-specific responses. By incorporating justification production alongside verdict prediction, this framework provides a comprehensive and evolving audit of LLMs' factual reasoning capabilities, to investigate their trustworthiness. Extensive experiments demonstrate that FACT-AUDIT effectively differentiates among state-of-the-art LLMs, providing valuable insights into model strengths and limitations in model-centric fact-checking analysis.
中文:FACT-AUDIT是一个基于智能体的框架,通过生成自适应数据集并综合评估预测结论与论证过程,动态检验大语言模型的事实核查能力,从而对其事实推理进行全面审计。
English: FACT-AUDIT is an agent-driven framework that dynamically evaluates LLMs' fact-checking abilities by generating adaptive datasets and assessing both verdict predictions and justifications, offering a comprehensive audit of their factual reasoning.

Authors:Martin Kuo, Jingyang Zhang, Jianyi Zhang, Minxue Tang, Louis DiValentin, Aolin Ding, Jingwei Sun, William Chen, Amin Hass, Tianlong Chen, Yiran Chen, Hai Li
Title: Proactive Privacy Amnesia for Large Language Models: Safeguarding PII with Negligible Impact on Model Utility
Abstract:
With the rise of large language models (LLMs), increasing research has recognized their risk of leaking personally identifiable information (PII) under malicious attacks. Although efforts have been made to protect PII in LLMs, existing methods struggle to balance privacy protection with maintaining model utility. In this paper, inspired by studies of amnesia in cognitive science, we propose a novel approach, Proactive Privacy Amnesia (PPA), to safeguard PII in LLMs while preserving their utility. This mechanism works by actively identifying and forgetting key memories most closely associated with PII in sequences, followed by a memory implanting using suitable substitute memories to maintain the LLM's functionality. We conduct evaluations across multiple models to protect common PII, such as phone numbers and physical addresses, against prevalent PII-targeted attacks, demonstrating the superiority of our method compared with other existing defensive techniques. The results show that our PPA method completely eliminates the risk of phone number exposure by 100% and significantly reduces the risk of physical address exposure by 9.8% - 87.6%, all while maintaining comparable model utility performance.
中文: 提出的主动隐私遗忘(PPA)方法通过选择性遗忘敏感数据和植入替代记忆,有效保护大语言模型中的个人信息,在保持模型性能的同时完全防止电话号码泄露,并将地址暴露风险显著降低9.8%至87.6%。
English: The proposed Proactive Privacy Amnesia (PPA) method effectively protects personal information in large language models by selectively forgetting sensitive data and implanting substitute memories, achieving complete prevention of phone number leaks and significantly reducing address exposure risks while maintaining model utility.

Authors:Haoran Li, Wenbin Hu, Huihao Jing, Yulin Chen, Qi Hu, Sirui Han, Tianshu Chu, Peizhao Hu, Yangqiu Song
Title: PrivaCI-Bench: Evaluating Privacy with Contextual Integrity and Legal Compliance
Abstract:
Recent advancements in generative large language models (LLMs) have enabled wider applicability, accessibility, and flexibility. However, their reliability and trustworthiness are still in doubt, especially for concerns regarding individuals' data privacy. Great efforts have been made on privacy by building various evaluation benchmarks to study LLMs' privacy awareness and robustness from their generated outputs to their hidden representations. Unfortunately, most of these works adopt a narrow formulation of privacy and only investigate personally identifiable information (PII). In this paper, we follow the merit of the Contextual Integrity (CI) theory, which posits that privacy evaluation should not only cover the transmitted attributes but also encompass the whole relevant social context through private information flows. We present PrivaCI-Bench, a comprehensive contextual privacy evaluation benchmark targeted at legal compliance to cover well-annotated privacy and safety regulations, real court cases, privacy policies, and synthetic data built from the official toolkit to study LLMs' privacy and safety compliance. We evaluate the latest LLMs, including the recent reasoner models QwQ-32B and Deepseek R1. Our experimental results suggest that though LLMs can effectively capture key CI parameters inside a given context, they still require further advancements for privacy compliance.
Chinese: 近期生成式大语言模型应用广泛但隐私可靠性存疑,为此构建了PrivaCI-Bench上下文隐私评估基准,测试显示模型虽能理解语境参数,但在满足隐私合规性方面仍需提升。
English: Recent generative large language models (LLMs) show broad applicability but raise privacy concerns, leading to the development of PrivaCI-Bench, a contextual privacy benchmark that evaluates LLMs' compliance with legal standards, revealing their ability to grasp contextual integrity yet needing improvement for full privacy adherence.

Authors:Yulin Chen, Haoran Li, Yuan Sui, Yufei He, Yue Liu, Yangqiu Song, Bryan Hooi
Title: Can Indirect Prompt Injection Attacks Be Detected and Removed?
Abstract:
Prompt injection attacks manipulate large language models (LLMs) by misleading them to deviate from the original input instructions and execute maliciously injected instructions, because of their instruction-following capabilities and inability to distinguish between the original input instructions and maliciously injected instructions. To defend against such attacks, recent studies have developed various detection mechanisms. If we restrict ourselves specifically to works which perform detection rather than direct defense, most of them focus on direct prompt injection attacks, while there are few works for the indirect scenario, where injected instructions are indirectly from external tools, such as a search engine. Moreover, current works mainly investigate injection detection methods and pay less attention to the post-processing method that aims to mitigate the injection after detection. In this paper, we investigate the feasibility of detecting and removing indirect prompt injection attacks, and we construct a benchmark dataset for evaluation. For detection, we assess the performance of existing LLMs and open-source detection models, and we further train detection models using our crafted training datasets. For removal, we evaluate two intuitive methods: (1) the segmentation removal method, which segments the injected document and removes parts containing injected instructions, and (2) the extraction removal method, which trains an extraction model to identify and remove injected instructions.
中文摘要:提示注入攻击利用大语言模型的指令跟随特性执行恶意指令,当前研究主要集中于直接攻击检测而忽视间接场景及检测后处理,本研究专门针对间接提示注入开发了检测与清除技术并构建评估基准。
English Summary: Prompt injection attacks exploit LLMs' instruction-following nature to execute malicious commands, with current research primarily focusing on direct attack detection while neglecting indirect scenarios and post-detection mitigation methods, leading this study to develop detection and removal techniques specifically for indirect prompt injections.

Authors:Yulin Chen, Haoran Li, Yuan Sui, Yufei He, Yue Liu, Yangqiu Song, Bryan Hooi
Title: Can Indirect Prompt Injection Attacks Be Detected and Removed?
Abstract:
Prompt injection attacks manipulate large language models (LLMs) by misleading them to deviate from the original input instructions and execute maliciously injected instructions, because of their instruction-following capabilities and inability to distinguish between the original input instructions and maliciously injected instructions. To defend against such attacks, recent studies have developed various detection mechanisms. If we restrict ourselves specifically to works which perform detection rather than direct defense, most of them focus on direct prompt injection attacks, while there are few works for the indirect scenario, where injected instructions are indirectly from external tools, such as a search engine. Moreover, current works mainly investigate injection detection methods and pay less attention to the post-processing method that aims to mitigate the injection after detection. In this paper, we investigate the feasibility of detecting and removing indirect prompt injection attacks, and we construct a benchmark dataset for evaluation. For detection, we assess the performance of existing LLMs and open-source detection models, and we further train detection models using our crafted training datasets. For removal, we evaluate two intuitive methods: (1) the segmentation removal method, which segments the injected document and removes parts containing injected instructions, and (2) the extraction removal method, which trains an extraction model to identify and remove injected instructions.
中文摘要:提示注入攻击利用大语言模型的指令跟随特性执行恶意指令,当前研究主要集中于直接攻击检测而忽视间接场景及检测后处理,本研究专门针对间接提示注入开发了检测与清除技术并构建评估基准。
English Summary: Prompt injection attacks exploit LLMs' instruction-following nature to execute malicious commands, with current research primarily focusing on direct attack detection while neglecting indirect scenarios and post-detection mitigation methods, leading this study to develop detection and removal techniques specifically for indirect prompt injections.

Authors:Mozhgan Navardi, Romina Aalishah, Yuzhe Fu, Yueqian Lin, Hai Li, Yiran Chen, Tinoosh Mohsenin
Title: GenAI at the Edge: Comprehensive Survey on Empowering Edge Devices
Abstract:
Generative Artificial Intelligence (GenAI) applies models and algorithms such as Large Language Model (LLM) and Foundation Model (FM) to generate new data. GenAI, as a promising approach, enables advanced capabilities in various applications, including text generation and image processing. In current practice, GenAI algorithms run mainly on the cloud server, leading to high latency and raising security concerns. Consequently, these challenges encourage the deployment of GenAI algorithms directly on edge devices. However, the large size of such models and their significant computational resource requirements pose obstacles when deploying them in resource-constrained systems. This survey provides a comprehensive overview of recent proposed techniques that optimize GenAI for efficient deployment on resource-constrained edge devices. For this aim, this work highlights three main categories for bringing GenAI to the edge: software optimization, hardware optimization, and frameworks. The main takeaways for readers of this survey will be a clear roadmap to design, implement, and refine GenAI systems for real-world implementation on edge devices.
Chinese: 生成式人工智能采用大型语言模型等技术生成新数据,当前主要部署于云端存在延迟和安全问题,因此转向边缘设备部署,需通过软件、硬件和框架优化解决资源受限系统的挑战。
English: Generative AI, which uses large models like LLMs to create new content, faces challenges in cloud deployment such as latency and security, prompting a shift toward edge deployment with optimizations in software, hardware, and frameworks to overcome resource constraints.

Authors:Huimin Xu, Xin Mao, Feng-Lin Li, Xiaobao Wu, Wang Chen, Wei Zhang, Anh Tuan Luu
Title: Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning
Abstract:
Direct Preference Optimization (DPO) often struggles with long-chain mathematical reasoning. Existing approaches, such as Step-DPO, typically improve this by focusing on the first erroneous step in the reasoning chain. However, they overlook all other steps and rely heavily on humans or GPT-4 to identify erroneous steps. To address these issues, we propose Full-Step-DPO, a novel DPO framework tailored for mathematical reasoning. Instead of optimizing only the first erroneous step, it leverages step-wise rewards from the entire reasoning chain. This is achieved by training a self-supervised process reward model, which automatically scores each step, providing rewards while avoiding reliance on external signals. Furthermore, we introduce a novel step-wise DPO loss, which dynamically updates gradients based on these step-wise rewards. This endows stronger reasoning capabilities to language models. Extensive evaluations on both in-domain and out-of-domain mathematical reasoning benchmarks across various base language models, demonstrate that Full-Step-DPO achieves superior performance compared to state-of-the-art baselines.
中文: Full-Step-DPO是一种新颖的框架,通过利用整个推理链的逐步奖励,结合自监督过程奖励模型和动态逐步DPO损失,显著提升了数学推理能力,并在多种基准测试中表现优异。
English: Full-Step-DPO is a novel framework that enhances mathematical reasoning by utilizing step-wise rewards from the entire reasoning chain, trained through a self-supervised process reward model and a dynamic step-wise DPO loss, achieving superior performance across benchmarks.

Authors:Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Hai Li, Yiran Chen
Title: H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking
Abstract:
Large Reasoning Models (LRMs) have recently extended their powerful reasoning capabilities to safety checks-using chain-of-thought reasoning to decide whether a request should be answered. While this new approach offers a promising route for balancing model utility and safety, its robustness remains underexplored. To address this gap, we introduce Malicious-Educator, a benchmark that disguises extremely dangerous or malicious requests beneath seemingly legitimate educational prompts. Our experiments reveal severe security flaws in popular commercial-grade LRMs, including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking. For instance, although OpenAI's o1 model initially maintains a high refusal rate of about 98%, subsequent model updates significantly compromise its safety; and attackers can easily extract criminal strategies from DeepSeek-R1 and Gemini 2.0 Flash Thinking without any additional tricks. To further highlight these vulnerabilities, we propose Hijacking Chain-of-Thought (H-CoT), a universal and transferable attack method that leverages the model's own displayed intermediate reasoning to jailbreak its safety reasoning mechanism. Under H-CoT, refusal rates sharply decline-dropping from 98% to below 2%-and, in some instances, even transform initially cautious tones into ones that are willing to provide harmful content. We hope these findings underscore the urgent need for more robust safety mechanisms to preserve the benefits of advanced reasoning capabilities without compromising ethical standards.
中文: 该研究通过恶意教育者基准和劫持思维链攻击方法,揭示了主流大型推理模型存在严重安全漏洞,导致拒绝率急剧下降并破坏安全机制。
English: The study introduces the Malicious-Educator benchmark and Hijacking Chain-of-Thought attack, revealing critical vulnerabilities in leading Large Reasoning Models that drastically reduce their refusal rates and compromise safety mechanisms.

Authors:Cong-Duy Nguyen, Xiaobao Wu, Duc Anh Vu, Shuai Zhao, Thong Nguyen, Anh Tuan Luu
Title: CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, but they remain susceptible to hallucination, particularly object hallucination where non-existent objects or incorrect attributes are fabricated in generated descriptions. Existing detection methods achieve strong performance but rely heavily on expensive API calls and iterative LVLM-based validation, making them impractical for large-scale or offline use. To address these limitations, we propose CutPaste\&Find, a lightweight and training-free framework for detecting hallucinations in LVLM-generated outputs. Our approach leverages off-the-shelf visual and linguistic modules to perform multi-step verification efficiently without requiring LVLM inference. At the core of our framework is a Visual-aid Knowledge Base that encodes rich entity-attribute relationships and associated image representations. We introduce a scaling factor to refine similarity scores, mitigating the issue of suboptimal alignment values even for ground-truth image-text pairs. Comprehensive evaluations on benchmark datasets, including POPE and R-Bench, demonstrate that CutPaste\&Find achieves competitive hallucination detection performance while being significantly more efficient and cost-effective than previous methods.
Chinese Summary: 提出的CutPaste&Find框架通过使用现成模块进行高效验证而无需大型视觉语言模型推理,为检测大视觉语言模型中的幻觉提供了一种轻量级、免训练的解决方案,在实现竞争性性能的同时显著提高了效率。
English Summary: The proposed CutPaste&Find framework provides a lightweight, training-free solution for detecting hallucinations in Large Vision-Language Models by using off-the-shelf modules for efficient verification without requiring LVLM inference, achieving competitive performance with greater efficiency.

Authors:Qiyuan Zhang, Yufei Wang, Yuxin Jiang, Liangyou Li, Chuhan Wu, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma
Title: Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge
Abstract:
LLM-as-a-Judge, which generates chain-of-thought (CoT) judgments, has become a widely adopted auto-evaluation method. However, its reliability is compromised by the CoT reasoning's inability to capture comprehensive and deeper details, often leading to incomplete outcomes. Existing methods mainly rely on majority voting or criteria expansion, which is insufficient to address the limitation in CoT. We propose Crowd-based Comparative Evaluation, which introduces additional crowd responses to compare with the candidate responses, thereby exposing deeper and more comprehensive details within the candidate responses. This process effectively guides LLM-as-a-Judge to provide a more detailed CoT judgment. Extensive experiments demonstrate that our approach enhances evaluation reliability, achieving an average accuracy gain of 6.7% across five benchmarks. Moreover, our method produces higher-quality CoTs that facilitate judge distillation and exhibit superior performance in rejection sampling for supervised fine-tuning (SFT), referred to as crowd rejection sampling, thereby enabling more efficient SFT. Our analysis confirms that CoTs generated by ours are more comprehensive and of higher quality, and evaluation accuracy improves as inference scales.
中文: LLM-as-a-Judge的可靠性因思维链推理不完整而受限,但提出的基于群体的比较评估方法通过揭示回答中更深层细节,有效提升了评估准确性并实现了更高效的监督微调。
English: LLM-as-a-Judge's reliability is limited by incomplete chain-of-thought reasoning, but the proposed crowd-based comparative evaluation enhances it by exposing deeper details in responses, improving accuracy and enabling more efficient supervised fine-tuning.

Authors:Jian Jia, Jingtong Gao, Ben Xue, Junhao Wang, Qingpeng Cai, Quan Chen, Xiangyu Zhao, Peng Jiang, Kun Gai
Title: From Principles to Applications: A Comprehensive Survey of Discrete Tokenizers in Generation, Comprehension, Recommendation, and Information Retrieval
Abstract:
Discrete tokenizers have emerged as indispensable components in modern machine learning systems, particularly within the context of autoregressive modeling and large language models (LLMs). These tokenizers serve as the critical interface that transforms raw, unstructured data from diverse modalities into discrete tokens, enabling LLMs to operate effectively across a wide range of tasks. Despite their central role in generation, comprehension, and recommendation systems, a comprehensive survey dedicated to discrete tokenizers remains conspicuously absent in the literature. This paper addresses this gap by providing a systematic review of the design principles, applications, and challenges of discrete tokenizers. We begin by dissecting the sub-modules of tokenizers and systematically demonstrate their internal mechanisms to provide a comprehensive understanding of their functionality and design. Building on this foundation, we synthesize state-of-the-art methods, categorizing them into multimodal generation and comprehension tasks, and semantic tokens for personalized recommendations. Furthermore, we critically analyze the limitations of existing tokenizers and outline promising directions for future research. By presenting a unified framework for understanding discrete tokenizers, this survey aims to guide researchers and practitioners in addressing open challenges and advancing the field, ultimately contributing to the development of more robust and versatile AI systems.
中文: 本文系统综述了离散分词器的设计原理、在多模态任务与推荐系统中的应用及挑战,旨在为未来研究提供统一框架,推动更稳健人工智能系统的发展。
English: This paper provides a systematic review of discrete tokenizers, examining their design principles, applications across multimodal tasks and recommendations, and challenges to guide future research in developing more robust AI systems.

Authors:Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Xin Xu, Mengdi Zhang, Jian Shao, Yueting Zhuang
Title: MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task
Abstract:
Mathematical reasoning represents a critical frontier in advancing large language models (LLMs). While step-by-step approaches have emerged as the dominant paradigm for mathematical problem-solving in LLMs, the quality of reasoning steps in training data fundamentally constrains the performance of the models. Recent studies has demonstrated that more detailed intermediate steps can enhance model performance, yet existing methods for step expansion either require more powerful external models or incur substantial computational costs. In this paper, we introduce MathFimer, a novel framework for mathematical reasoning step expansion inspired by the "Fill-in-the-middle" task from code completion. By decomposing solution chains into prefix-suffix pairs and training models to reconstruct missing intermediate steps, we develop a specialized model, MathFimer-7B, on our carefully curated NuminaMath-FIM dataset. We then apply these models to enhance existing mathematical reasoning datasets by inserting detailed intermediate steps into their solution chains, creating MathFimer-expanded versions. Through comprehensive experiments on multiple mathematical reasoning datasets, including MathInstruct, MetaMathQA and etc., we demonstrate that models trained on MathFimer-expanded data consistently outperform their counterparts trained on original data across various benchmarks such as GSM8K and MATH. Our approach offers a practical, scalable solution for enhancing mathematical reasoning capabilities in LLMs without relying on powerful external models or expensive inference procedures.
中文: MathFimer框架通过训练模型重构详细的中间推理步骤,有效提升大型语言模型的数学推理能力,无需依赖外部模型或高昂计算成本即可在多个基准测试中实现更优表现。
English: The MathFimer framework enhances mathematical reasoning in large language models by training them to reconstruct detailed intermediate steps, improving performance across benchmarks without requiring external models or high computational costs.

Authors:Ziqiong Wang, Xiaoxue Yu, Rongpeng Li, Zhifeng Zhao
Title: Robust Event-Triggered Integrated Communication and Control with Graph Information Bottleneck Optimization
Abstract:
Integrated communication and control serves as a critical ingredient in Multi-Agent Reinforcement Learning. However, partial observability limitations will impair collaboration effectiveness, and a potential solution is to establish consensus through well-calibrated latent variables obtained from neighboring agents. Nevertheless, the rigid transmission of less informative content can still result in redundant information exchanges. Therefore, we propose a Consensus-Driven Event-Based Graph Information Bottleneck (CDE-GIB) method, which integrates the communication graph and information flow through a GIB regularizer to extract more concise message representations while avoiding the high computational complexity of inner-loop operations. To further minimize the communication volume required for establishing consensus during interactions, we also develop a variable-threshold event-triggering mechanism. By simultaneously considering historical data and current observations, this mechanism capably evaluates the importance of information to determine whether an event should be triggered. Experimental results demonstrate that our proposed method outperforms existing state-of-the-art methods in terms of both efficiency and adaptability.
Chinese: 提出的共识驱动事件型图信息瓶颈(CDE-GIB)方法通过图正则化和事件触发机制优化多智能体间的信息交换,在效率和适应性方面均优于现有先进方法。
English: The proposed Consensus-Driven Event-Based Graph Information Bottleneck (CDE-GIB) method enhances multi-agent collaboration by optimizing information exchange through a graph-based regularizer and an event-triggering mechanism, achieving superior efficiency and adaptability over existing approaches.

Authors:Siddharth Singh, Prajwal Singhania, Aditya Ranjan, John Kirchenbauer, Jonas Geiping, Yuxin Wen, Neel Jain, Abhimanyu Hans, Manli Shu, Aditya Tomar, Tom Goldstein, Abhinav Bhatele
Title: Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers
Abstract:
Training and fine-tuning large language models (LLMs) with hundreds of billions to trillions of parameters requires tens of thousands of GPUs, and a highly scalable software stack. In this work, we present a novel four-dimensional hybrid parallel algorithm implemented in a highly scalable, portable, open-source framework called AxoNN. We describe several performance optimizations in AxoNN to improve matrix multiply kernel performance, overlap non-blocking collectives with computation, and performance modeling to choose performance optimal configurations. These have resulted in unprecedented scaling and peak flop/s (bf16) for training of GPT-style transformer models on Perlmutter (620.1 Petaflop/s), Frontier (1.381 Exaflop/s) and Alps (1.423 Exaflop/s). While the abilities of LLMs improve with the number of trainable parameters, so do privacy and copyright risks caused by memorization of training data, which can cause disclosure of sensitive or private information at inference time. We highlight this side effect of scale through experiments that explore "catastrophic memorization", where models are sufficiently large to memorize training data in a single pass, and present an approach to prevent it. As part of this study, we demonstrate fine-tuning of a 405-billion parameter LLM using AxoNN on Frontier.
训练具有数万亿参数的大型语言模型需要巨大的计算资源,本研究提出了可扩展框架AxoNN,在超级计算机上实现了破纪录的性能,同时通过防止训练数据灾难性记忆的技术来解决隐私风险。
Training large language models (LLMs) with trillions of parameters demands immense computational resources, and this work introduces AxoNN, a scalable framework that achieves record-breaking performance on supercomputers while addressing privacy risks through techniques to prevent catastrophic memorization of training data.

Authors:Guibin Zhang, Kaijie Chen, Guancheng Wan, Heng Chang, Hong Cheng, Kun Wang, Shuyue Hu, Lei Bai
Title: EvoFlow: Evolving Diverse Agentic Workflows On The Fly
Abstract:
The past two years have witnessed the evolution of large language model (LLM)-based multi-agent systems from labor-intensive manual design to partial automation (\textit{e.g.}, prompt engineering, communication topology) and eventually to fully automated design. However, existing agentic automation pipelines often lack LLM heterogeneity and focus on single-objective performance optimization, limiting their potential to combine weaker models for more customized and cost-effective solutions. To address this challenge, we propose EvoFlow, a niching evolutionary algorithm-based framework to automatically search a population of heterogeneous and complexity-adaptive agentic workflows, rather than a single homogeneous, complex workflow. Technically, EvoFlow performs \textit{(1) tag-based retrieval} to extract parent workflows from an agentic population, evolves new workflows through \textit{(2) crossover} and \textit{(3) mutation}, and employs \textit{(4) niching-based selection} to maintain population diversity and quality. Extensive evaluations across seven benchmarks demonstrate that EvoFlow is: \textbf{(I) diverse}, evolving a population of workflows ranging from simple I/O tasks to complex multi-turn interactions; \textbf{(II) high-performing}, outperforming previous handcrafted and automated workflows by $1.23\%\sim29.86\%$; \textbf{(III) economical}, surpassing powerful \llmname{o1-preview} at $12.4\%$ of its inference cost using weaker open-source models.
中文摘要:EvoFlow提出了一种基于生态位进化算法的框架,能自动搜索异构且复杂度自适应的群体工作流,通过组合较弱模型实现了性能提升与成本优化,在七个基准测试中展现出卓越表现。
English Summary: EvoFlow introduces a niching evolutionary algorithm to automatically generate diverse populations of heterogeneous, complexity-adaptive agentic workflows, outperforming existing methods in performance and cost-efficiency by combining weaker models effectively.

Authors:Jialong Zuo, Shengpeng Ji, Minghui Fang, Ziyue Jiang, Xize Cheng, Qian Yang, Wenrui Liu, Guangyan Zhang, Zehai Tu, Yiwen Guo, Zhou Zhao
Title: Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model
Abstract:
This paper introduces PFlow-VC, a conditional flow matching voice conversion model that leverages fine-grained discrete pitch tokens and target speaker prompt information for expressive voice conversion (VC). Previous VC works primarily focus on speaker conversion, with further exploration needed in enhancing expressiveness (such as prosody and emotion) for timbre conversion. Unlike previous methods, we adopt a simple and efficient approach to enhance the style expressiveness of voice conversion models. Specifically, we pretrain a self-supervised pitch VQVAE model to discretize speaker-irrelevant pitch information and leverage a masked pitch-conditioned flow matching model for Mel-spectrogram synthesis, which provides in-context pitch modeling capabilities for the speaker conversion model, effectively improving the voice style transfer capacity. Additionally, we improve timbre similarity by combining global timbre embeddings with time-varying timbre tokens. Experiments on unseen LibriTTS test-clean and emotional speech dataset ESD show the superiority of the PFlow-VC model in both timbre conversion and style transfer. Audio samples are available on the demo page https://speechai-demo.github.io/PFlow-VC/.
中文摘要:本文提出PFlow-VC语音转换模型,通过离散音高标记和说话人提示信息增强音色转换与风格迁移的表现力,在测试数据集上展现出优越性能。
English Summary: This paper presents PFlow-VC, a voice conversion model that uses discrete pitch tokens and speaker prompts to enhance expressiveness in both timbre conversion and style transfer, demonstrating superior performance on test datasets.

Authors:Steffen Eger, Yong Cao, Jennifer D'Souza, Andreas Geiger, Christian Greisinger, Stephanie Gross, Yufang Hou, Brigitte Krenn, Anne Lauscher, Yizhi Li, Chenghua Lin, Nafise Sadat Moosavi, Wei Zhao, Tristan Miller
Title: Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation
Abstract:
With the advent of large multimodal language models, science is now at a threshold of an AI-based technological transformation. Recently, a plethora of new AI models and tools has been proposed, promising to empower researchers and academics worldwide to conduct their research more effectively and efficiently. This includes all aspects of the research cycle, especially (1) searching for relevant literature; (2) generating research ideas and conducting experimentation; generating (3) text-based and (4) multimodal content (e.g., scientific figures and diagrams); and (5) AI-based automatic peer review. In this survey, we provide an in-depth overview over these exciting recent developments, which promise to fundamentally alter the scientific research process for good. Our survey covers the five aspects outlined above, indicating relevant datasets, methods and results (including evaluation) as well as limitations and scope for future research. Ethical concerns regarding shortcomings of these tools and potential for misuse (fake science, plagiarism, harms to research integrity) take a particularly prominent place in our discussion. We hope that our survey will not only become a reference guide for newcomers to the field but also a catalyst for new AI-based initiatives in the area of "AI4Science".
大型多模态语言模型正通过提升从文献检索到同行评审等整个科研周期的效率,推动科学研究的革命性变革,同时也引发了关于其潜在误用的重要伦理思考。
Large multimodal language models are poised to revolutionize scientific research by enhancing efficiency across the entire research cycle, from literature review to peer review, while raising important ethical considerations regarding their potential misuse.

Authors:Guibin Zhang, Luyang Niu, Junfeng Fang, Kun Wang, Lei Bai, Xiang Wang
Title: Multi-agent Architecture Search via Agentic Supernet
Abstract:
Large Language Model (LLM)-empowered multi-agent systems extend the cognitive boundaries of individual agents through disciplined collaboration and interaction, while constructing these systems often requires labor-intensive manual designs. Despite the availability of methods to automate the design of agentic workflows, they typically seek to identify a static, complex, one-size-fits-all system, which, however, fails to dynamically allocate inference resources based on the difficulty and domain of each query. To address this challenge, we shift away from the pursuit of a monolithic agentic system, instead optimizing the \textbf{agentic supernet}, a probabilistic and continuous distribution of agentic architectures. We introduce MaAS, an automated framework that samples query-dependent agentic systems from the supernet, delivering high-quality solutions and tailored resource allocation (\textit{e.g.}, LLM calls, tool calls, token cost). Comprehensive evaluation across six benchmarks demonstrates that MaAS \textbf{(I)} requires only $6\sim45\%$ of the inference costs of existing handcrafted or automated multi-agent systems, \textbf{(II)} surpasses them by $0.54\%\sim11.82\%$, and \textbf{(III)} enjoys superior cross-dataset and cross-LLM-backbone transferability.
中文: 提出的MaAS框架通过从概率性智能体超网中动态生成针对特定查询的多智能体系统,相比现有系统可减少55-94%的推理成本,同时性能提升最高达11.82%。
English: The proposed MaAS framework dynamically generates query-specific multi-agent systems from a probabilistic agentic supernet, significantly reducing inference costs by 55-94% while improving performance by up to 11.82% compared to existing systems.

Authors:Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vulić, Anna Korhonen, Sercan Ö. Arık
Title: Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies
Abstract:
Large language models, employed as multiple agents that interact and collaborate with each other, have excelled at solving complex tasks. The agents are programmed with prompts that declare their functionality, along with the topologies that orchestrate interactions across agents. Designing prompts and topologies for multi-agent systems (MAS) is inherently complex. To automate the entire design process, we first conduct an in-depth analysis of the design space aiming to understand the factors behind building effective MAS. We reveal that prompts together with topologies play critical roles in enabling more effective MAS design. Based on the insights, we propose Multi-Agent System Search (MASS), a MAS optimization framework that efficiently exploits the complex MAS design space by interleaving its optimization stages, from local to global, from prompts to topologies, over three stages: 1) block-level (local) prompt optimization; 2) workflow topology optimization; 3) workflow-level (global) prompt optimization, where each stage is conditioned on the iteratively optimized prompts/topologies from former stages. We show that MASS-optimized multi-agent systems outperform a spectrum of existing alternatives by a substantial margin. Based on the MASS-found systems, we finally propose design principles behind building effective multi-agent systems.
中文摘要:多智能体系统搜索(MASS)框架通过三个交错阶段优化提示和拓扑结构,自动设计多智能体系统,最终实现的系统性能远超现有方案。
English Summary: The Multi-Agent System Search (MASS) framework automates the design of multi-agent systems by optimizing prompts and topologies through three interleaved stages, resulting in systems that significantly outperform existing alternatives.

Authors:Mark Horton, Tergel Molom-Ochir, Peter Liu, Bhavna Gopal, Chiyue Wei, Cong Guo, Brady Taylor, Deliang Fan, Shan X. Wang, Hai Li, Yiran Chen
Title: Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers
Abstract:
Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, often limiting real-world deployment due to their high computational and memory requirements. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into {-1, +1} vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, which further reduces the cost of processing long-context sequences. \par Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, leading to substantially improved accuracy compared to prior transformer binarization methods. We evaluate HAD on a range of tasks and models, including the GLUE benchmark, ImageNet, and QuALITY, demonstrating state-of-the-art performance among binarized Transformers while drastically reducing the computational costs of long-context inference. \par We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention. HAD achieves just $\mathbf{1.78}\%$ performance losses on GLUE compared to $9.08\%$ in state-of-the-art binarization work, and $\mathbf{2.5}\%$ performance losses on ImageNet compared to $12.14\%$, all while targeting custom hardware with a $\mathbf{79}\%$ area reduction and $\mathbf{87}\%$ power reduction compared to its standard attention counterpart.
中文: 本文提出汉明注意力蒸馏(HAD)框架,通过二值化注意力键值对实现高效汉明距离计算与注意力稀疏化,在显著降低计算成本和硬件需求的同时保持了接近完整的模型性能。
English: This paper introduces Hamming Attention Distillation (HAD), a novel framework that binarizes attention keys and queries to enable efficient Hamming distance computations and attention sparsification, achieving near-full performance with drastically reduced computational costs and hardware requirements.

Authors:Cliff Wong, Sam Preston, Qianchu Liu, Zelalem Gero, Jaspreet Bagga, Sheng Zhang, Shrey Jain, Theodore Zhao, Yu Gu, Yanbo Xu, Sid Kiblawi, Srinivasan Yegnasubramanian, Taxiarchis Botsis, Marvin Borja, Luis M. Ahumada, Joseph C. Murray, Guo Hui Gan, Roshanthi Weerasinghe, Kristina Young, Rom Leidner, Brian Piening, Carlo Bifulco, Tristan Naumann, Mu Wei, Hoifung Poon
Title: Universal Abstraction: Harnessing Frontier Models to Structure Real-World Data at Scale
Abstract:
A significant fraction of real-world patient information resides in unstructured clinical text. Medical abstraction extracts and normalizes key structured attributes from free-text clinical notes, which is the prerequisite for a variety of important downstream applications, including registry curation, clinical trial operations, and real-world evidence generation. Prior medical abstraction methods typically resort to building attribute-specific models, each of which requires extensive manual effort such as rule creation or supervised label annotation for the individual attribute, thus limiting scalability. In this paper, we show that existing frontier models already possess the universal abstraction capability for scaling medical abstraction to a wide range of clinical attributes. We present UniMedAbstractor (UMA), a unifying framework for zero-shot medical abstraction with a modular, customizable prompt template and the selection of any frontier large language models. Given a new attribute for abstraction, users only need to conduct lightweight prompt adaptation in UMA to adjust the specification in natural languages. Compared to traditional methods, UMA eliminates the need for attribute-specific training labels or handcrafted rules, thus substantially reducing the development time and cost. We conducted a comprehensive evaluation of UMA in oncology using a wide range of marquee attributes representing the cancer patient journey. These include relatively simple attributes typically specified within a single clinical note (e.g. performance status), as well as complex attributes requiring sophisticated reasoning across multiple notes at various time points (e.g. tumor staging). Based on a single frontier model such as GPT-4o, UMA matched or even exceeded the performance of state-of-the-art attribute-specific methods, each of which was tailored to the individual attribute.
中文摘要:UniMedAbstractor (UMA) 提出了一个基于前沿语言模型的通用医疗摘要框架,无需针对特定属性进行训练即可实现零样本信息提取,在显著降低开发成本的同时保持了与传统专项方法相当的性能。
English summary: UniMedAbstractor (UMA) introduces a universal framework using frontier language models for zero-shot medical abstraction, eliminating the need for attribute-specific training and substantially reducing development costs while matching specialized methods' performance.

Authors:Jiawen Zhang, Kejia Chen, Zunlei Feng, Jian Lou, Mingli Song, Jian Liu, Xiaohu Yang
Title: SecPE: Secure Prompt Ensembling for Private and Robust Large Language Models
Abstract:
With the growing popularity of LLMs among the general public users, privacy-preserving and adversarial robustness have become two pressing demands for LLM-based services, which have largely been pursued separately but rarely jointly. In this paper, to the best of our knowledge, we are among the first attempts towards robust and private LLM inference by tightly integrating two disconnected fields: private inference and prompt ensembling. The former protects users' privacy by encrypting inference data transmitted and processed by LLMs, while the latter enhances adversarial robustness by yielding an aggregated output from multiple prompted LLM responses. Although widely recognized as effective individually, private inference for prompt ensembling together entails new challenges that render the naive combination of existing techniques inefficient. To overcome the hurdles, we propose SecPE, which designs efficient fully homomorphic encryption (FHE) counterparts for the core algorithmic building blocks of prompt ensembling. We conduct extensive experiments on 8 tasks to evaluate the accuracy, robustness, and efficiency of SecPE. The results show that SecPE maintains high clean accuracy and offers better robustness at the expense of merely $2.5\%$ efficiency overhead compared to baseline private inference methods, indicating a satisfactory ``accuracy-robustness-efficiency'' tradeoff. For the efficiency of the encrypted Argmax operation that incurs major slowdown for prompt ensembling, SecPE is 35.4x faster than the state-of-the-art peers, which can be of independent interest beyond this work.
中文: 本文提出SecPE方法,通过融合私有推理和提示集成技术,在保护用户隐私的同时增强大语言模型的抗攻击能力,实验表明其能以微小效率代价实现优异的准确性与鲁棒性平衡。
English: This paper introduces SecPE, a novel approach that integrates private inference and prompt ensembling to achieve both privacy protection and adversarial robustness in LLM services, demonstrating high accuracy and efficiency with minimal overhead.

Authors:Jiawen Zhang, Kejia Chen, Lipeng He, Jian Lou, Dan Li, Zunlei Feng, Mingli Song, Jian Liu, Kui Ren, Xiaohu Yang
Title: Activation Approximations Can Incur Safety Vulnerabilities Even in Aligned LLMs: Comprehensive Analysis and Defense
Abstract:
Large Language Models (LLMs) have showcased remarkable capabilities across various domains. Accompanying the evolving capabilities and expanding deployment scenarios of LLMs, their deployment challenges escalate due to their sheer scale and the advanced yet complex activation designs prevalent in notable model series, such as Llama, Gemma, Mistral. These challenges have become particularly pronounced in resource-constrained deployment scenarios, where mitigating inference bottlenecks is imperative. Among various recent efforts, activation approximation has emerged as a promising avenue for pursuing inference efficiency, sometimes considered indispensable in applications such as private inference. Despite achieving substantial speedups with minimal impact on utility, even appearing sound and practical for real-world deployment, the safety implications of activation approximations remain unclear. In this work, we fill this critical gap in LLM safety by conducting the first systematic safety evaluation of activation approximations. Our safety vetting spans seven state-of-the-art techniques across three popular categories (activation polynomialization, activation sparsification, and activation quantization), revealing consistent safety degradation across ten safety-aligned LLMs. To overcome the hurdle of devising a unified defense accounting for diverse activation approximation methods, we perform an in-depth analysis of their shared error patterns and uncover three key findings. We propose QuadA, a novel safety enhancement method tailored to mitigate the safety compromises introduced by activation approximations. Extensive experiments and ablation studies corroborate QuadA's effectiveness in enhancing the safety capabilities of LLMs after activation approximations.
中文: 本研究首次系统评估了大语言模型中激活近似技术对安全性的影响,揭示了其导致的安全性能下降问题,并提出新型防御方法QuadA能有效缓解这些风险。
English: This study conducts the first systematic safety evaluation of activation approximation techniques in Large Language Models, revealing consistent safety degradation and proposing QuadA, a novel defense method that effectively mitigates these risks.

Authors:Haoran Zhang, Yong Liu, Yunzhong Qiu, Haixuan Liu, Zhongyi Pei, Jianmin Wang, Mingsheng Long
Title: TimesBERT: A BERT-Style Foundation Model for Time Series Understanding
Abstract:
Time series analysis is crucial in diverse scenarios. Beyond forecasting, considerable real-world tasks are categorized into classification, imputation, and anomaly detection, underscoring different capabilities termed time series understanding in this paper. While GPT-style models have been positioned as foundation models for time series forecasting, the BERT-style architecture, which has made significant advances in natural language understanding, has not been fully unlocked for time series understanding, possibly attributed to the undesirable dropout of essential elements of BERT. In this paper, inspired by the shared multi-granularity structure between multivariate time series and multisentence documents, we design TimesBERT to learn generic representations of time series including temporal patterns and variate-centric characteristics. In addition to a natural adaptation of masked modeling, we propose a parallel task of functional token prediction to embody vital multi-granularity structures. Our model is pre-trained on 260 billion time points across diverse domains. Leveraging multi-granularity representations, TimesBERT achieves state-of-the-art performance across four typical downstream understanding tasks, outperforming task-specific models and language pre-trained backbones, positioning it as a versatile foundation model for time series understanding.
中文: 本文提出TimesBERT模型,通过多粒度表征学习时间序列的通用特征,在四大下游理解任务中实现最优性能,成为时间序列理解的通用基础模型。
English: This paper introduces TimesBERT, a BERT-style foundation model for time series understanding that leverages multi-granularity representations and achieves state-of-the-art performance across classification, imputation, anomaly detection, and forecasting tasks.

Authors:Nanshan Deng, Weitao Zhou, Bo Zhang, Junze Wen, Kun Jiang, Zhong Cao, Diange Yang
Title: Dynamically Local-Enhancement Planner for Large-Scale Autonomous Driving
Abstract:
Current autonomous vehicles operate primarily within limited regions, but there is increasing demand for broader applications. However, as models scale, their limited capacity becomes a significant challenge for adapting to novel scenarios. It is increasingly difficult to improve models for new situations using a single monolithic model. To address this issue, we introduce the concept of dynamically enhancing a basic driving planner with local driving data, without permanently modifying the planner itself. This approach, termed the Dynamically Local-Enhancement (DLE) Planner, aims to improve the scalability of autonomous driving systems without significantly expanding the planner's size. Our approach introduces a position-varying Markov Decision Process formulation coupled with a graph neural network that extracts region-specific driving features from local observation data. The learned features describe the local behavior of the surrounding objects, which is then leveraged to enhance a basic reinforcement learning-based policy. We evaluated our approach in multiple scenarios and compared it with a one-for-all driving model. The results show that our method outperforms the baseline policy in both safety (collision rate) and average reward, while maintaining a lighter scale. This approach has the potential to benefit large-scale autonomous vehicles without the need for largely expanding on-device driving models.
中文摘要:动态局部增强(DLE)规划器通过利用本地数据优化基础驾驶规划器,无需扩大模型规模即可提升自动驾驶系统在新场景中的安全性和性能表现。
English Summary: The Dynamically Local-Enhancement (DLE) Planner enhances autonomous driving systems by adapting a basic planner to novel scenarios using local data, improving safety and performance without significantly increasing model size.

Authors:Zhengxuan Zhang, Yin Wu, Yuyu Luo, Nan Tang
Title: Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering
Abstract:
Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images. Although cutting-edge multimodal large language models (MLLMs) such as GPT-4o achieve strong performance on VQA tasks, they frequently fall short in accessing domain-specific or the latest knowledge. To mitigate this issue, retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs), referred to as KB-VQA, emerges as a promising approach. Nevertheless, conventional unimodal retrieval techniques, which translate images into textual descriptions, often result in the loss of critical visual details. To address these challenges, this study presents two key innovations. First, we introduce fine-grained knowledge units that consist of multimodal data fragments (e.g. text fragments, entity images, and so on) in a structured manner. Rather than merely refining retrieval mechanisms, we prioritize the systematic organization and management of these knowledge units, ensuring that the structuring process itself enhances retrieval quality. Second, we propose a knowledge unit retrieval-augmented generation framework (KU-RAG) that seamlessly integrates fine-grained retrieval with MLLMs. Our KU-RAG framework not only ensures precise retrieval of relevant knowledge but also enhances reasoning capabilities through a knowledge correction chain. Experimental results demonstrate that our approach consistently outperforms existing KB-VQA methods across four benchmarks, achieving an average improvement of approximately 3% and up to 11% in the best case.
中文: 本研究通过引入细粒度多模态知识单元和知识单元检索增强生成框架,提升了视觉问答任务中的知识检索与推理能力,在多个基准测试中显著优于现有方法。
English: This study introduces fine-grained multimodal knowledge units and a knowledge unit retrieval-augmented generation (KU-RAG) framework to enhance Visual Question Answering by improving knowledge retrieval and reasoning, achieving significant performance gains over existing methods.

Authors:Hanyang Kong, Xingyi Yang, Xinchao Wang
Title: Efficient Gaussian Splatting for Monocular Dynamic Scene Rendering via Sparse Time-Variant Attribute Modeling
Abstract:
Rendering dynamic scenes from monocular videos is a crucial yet challenging task. The recent deformable Gaussian Splatting has emerged as a robust solution to represent real-world dynamic scenes. However, it often leads to heavily redundant Gaussians, attempting to fit every training view at various time steps, leading to slower rendering speeds. Additionally, the attributes of Gaussians in static areas are time-invariant, making it unnecessary to model every Gaussian, which can cause jittering in static regions. In practice, the primary bottleneck in rendering speed for dynamic scenes is the number of Gaussians. In response, we introduce Efficient Dynamic Gaussian Splatting (EDGS), which represents dynamic scenes via sparse time-variant attribute modeling. Our approach formulates dynamic scenes using a sparse anchor-grid representation, with the motion flow of dense Gaussians calculated via a classical kernel representation. Furthermore, we propose an unsupervised strategy to efficiently filter out anchors corresponding to static areas. Only anchors associated with deformable objects are input into MLPs to query time-variant attributes. Experiments on two real-world datasets demonstrate that our EDGS significantly improves the rendering speed with superior rendering quality compared to previous state-of-the-art methods.
中文: 提出的高效动态高斯泼溅(EDGS)方法通过稀疏时变属性建模和运动流计算,减少了动态场景渲染中的冗余高斯分布,在保持优于现有技术渲染质量的同时显著提升了渲染速度。
English: The proposed Efficient Dynamic Gaussian Splatting (EDGS) method reduces redundant Gaussians in dynamic scene rendering by using sparse time-variant attribute modeling and motion flow calculation, significantly improving rendering speed while maintaining superior quality compared to existing techniques.

Authors:Minggui He, Yilun Liu, Shimin Tao, Yuanchang Luo, Hongyong Zeng, Chang Su, Li Zhang, Hongxia Ma, Daimeng Wei, Weibin Meng, Hao Yang, Boxing Chen, Osamu Yoshie
Title: R1-T1: Fully Incentivizing Translation Capability in LLMs via Reasoning Learning
Abstract:
Despite recent breakthroughs in reasoning-enhanced large language models (LLMs) like DeepSeek-R1, incorporating inference-time reasoning into machine translation (MT), where human translators naturally employ structured, multi-layered reasoning chain-of-thoughts (CoTs), is yet underexplored. Existing methods either design a fixed CoT tailored for a specific MT sub-task (e.g., literature translation), or rely on synthesizing CoTs unaligned with humans and supervised fine-tuning (SFT) prone to overfitting, limiting their adaptability to diverse translation scenarios. This paper introduces R1-Translator (R1-T1), a novel framework to achieve inference-time reasoning for general MT via reinforcement learning (RL) with human-aligned CoTs comprising six common patterns. Our approach pioneers three innovations: (1) extending reasoning-based translation to broader MT scenarios (e.g., multilingual MT, domain MT) unseen in the training phase; (2) formalizing six expert-curated CoT templates that mirror hybrid human strategies like context-aware paraphrasing and back translation; and (3) enabling self-evolving CoT discovery through RL. Both human and automatic evaluation results indicate a steady translation performance improvement in a total of 10+ languages and 40+ translation directions on Flores-101 test set and four domain-specific MT tasks, especially on the languages unseen from training.
中文摘要:本文提出R1-Translator框架,通过结合人类对齐的思维链模式与强化学习,在多种语言和领域翻译任务中实现了稳定性能提升。
English Summary: This paper introduces R1-Translator, a novel framework that enhances machine translation through reinforcement learning with human-aligned reasoning patterns, demonstrating improved performance across diverse languages and domains.

Authors:Mingsheng Cai, Jiuming Jiang, Wenhao Huang, Che Liu, Rossella Arcucci
Title: SuPreME: A Supervised Pre-training Framework for Multimodal ECG Representation Learning
Abstract:
Cardiovascular diseases are a leading cause of death and disability worldwide. Electrocardiogram (ECG) is critical for diagnosing and monitoring cardiac health, but obtaining large-scale annotated ECG datasets is labor-intensive and time-consuming. Recent ECG Self-Supervised Learning (eSSL) methods mitigate this by learning features without extensive labels but fail to capture fine-grained clinical semantics and require extensive task-specific fine-tuning. To address these challenges, we propose $\textbf{SuPreME}$, a $\textbf{Su}$pervised $\textbf{Pre}$-training framework for $\textbf{M}$ultimodal $\textbf{E}$CG representation learning. SuPreME is pre-trained using structured diagnostic labels derived from ECG report entities through a one-time offline extraction with Large Language Models (LLMs), which help denoise, standardize cardiac concepts, and improve clinical representation learning. By fusing ECG signals with textual cardiac queries instead of fixed labels, SuPreME enables zero-shot classification of unseen conditions without further fine-tuning. We evaluate SuPreME on six downstream datasets covering 106 cardiac conditions, achieving superior zero-shot AUC performance of $77.20\%$, surpassing state-of-the-art eSSLs by $4.98\%$. Results demonstrate SuPreME's effectiveness in leveraging structured, clinically relevant knowledge for high-quality ECG representations.
中文: 提出的SuPreME框架利用经大型语言模型处理的结构化诊断标签进行监督式预训练,构建多模态心电图表征,无需微调即可实现卓越的零样本分类性能。
English: The proposed SuPreME framework uses supervised pre-training with LLM-processed diagnostic labels to create multimodal ECG representations, achieving superior zero-shot classification performance without requiring fine-tuning.

Authors:Yu Liu, Baoxiong Jia, Ruijie Lu, Junfeng Ni, Song-Chun Zhu, Siyuan Huang
Title: ArtGS: Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting
Abstract:
Building articulated objects is a key challenge in computer vision. Existing methods often fail to effectively integrate information across different object states, limiting the accuracy of part-mesh reconstruction and part dynamics modeling, particularly for complex multi-part articulated objects. We introduce ArtGS, a novel approach that leverages 3D Gaussians as a flexible and efficient representation to address these issues. Our method incorporates canonical Gaussians with coarse-to-fine initialization and updates for aligning articulated part information across different object states, and employs a skinning-inspired part dynamics modeling module to improve both part-mesh reconstruction and articulation learning. Extensive experiments on both synthetic and real-world datasets, including a new benchmark for complex multi-part objects, demonstrate that ArtGS achieves state-of-the-art performance in joint parameter estimation and part mesh reconstruction. Our approach significantly improves reconstruction quality and efficiency, especially for multi-part articulated objects. Additionally, we provide comprehensive analyses of our design choices, validating the effectiveness of each component to highlight potential areas for future improvement. Our work is made publicly available at: https://articulate-gs.github.io.
Chinese: ArtGS 提出了一种新颖的基于3D高斯的方法,通过从粗到细的初始化和仿皮肤模块显著改进了复杂铰接物体的部件网格重建和关节学习,在联合参数估计方面达到了最先进的性能。
English: ArtGS introduces a novel 3D Gaussian-based representation with coarse-to-fine initialization and a skinning-inspired module to significantly enhance part-mesh reconstruction and articulation learning for complex articulated objects, achieving state-of-the-art performance in joint parameter estimation.

Authors:Teng Lin, Yuyu Luo, Honglin Zhang, Jicheng Zhang, Chunlin Liu, Kaishun Wu, Nan Tang
Title: MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering
Abstract:
Multi-entity question answering (MEQA) represents significant challenges for large language models (LLM) and retrieval-augmented generation (RAG) systems, which frequently struggle to consolidate scattered information across diverse documents. While existing methods excel at single-document comprehension, they often struggle with cross-document aggregation, particularly when resolving entity-dense questions like "What is the distribution of ACM Fellows among various fields of study?", which require integrating entity-centric insights from heterogeneous sources (e.g., Wikipedia pages). To address this gap, we introduce MEBench, a novel multi-document, multi-entity benchmark designed to systematically evaluate LLMs' capacity to retrieve, consolidate, and reason over fragmented information. Our benchmark comprises 4,780 questions which are systematically categorized into three primary categories, further divided into eight distinct types, ensuring broad coverage of real-world multi-entity reasoning scenarios. Our experiments on state-of-the-art LLMs (e.g., GPT-4, Llama-3) and RAG pipelines reveal critical limitations: even advanced models achieve only 59% accuracy on MEBench. Our benchmark emphasizes the importance of completeness and factual precision of information extraction in MEQA tasks, using Entity-Attributed F1 (EA-F1) metric for granular evaluation of entity-level correctness and attribution validity. MEBench not only highlights systemic weaknesses in current LLM frameworks but also provides a foundation for advancing robust, entity-aware QA architectures.
中文: MEBench作为一个新型多文档基准测试,揭示了当前大语言模型和检索增强生成系统在多实体问答中的严重不足——即便顶级模型在整合跨文档分散的实体信息时准确率也仅达59%。
English: MEBench is a new multi-document benchmark that exposes significant limitations in current LLMs and RAG systems for multi-entity question answering, where even top models achieve only 59% accuracy in consolidating scattered entity information across documents.

Authors:Chuanguang Yang, Xinqiang Yu, Han Yang, Zhulin An, Chengqing Yu, Libo Huang, Yongjun Xu
Title: Multi-Teacher Knowledge Distillation with Reinforcement Learning for Visual Recognition
Abstract:
Multi-teacher Knowledge Distillation (KD) transfers diverse knowledge from a teacher pool to a student network. The core problem of multi-teacher KD is how to balance distillation strengths among various teachers. Most existing methods often develop weighting strategies from an individual perspective of teacher performance or teacher-student gaps, lacking comprehensive information for guidance. This paper proposes Multi-Teacher Knowledge Distillation with Reinforcement Learning (MTKD-RL) to optimize multi-teacher weights. In this framework, we construct both teacher performance and teacher-student gaps as state information to an agent. The agent outputs the teacher weight and can be updated by the return reward from the student. MTKD-RL reinforces the interaction between the student and teacher using an agent in an RL-based decision mechanism, achieving better matching capability with more meaningful weights. Experimental results on visual recognition tasks, including image classification, object detection, and semantic segmentation tasks, demonstrate that MTKD-RL achieves state-of-the-art performance compared to the existing multi-teacher KD works.
中文: 本文提出MTKD-RL方法,通过强化学习整合教师表现和师生差异信息来优化多教师知识蒸馏中的权重分配,在多项视觉识别任务中实现了最优性能。
English: This paper introduces MTKD-RL, a reinforcement learning-based method that optimizes teacher weighting in multi-teacher knowledge distillation by integrating both teacher performance and teacher-student gap information, achieving state-of-the-art results across various visual recognition tasks.

Authors:Jian Wu, Jiayu Zhang, Dongyuan Li, Linyi Yang, Aoxiao Zhong, Renhe Jiang, Qingsong Wen, Yue Zhang
Title: LAG: LLM agents for Leaderboard Auto Generation on Demanding
Abstract:
This paper introduces Leaderboard Auto Generation (LAG), a novel and well-organized framework for automatic generation of leaderboards on a given research topic in rapidly evolving fields like Artificial Intelligence (AI). Faced with a large number of AI papers updated daily, it becomes difficult for researchers to track every paper's proposed methods, experimental results, and settings, prompting the need for efficient automatic leaderboard construction. While large language models (LLMs) offer promise in automating this process, challenges such as multi-document summarization, leaderboard generation, and experiment fair comparison still remain under exploration. LAG solves these challenges through a systematic approach that involves the paper collection, experiment results extraction and integration, leaderboard generation, and quality evaluation. Our contributions include a comprehensive solution to the leaderboard construction problem, a reliable evaluation method, and experimental results showing the high quality of leaderboards.
Chinese: 本文介绍了Leaderboard Auto Generation (LAG),这是一个新颖且结构化的框架,通过收集论文、提取结果并评估质量,自动生成人工智能等快速发展领域的排行榜,有效解决了多文档摘要和公平比较的挑战。
English: This paper presents Leaderboard Auto Generation (LAG), a systematic framework that automates the creation of leaderboards for rapidly advancing fields like AI by collecting papers, extracting results, and evaluating quality to address challenges in multi-document summarization and fair comparisons.

Authors:Jian Wu, Jiayu Zhang, Dongyuan Li, Linyi Yang, Aoxiao Zhong, Renhe Jiang, Qingsong Wen, Yue Zhang
Title: League: Leaderboard Generation on Demand
Abstract:
This paper introduces Leaderboard Auto Generation (LAG), a novel and well-organized framework for automatic generation of leaderboards on a given research topic in rapidly evolving fields like Artificial Intelligence (AI). Faced with a large number of AI papers updated daily, it becomes difficult for researchers to track every paper's proposed methods, experimental results, and settings, prompting the need for efficient automatic leaderboard construction. While large language models (LLMs) offer promise in automating this process, challenges such as multi-document summarization, leaderboard generation, and experiment fair comparison still remain under exploration. LAG solves these challenges through a systematic approach that involves the paper collection, experiment results extraction and integration, leaderboard generation, and quality evaluation. Our contributions include a comprehensive solution to the leaderboard construction problem, a reliable evaluation method, and experimental results showing the high quality of leaderboards.
Chinese: 本文介绍了Leaderboard Auto Generation (LAG),这是一个新颖且结构化的框架,通过收集论文、提取结果并评估质量,自动生成人工智能等快速发展领域的排行榜,有效解决了多文档摘要和公平比较的挑战。
English: This paper presents Leaderboard Auto Generation (LAG), a systematic framework that automates the creation of leaderboards for rapidly advancing fields like AI by collecting papers, extracting results, and evaluating quality to address challenges in multi-document summarization and fair comparisons.

Authors:Che Liu, Cheng Ouyang, Zhongwei Wan, Haozhe Wang, Wenjia Bai, Rossella Arcucci
Title: Knowledge-enhanced Multimodal ECG Representation Learning with Arbitrary-Lead Inputs
Abstract:
Recent advances in multimodal ECG representation learning center on aligning ECG signals with paired free-text reports. However, suboptimal alignment persists due to the complexity of medical language and the reliance on a full 12-lead setup, which is often unavailable in under-resourced settings. To tackle these issues, we propose **K-MERL**, a knowledge-enhanced multimodal ECG representation learning framework. **K-MERL** leverages large language models to extract structured knowledge from free-text reports and employs a lead-aware ECG encoder with dynamic lead masking to accommodate arbitrary lead inputs. Evaluations on six external ECG datasets show that **K-MERL** achieves state-of-the-art performance in zero-shot classification and linear probing tasks, while delivering an average **16%** AUC improvement over existing methods in partial-lead zero-shot classification.
中文: 提出的K-MERL框架通过从报告中提取结构化知识并采用带动态导联掩码的导联感知编码器,提升了多模态心电图表征学习,在部分导联分类中实现了最优性能及16%的AUC提升。
English: The proposed K-MERL framework enhances multimodal ECG representation learning by integrating structured knowledge from reports and a lead-aware encoder with dynamic masking, achieving state-of-the-art performance and a 16% AUC improvement in partial-lead classification.

Authors:Boyan Li, Jiayi Zhang, Ju Fan, Yanwei Xu, Chong Chen, Nan Tang, Yuyu Luo
Title: Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search
Abstract:
Text-to-SQL, which enables natural language interaction with databases, serves as a pivotal method across diverse industries. With new, more powerful large language models (LLMs) emerging every few months, fine-tuning has become incredibly costly, labor-intensive, and error-prone. As an alternative, zero-shot Text-to-SQL, which leverages the growing knowledge and reasoning capabilities encoded in LLMs without task-specific fine-tuning, presents a promising and more challenging direction. To address this challenge, we propose Alpha-SQL, a novel approach that leverages a Monte Carlo Tree Search (MCTS) framework to iteratively infer SQL construction actions based on partial reasoning states. To enhance the framework's reasoning capabilities, we introduce LLM-as-Action-Model to dynamically generate SQL construction actions during the MCTS process, steering the search toward more promising SQL queries. Moreover, Alpha-SQL employs a self-supervised reward function to evaluate the quality of candidate SQL queries, ensuring more accurate and efficient query generation. Experimental results show that Alpha-SQL achieves 69.7% execution accuracy on the BIRD development set, using a 32B open-source LLM without fine-tuning. Alpha-SQL outperforms the best previous zero-shot approach based on GPT-4o by 2.5% on the BIRD development set.
中文摘要:Alpha-SQL提出了一种基于蒙特卡洛树搜索的零样本Text-to-SQL新方法,通过大语言模型动态生成SQL构建动作和自监督奖励机制,在BIRD基准测试中以69.7%的执行准确率超越GPT-4o方法2.5%,且无需微调。
English Summary: Alpha-SQL introduces a novel zero-shot Text-to-SQL approach using Monte Carlo Tree Search with LLM-guided action generation and self-supervised rewards, achieving 69.7% accuracy on BIRD benchmark and outperforming GPT-4o by 2.5% without fine-tuning.

Authors:Jiaxin Guo, Daimeng Wei, Zongyao Li, Hengchao Shang, Yuanchang Luo, Hao Yang
Title: Chain-of-Description: What I can understand, I can put into words
Abstract:
In this paper, we propose a novel strategy defined as Chain-of-Description (CoD) Prompting, tailored for Multi-Modal Large Language Models. This approach involves having the model first provide a detailed description of the multi-modal input before generating an answer to the question. When applied to models such as Qwen2-Audio, Qwen2-VL, and Qwen2.5-VL, CoD Prompting significantly enhances performance compared to standard prompting methods. This is demonstrated by nearly a 4\% improvement in the speech category of the audio benchmark AIR-Bench-Chat and a 5.3\% improvement in the hard-level portion of the vision benchmark MMMU\_Pro. Our ablation study further validates the effectiveness of CoD Prompting.
中文: 本文提出链式描述提示方法,通过让多模态大语言模型先详细描述输入内容再回答问题,显著提升了在音频和视觉基准测试中的性能表现。
English: This paper introduces Chain-of-Description (CoD) Prompting, a novel method that enhances Multi-Modal Large Language Models by first generating detailed descriptions of inputs before answering questions, significantly improving performance on audio and vision benchmarks.

Authors:Jan Trienes, Jörg Schlötterer, Junyi Jessy Li, Christin Seifert
Title: Behavioral Analysis of Information Salience in Large Language Models
Abstract:
Large Language Models (LLMs) excel at text summarization, a task that requires models to select content based on its importance. However, the exact notion of salience that LLMs have internalized remains unclear. To bridge this gap, we introduce an explainable framework to systematically derive and investigate information salience in LLMs through their summarization behavior. Using length-controlled summarization as a behavioral probe into the content selection process, and tracing the answerability of Questions Under Discussion throughout, we derive a proxy for how models prioritize information. Our experiments on 13 models across four datasets reveal that LLMs have a nuanced, hierarchical notion of salience, generally consistent across model families and sizes. While models show highly consistent behavior and hence salience patterns, this notion of salience cannot be accessed through introspection, and only weakly correlates with human perceptions of information salience.
Chinese: 大型语言模型通过摘要行为展现出对信息显著性的细致分层理解,但这种内在认知无法通过自省获取,且与人类对重要性的感知仅有微弱关联。
English: Large Language Models demonstrate a nuanced, hierarchical understanding of information salience through their summarization behavior, yet this internalized notion remains inaccessible via introspection and only weakly aligns with human perceptions of importance.

Authors:Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding
Title: MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models
Abstract:
Advancements in Large Language Models (LLMs) and their increasing use in medical question-answering necessitate rigorous evaluation of their reliability. A critical challenge lies in hallucination, where models generate plausible yet factually incorrect outputs. In the medical domain, this poses serious risks to patient safety and clinical decision-making. To address this, we introduce MedHallu, the first benchmark specifically designed for medical hallucination detection. MedHallu comprises 10,000 high-quality question-answer pairs derived from PubMedQA, with hallucinated answers systematically generated through a controlled pipeline. Our experiments show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with this binary hallucination detection task, with the best model achieving an F1 score as low as 0.625 for detecting "hard" category hallucinations. Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth. Through experiments, we also show incorporating domain-specific knowledge and introducing a "not sure" category as one of the answer categories improves the precision and F1 scores by up to 38% relative to baselines.
中文摘要:MedHallu是首个专门用于检测大语言模型医学幻觉的基准,研究表明即使顶尖模型也难以准确识别医疗虚假信息,但引入领域知识和不确定选项能显著提升检测效果。
English Summary: MedHallu is the first specialized benchmark for detecting medical hallucinations in LLMs, revealing that even top models like GPT-4o struggle with this critical safety task, though incorporating domain knowledge and uncertainty options can significantly improve detection performance.

Authors:Abudukelimu Wuerkaixi, Sen Cui, Jingfeng Zhang, Kunda Yan, Bo Han, Gang Niu, Lei Fang, Changshui Zhang, Masashi Sugiyama
Title: Accurate Forgetting for Heterogeneous Federated Continual Learning
Abstract:
Recent years have witnessed a burgeoning interest in federated learning (FL). However, the contexts in which clients engage in sequential learning remain under-explored. Bridging FL and continual learning (CL) gives rise to a challenging practical problem: federated continual learning (FCL). Existing research in FCL primarily focuses on mitigating the catastrophic forgetting issue of continual learning while collaborating with other clients. We argue that the forgetting phenomena are not invariably detrimental. In this paper, we consider a more practical and challenging FCL setting characterized by potentially unrelated or even antagonistic data/tasks across different clients. In the FL scenario, statistical heterogeneity and data noise among clients may exhibit spurious correlations which result in biased feature learning. While existing CL strategies focus on a complete utilization of previous knowledge, we found that forgetting biased information is beneficial in our study. Therefore, we propose a new concept accurate forgetting (AF) and develop a novel generative-replay method~\method~which selectively utilizes previous knowledge in federated networks. We employ a probabilistic framework based on a normalizing flow model to quantify the credibility of previous knowledge. Comprehensive experiments affirm the superiority of our method over baselines.
中文摘要:本文在联邦持续学习中提出“精确遗忘”概念,开发了一种选择性利用历史知识的生成回放方法,以解决客户端间虚假相关性导致的特征学习偏差问题。
English Summary: This paper introduces the concept of "accurate forgetting" in federated continual learning, proposing a generative-replay method that selectively utilizes previous knowledge to address biased feature learning from spurious correlations across clients.

Authors:Hongbo Zhang, Han Cui, Guangsheng Bao, Linyi Yang, Jun Wang, Yue Zhang
Title: Direct Value Optimization: Improving Chain-of-Thought Reasoning in LLMs with Refined Values
Abstract:
We introduce Direct Value Optimization (DVO), an innovative reinforcement learning framework for enhancing large language models in complex reasoning tasks. Unlike traditional methods relying on preference labels, DVO utilizes value signals at individual reasoning steps, optimizing models via a mean squared error loss. The key benefit of DVO lies in its fine-grained supervision, circumventing the need for labor-intensive human annotations. Target values within the DVO are estimated using either Monte Carlo Tree Search or an outcome value model. Our empirical analysis on both mathematical and commonsense reasoning tasks shows that DVO consistently outperforms existing offline preference optimization techniques, even with fewer training steps. These findings underscore the importance of value signals in advancing reasoning capabilities and highlight DVO as a superior methodology under scenarios lacking explicit human preference information.
Chinese: DVO是一种强化学习框架,通过利用推理步骤中的价值信号进行优化,无需人工偏好标签,在数学和常识推理任务中以更少的训练步骤超越现有方法。
English: DVO is a reinforcement learning framework that enhances large language models by using step-level value signals for optimization, eliminating the need for human preference labels and outperforming existing methods in reasoning tasks with fewer training steps.

Authors:Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shitao Xiao, Akshita Sukhlecha, Bhavish Pahwa, Rafał Poświata, Kranthi Kiran GV, Shawon Ashraf, Daniel Auras, Björn Plüster, Jan Philipp Harries, Loïc Magne, Isabelle Mohr, Mariya Hendriksen, Dawei Zhu, Hippolyte Gisserot-Boukhlef, Tom Aarsen, Jan Kostkan, Konrad Wojtasik, Taemin Lee, Marek Šuppa, Crystina Zhang, Roberta Rocca, Mohammed Hamdy, Andrianos Michail, John Yang, Manuel Faysse, Aleksei Vatolin, Nandan Thakur, Manan Dey, Dipam Vasani, Pranjal Chitale, Simone Tedeschi, Nguyen Tai, Artem Snegirev, Michael Günther, Mengzhou Xia, Weijia Shi, Xing Han Lù, Jordan Clive, Gayatri Krishnakumar, Anna Maksimova, Silvan Wehrli, Maria Tikhonova, Henil Panchal, Aleksandr Abramov, Malte Ostendorff, Zheng Liu, Simon Clematide, Lester James Miranda, Alena Fenogenova, Guangyu Song, Ruqiya Bin Safi, Wen-Ding Li, Alessia Borghini, Federico Cassano, Hongjin Su, Jimmy Lin, Howard Yen, Lasse Hansen, Sara Hooker, Chenghao Xiao, Vaibhav Adlakha, Orion Weller, Siva Reddy, Niklas Muennighoff
Title: MMTEB: Massive Multilingual Text Embedding Benchmark
Abstract:
Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.
中文: 大规模多语言文本嵌入基准(MMTEB)扩展了MTEB,涵盖250多种语言的500多项任务,发现如multilingual-e5-large-instruct等较小模型能超越大型语言模型,并通过高效降采样方法显著降低计算成本。
English: The Massive Multilingual Text Embedding Benchmark (MMTEB) expands MTEB with over 500 tasks across 250+ languages, revealing that smaller models like multilingual-e5-large-instruct can outperform large language models, while introducing efficient downsampling methods to reduce computational costs.

Authors:Ningke Li, Yahui Song, Kailong Wang, Yuekang Li, Ling Shi, Yi Liu, Haoyu Wang
Title: Detecting LLM Fact-conflicting Hallucinations Enhanced by Temporal-logic-based Reasoning
Abstract:
Large language models (LLMs) face the challenge of hallucinations -- outputs that seem coherent but are actually incorrect. A particularly damaging type is fact-conflicting hallucination (FCH), where generated content contradicts established facts. Addressing FCH presents three main challenges: 1) Automatically constructing and maintaining large-scale benchmark datasets is difficult and resource-intensive; 2) Generating complex and efficient test cases that the LLM has not been trained on -- especially those involving intricate temporal features -- is challenging, yet crucial for eliciting hallucinations; and 3) Validating the reasoning behind LLM outputs is inherently difficult, particularly with complex logical relationships, as it requires transparency in the model's decision-making process. This paper presents Drowzee, an innovative end-to-end metamorphic testing framework that utilizes temporal logic to identify fact-conflicting hallucinations (FCH) in large language models (LLMs). Drowzee builds a comprehensive factual knowledge base by crawling sources like Wikipedia and uses automated temporal-logic reasoning to convert this knowledge into a large, extensible set of test cases with ground truth answers. LLMs are tested using these cases through template-based prompts, which require them to generate both answers and reasoning steps. To validate the reasoning, we propose two semantic-aware oracles that compare the semantic structure of LLM outputs to the ground truths. Across nine LLMs in nine different knowledge domains, experimental results show that Drowzee effectively identifies rates of non-temporal-related hallucinations ranging from 24.7% to 59.8%, and rates of temporal-related hallucinations ranging from 16.7% to 39.2%.
中文: 本文提出Drowzee框架,通过构建知识库并利用时序逻辑检测大语言模型中的事实冲突幻觉,采用语义感知验证机制,在多领域测试中成功识别出16.7%-59.8%的时序与非时序幻觉。
English: This paper introduces Drowzee, an end-to-end metamorphic testing framework that uses temporal logic to detect fact-conflicting hallucinations in LLMs by constructing a knowledge base and validating reasoning through semantic-aware oracles, effectively identifying both non-temporal and temporal hallucinations across multiple models.

Authors:Konstantin Hess, Dennis Frauen, Valentyn Melnychuk, Stefan Feuerriegel
Title: Efficient and Sharp Off-Policy Learning under Unobserved Confounding
Abstract:
We develop a novel method for personalized off-policy learning in scenarios with unobserved confounding. Thereby, we address a key limitation of standard policy learning: standard policy learning assumes unconfoundedness, meaning that no unobserved factors influence both treatment assignment and outcomes. However, this assumption is often violated, because of which standard policy learning produces biased estimates and thus leads to policies that can be harmful. To address this limitation, we employ causal sensitivity analysis and derive a statistically efficient estimator for a sharp bound on the value function under unobserved confounding. Our estimator has three advantages: (1) Unlike existing works, our estimator avoids unstable minimax optimization based on inverse propensity weighted outcomes. (2) Our estimator is statistically efficient. (3) We prove that our estimator leads to the optimal confounding-robust policy. Finally, we extend our theory to the related task of policy improvement under unobserved confounding, i.e., when a baseline policy such as the standard of care is available. We show in experiments with synthetic and real-world data that our method outperforms simple plug-in approaches and existing baselines. Our method is highly relevant for decision-making where unobserved confounding can be problematic, such as in healthcare and public policy.
中文: 我们提出了一种新颖的个性化反事实策略学习方法,通过因果敏感性分析解决未观测混杂问题,提供了统计高效的估计器以获得最优稳健策略,并在实验中展现出卓越性能。
English: We introduce a novel personalized off-policy learning method that addresses unobserved confounding through causal sensitivity analysis, providing a statistically efficient estimator for optimal robust policies and demonstrating superior performance in experiments.

Authors:Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, Yong Liu
Title: Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger
Abstract:
Large language models (LLMs) have shown remarkable emergent capabilities, transforming the execution of functional tasks by leveraging external tools for complex problems that require specialized processing or up-to-date data. While existing research expands LLMs access to diverse tools (e.g., program interpreters, search engines, calculators), the necessity of using these tools is often overlooked, leading to indiscriminate tool invocation. This naive approach raises two key issues: increased latency due to unnecessary tool calls, and potential errors resulting from faulty interactions with external tools. In this paper, we introduce meta-cognition as a proxy for LLMs self-assessment of their capabilities, reflecting the model's awareness of its own limitations. Based on this, we propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space, guiding when to invoke tools. Notably, MeCo is fine-tuning-free and incurs minimal cost. Experiments across multiple backbone models and benchmarks show that MeCo reliably detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.
Chinese: 大语言模型常低效调用外部工具,因此本文提出基于元认知的MeCo策略,通过检测认知信号实现自适应工具调用而无需微调,从而有效降低延迟和错误率。
English: Large language models often inefficiently use external tools, so this paper introduces MeCo, a metacognition-based strategy that enables adaptive tool invocation by detecting cognitive signals without fine-tuning, thereby reducing latency and errors.

Authors:Mengshi Qi, Changsheng Lv, Huadong Ma
Title: Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning
Abstract:
In this paper, we propose a new Robust Disentangled Counterfactual Learning (RDCL) approach for physical audiovisual commonsense reasoning. The task aims to infer objects' physics commonsense based on both video and audio input, with the main challenge being how to imitate the reasoning ability of humans, even under the scenario of missing modalities. Most of the current methods fail to take full advantage of different characteristics in multi-modal data, and lacking causal reasoning ability in models impedes the progress of implicit physical knowledge inferring. To address these issues, our proposed RDCL method decouples videos into static (time-invariant) and dynamic (time-varying) factors in the latent space by the disentangled sequential encoder, which adopts a variational autoencoder (VAE) to maximize the mutual information with a contrastive loss function. Furthermore, we introduce a counterfactual learning module to augment the model's reasoning ability by modeling physical knowledge relationships among different objects under counterfactual intervention. To alleviate the incomplete modality data issue, we introduce a robust multimodal learning method to recover the missing data by decomposing the shared features and model-specific features. Our proposed method is a plug-and-play module that can be incorporated into any baseline including VLMs. In experiments, we show that our proposed method improves the reasoning accuracy and robustness of baseline methods and achieves the state-of-the-art performance.
中文摘要:本文提出了一种鲁棒解耦反事实学习(RDCL)方法,通过分离视频特征和采用反事实学习来增强物理视听推理能力,在模态缺失情况下仍实现了最先进的性能。
English Summary: This paper introduces a Robust Disentangled Counterfactual Learning (RDCL) method that enhances physical audiovisual reasoning by disentangling video features and employing counterfactual learning, achieving state-of-the-art performance despite missing modalities.

Authors:En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, Wenbing Tao
Title: Unhackable Temporal Rewarding for Scalable Video MLLMs
Abstract:
In the pursuit of superior video-processing MLLMs, we have encountered a perplexing paradox: the "anti-scaling law", where more data and larger models lead to worse performance. This study unmasks the culprit: "temporal hacking", a phenomenon where models shortcut by fixating on select frames, missing the full video narrative. In this work, we systematically establish a comprehensive theory of temporal hacking, defining it from a reinforcement learning perspective, introducing the Temporal Perplexity (TPL) score to assess this misalignment, and proposing the Unhackable Temporal Rewarding (UTR) framework to mitigate the temporal hacking. Both theoretically and empirically, TPL proves to be a reliable indicator of temporal modeling quality, correlating strongly with frame activation patterns. Extensive experiments reveal that UTR not only counters temporal hacking but significantly elevates video comprehension capabilities. This work not only advances video-AI systems but also illuminates the critical importance of aligning proxy rewards with true objectives in MLLM development.
中文总结:本研究揭示了视频处理多模态大模型中导致性能退化的"时间黑客"现象,提出了时间困惑度指标和不可黑客时间奖励框架,不仅能有效应对该问题,还显著提升了视频理解能力。
English Summary: This study identifies "temporal hacking" as the cause of performance degradation in video-processing MLLMs and introduces the Temporal Perplexity metric and Unhackable Temporal Rewarding framework to effectively counteract this issue while enhancing video comprehension.

Authors:Klara Reichard, Giulia Rizzoli, Stefano Gasperini, Lukas Hoyer, Pietro Zanuttigh, Nassir Navab, Federico Tombari
Title: From Open-Vocabulary to Vocabulary-Free Semantic Segmentation
Abstract:
Open-vocabulary semantic segmentation enables models to identify novel object categories beyond their training data. While this flexibility represents a significant advancement, current approaches still rely on manually specified class names as input, creating an inherent bottleneck in real-world applications. This work proposes a Vocabulary-Free Semantic Segmentation pipeline, eliminating the need for predefined class vocabularies. Specifically, we address the chicken-and-egg problem where users need knowledge of all potential objects within a scene to identify them, yet the purpose of segmentation is often to discover these objects. The proposed approach leverages Vision-Language Models to automatically recognize objects and generate appropriate class names, aiming to solve the challenge of class specification and naming quality. Through extensive experiments on several public datasets, we highlight the crucial role of the text encoder in model performance, particularly when the image text classes are paired with generated descriptions. Despite the challenges introduced by the sensitivity of the segmentation text encoder to false negatives within the class tagging process, which adds complexity to the task, we demonstrate that our fully automated pipeline significantly enhances vocabulary-free segmentation accuracy across diverse real-world scenarios.
中文摘要:本文提出了一种无需预定义词汇的语义分割方法,利用视觉语言模型自动识别并命名物体,尽管存在文本编码器对类别标注中假阴性敏感的挑战,但在多种现实场景中显著提升了无词汇分割的准确性。
English Summary: This paper introduces a Vocabulary-Free Semantic Segmentation method that uses Vision-Language Models to automatically recognize and name objects without predefined classes, significantly improving segmentation accuracy across various real-world applications despite challenges in text encoder sensitivity.

Authors:Jinyu Miao, Rujun Yan, Bowei Zhang, Tuopu Wen, Kun Jiang, Mengmeng Yang, Jin Huang, Zhihua Zhong, Diange Yang
Title: Residual Learning towards High-fidelity Vehicle Dynamics Modeling with Transformer
Abstract:
The vehicle dynamics model serves as a vital component of autonomous driving systems, as it describes the temporal changes in vehicle state. In a long period, researchers have made significant endeavors to accurately model vehicle dynamics. Traditional physics-based methods employ mathematical formulae to model vehicle dynamics, but they are unable to adequately describe complex vehicle systems due to the simplifications they entail. Recent advancements in deep learning-based methods have addressed this limitation by directly regressing vehicle dynamics. However, the performance and generalization capabilities still require further enhancement. In this letter, we address these problems by proposing a vehicle dynamics correction system that leverages deep neural networks to correct the state residuals of a physical model instead of directly estimating the states. This system greatly reduces the difficulty of network learning and thus improves the estimation accuracy of vehicle dynamics. Furthermore, we have developed a novel Transformer-based dynamics residual correction network, DyTR. This network implicitly represents state residuals as high-dimensional queries, and iteratively updates the estimated residuals by interacting with dynamics state features. The experiments in simulations demonstrate the proposed system works much better than physics model, and our proposed DyTR model achieves the best performances on dynamics state residual correction task, reducing the state prediction errors of a simple 3 DoF vehicle model by an average of 92.3% and 59.9% in two dataset, respectively.
中文: 本文提出了一种车辆动力学校正系统,利用深度神经网络修正物理模型的状态残差,提高了精度和泛化能力,并通过新型基于Transformer的DyTR网络在仿真中显著降低了预测误差。
English: This letter introduces a vehicle dynamics correction system that uses deep neural networks to refine state residuals from a physical model, enhancing accuracy and generalization, with a novel Transformer-based network, DyTR, achieving significant error reductions in simulations.

Authors:Ming Xie, Chenjie Cao, Yunuo Cai, Xiangyang Xue, Yu-Gang Jiang, Yanwei Fu
Title: AnyRefill: A Unified, Data-Efficient Framework for Left-Prompt-Guided Vision Tasks
Abstract:
In this paper, we present a novel Left-Prompt-Guided (LPG) paradigm to address a diverse range of reference-based vision tasks. Inspired by the human creative process, we reformulate these tasks using a left-right stitching formulation to construct contextual input. Building upon this foundation, we propose AnyRefill, an extension of LeftRefill, that effectively adapts Text-to-Image (T2I) models to various vision tasks. AnyRefill leverages the inpainting priors of advanced T2I model based on the Diffusion Transformer (DiT) architecture, and incorporates flexible components to enhance its capabilities. By combining task-specific LoRAs with the stitching input, AnyRefill unlocks its potential across diverse tasks, including conditional generation, visual perception, and image editing, without requiring additional visual encoders. Meanwhile, AnyRefill exhibits remarkable data efficiency, requiring minimal task-specific fine-tuning while maintaining high generative performance. Through extensive ablation studies, we demonstrate that AnyRefill outperforms other image condition injection methods and achieves competitive results compared to state-of-the-art open-source methods. Notably, AnyRefill delivers results comparable to advanced commercial tools, such as IC-Light and SeedEdit, even in challenging scenarios. Comprehensive experiments and ablation studies across versatile tasks validate the strong generation of the proposed simple yet effective LPG formulation, establishing AnyRefill as a unified, highly data-efficient solution for reference-based vision tasks.
中文: 本文提出左提示引导范式及其实现AnyRefill,通过巧妙结合扩散变换器架构与任务适配模块,仅需少量微调即可在多种参考视觉任务中实现卓越性能,媲美先进商业工具。
English: This paper introduces the Left-Prompt-Guided paradigm and its implementation AnyRefill, which effectively adapts Text-to-Image models to diverse reference-based vision tasks with minimal fine-tuning while achieving competitive performance against state-of-the-art methods.

Authors:Silong Yong, Venkata Nagarjun Pudureddiyur Manivannan, Bernhard Kerbl, Zifu Wan, Simon Stepputtis, Katia Sycara, Yaqi Xie
Title: OMG: Opacity Matters in Material Modeling with Gaussian Splatting
Abstract:
Decomposing geometry, materials and lighting from a set of images, namely inverse rendering, has been a long-standing problem in computer vision and graphics. Recent advances in neural rendering enable photo-realistic and plausible inverse rendering results. The emergence of 3D Gaussian Splatting has boosted it to the next level by showing real-time rendering potentials. An intuitive finding is that the models used for inverse rendering do not take into account the dependency of opacity w.r.t. material properties, namely cross section, as suggested by optics. Therefore, we develop a novel approach that adds this dependency to the modeling itself. Inspired by radiative transfer, we augment the opacity term by introducing a neural network that takes as input material properties to provide modeling of cross section and a physically correct activation function. The gradients for material properties are therefore not only from color but also from opacity, facilitating a constraint for their optimization. Therefore, the proposed method incorporates more accurate physical properties compared to previous works. We implement our method into 3 different baselines that use Gaussian Splatting for inverse rendering and achieve significant improvements universally in terms of novel view synthesis and material modeling.
中文: 本研究提出了一种新颖的逆渲染方法,通过基于辐射传输原理引入材料相关的透明度建模,在多种高斯溅射基线上实现了新视角合成和材质建模的显著提升。
English: This study introduces a novel inverse rendering approach that incorporates material-dependent opacity modeling based on radiative transfer principles, achieving significant improvements in both novel view synthesis and material representation across multiple Gaussian Splatting baselines.

Authors:Xudong Yang, Yizhang Zhu, Hanfeng Liu, Zeyi Wen, Nan Tang, Yuyu Luo
Title: RAMer: Reconstruction-based Adversarial Model for Multi-party Multi-modal Multi-label Emotion Recognition
Abstract:
Conventional Multi-modal multi-label emotion recognition (MMER) assumes complete access to visual, textual, and acoustic modalities. However, real-world multi-party settings often violate this assumption, as non-speakers frequently lack acoustic and textual inputs, leading to a significant degradation in model performance. Existing approaches also tend to unify heterogeneous modalities into a single representation, overlooking each modality's unique characteristics. To address these challenges, we propose RAMer (Reconstruction-based Adversarial Model for Emotion Recognition), which refines multi-modal representations by not only exploring modality commonality and specificity but crucially by leveraging reconstructed features, enhanced by contrastive learning, to overcome data incompleteness and enrich feature quality. RAMer also introduces a personality auxiliary task to complement missing modalities using modality-level attention, improving emotion reasoning. To further strengthen the model's ability to capture label and modality interdependency, we propose a stack shuffle strategy to enrich correlations between labels and modality-specific features. Experiments on three benchmarks, i.e., MEmoR, CMU-MOSEI, and $M^3ED$, demonstrate that RAMer achieves state-of-the-art performance in dyadic and multi-party MMER scenarios.
Chinese: RAMer通过重构特征和对比学习解决多模态数据不完整问题,结合个性辅助任务和堆叠混洗策略增强标签与模态间的关联,在多个基准测试中实现了最先进的性能。
English: RAMer enhances multi-modal emotion recognition by reconstructing features and using contrastive learning to address data incompleteness, while incorporating personality tasks and a stack shuffle strategy to improve label-modality correlations, achieving state-of-the-art results on multiple benchmarks.

Authors:Jiexin Ding, Bowen Zhao, Yuntao Wang, Xinyun Liu, Rui Hao, Ishan Chatterjee, Yuanchun Shi
Title: Unknown Word Detection for English as a Second Language (ESL) Learners Using Gaze and Pre-trained Language Models
Abstract:
English as a Second Language (ESL) learners often encounter unknown words that hinder their text comprehension. Automatically detecting these words as users read can enable computing systems to provide just-in-time definitions, synonyms, or contextual explanations, thereby helping users learn vocabulary in a natural and seamless manner. This paper presents EyeLingo, a transformer-based machine learning method that predicts the probability of unknown words based on text content and eye gaze trajectory in real time with high accuracy. A 20-participant user study revealed that our method can achieve an accuracy of 97.6%, and an F1-score of 71.1%. We implemented a real-time reading assistance prototype to show the effectiveness of EyeLingo. The user study shows improvement in willingness to use and usefulness compared to baseline methods.
中文:EyeLingo是一种基于Transformer的方法,通过结合文本内容和眼动轨迹实时预测ESL学习者的生词,准确率达97.6%,有效提升了阅读辅助的实用性和用户体验。
English: EyeLingo is a transformer-based method that accurately predicts ESL learners' unfamiliar words using text and eye gaze data, achieving 97.6% accuracy and enhancing reading assistance effectiveness.

Authors:Wei Wang, Dong-Dong Wu, Jindong Wang, Gang Niu, Min-Ling Zhang, Masashi Sugiyama
Title: Realistic Evaluation of Deep Partial-Label Learning Algorithms
Abstract:
Partial-label learning (PLL) is a weakly supervised learning problem in which each example is associated with multiple candidate labels and only one is the true label. In recent years, many deep PLL algorithms have been developed to improve model performance. However, we find that some early developed algorithms are often underestimated and can outperform many later algorithms with complicated designs. In this paper, we delve into the empirical perspective of PLL and identify several critical but previously overlooked issues. First, model selection for PLL is non-trivial, but has never been systematically studied. Second, the experimental settings are highly inconsistent, making it difficult to evaluate the effectiveness of the algorithms. Third, there is a lack of real-world image datasets that can be compatible with modern network architectures. Based on these findings, we propose PLENCH, the first Partial-Label learning bENCHmark to systematically compare state-of-the-art deep PLL algorithms. We investigate the model selection problem for PLL for the first time, and propose novel model selection criteria with theoretical guarantees. We also create Partial-Label CIFAR-10 (PLCIFAR10), an image dataset of human-annotated partial labels collected from Amazon Mechanical Turk, to provide a testbed for evaluating the performance of PLL algorithms in more realistic scenarios. Researchers can quickly and conveniently perform a comprehensive and fair evaluation and verify the effectiveness of newly developed algorithms based on PLENCH. We hope that PLENCH will facilitate standardized, fair, and practical evaluation of PLL algorithms in the future.
Chinese: 本文提出了首个部分标签学习基准PLENCH,通过解决实验设置不一致和缺乏真实数据集等关键问题,为PLL算法提供了公平且标准化的评估框架。
English: This paper introduces PLENCH, the first comprehensive benchmark for partial-label learning (PLL), addressing critical issues like inconsistent experimental settings and the lack of real-world datasets to enable fair and standardized evaluation of PLL algorithms.

Authors:Zhipeng Li, Yishu Ji, Ruijia Chen, Tianqi Liu, Yuntao Wang, Yuanchun Shi, Yukang Yan
Title: Modeling the Impact of Visual Stimuli on Redirection Noticeability with Gaze Behavior in Virtual Reality
Abstract:
While users could embody virtual avatars that mirror their physical movements in Virtual Reality, these avatars' motions can be redirected to enable novel interactions. Excessive redirection, however, could break the user's sense of embodiment due to perceptual conflicts between vision and proprioception. While prior work focused on avatar-related factors influencing the noticeability of redirection, we investigate how the visual stimuli in the surrounding virtual environment affect user behavior and, in turn, the noticeability of redirection. Given the wide variety of different types of visual stimuli and their tendency to elicit varying individual reactions, we propose to use users' gaze behavior as an indicator of their response to the stimuli and model the noticeability of redirection. We conducted two user studies to collect users' gaze behavior and noticeability, investigating the relationship between them and identifying the most effective gaze behavior features for predicting noticeability. Based on the data, we developed a regression model that takes users' gaze behavior as input and outputs the noticeability of redirection. We then conducted an evaluation study to test our model on unseen visual stimuli, achieving an accuracy of 0.012 MSE. We further implemented an adaptive redirection technique and conducted a proof-of-concept study to evaluate its effectiveness with complex visual stimuli in two applications. The results indicated that participants experienced less physical demanding and a stronger sense of body ownership when using our adaptive technique, demonstrating the potential of our model to support real-world use cases.
中文: 本研究开发了一种基于视线行为的模型来预测虚拟现实中化身运动重定向的可察觉性,通过响应用户的视觉注意力,自适应技术能减轻身体负担并增强身体拥有感。
English: This study develops a gaze-based model to predict the noticeability of avatar motion redirection in Virtual Reality, enabling adaptive techniques that reduce physical strain and enhance body ownership by responding to users' visual attention.

Authors:Sibo Cheng, Marc Bocquet, Weiping Ding, Tobias Sebastian Finn, Rui Fu, Jinlong Fu, Yike Guo, Eleda Johnson, Siyi Li, Che Liu, Eric Newton Moro, Jie Pan, Matthew Piggott, Cesar Quilodran, Prakhar Sharma, Kun Wang, Dunhui Xiao, Xiao Xue, Yong Zeng, Mingrui Zhang, Hao Zhou, Kewei Zhu, Rossella Arcucci
Title: Machine learning for modelling unstructured grid data in computational physics: a review
Abstract:
Unstructured grid data are essential for modelling complex geometries and dynamics in computational physics. Yet, their inherent irregularity presents significant challenges for conventional machine learning (ML) techniques. This paper provides a comprehensive review of advanced ML methodologies designed to handle unstructured grid data in high-dimensional dynamical systems. Key approaches discussed include graph neural networks, transformer models with spatial attention mechanisms, interpolation-integrated ML methods, and meshless techniques such as physics-informed neural networks. These methodologies have proven effective across diverse fields, including fluid dynamics and environmental simulations. This review is intended as a guidebook for computational scientists seeking to apply ML approaches to unstructured grid data in their domains, as well as for ML researchers looking to address challenges in computational physics. It places special focus on how ML methods can overcome the inherent limitations of traditional numerical techniques and, conversely, how insights from computational physics can inform ML development. To support benchmarking, this review also provides a summary of open-access datasets of unstructured grid data in computational physics. Finally, emerging directions such as generative models with unstructured data, reinforcement learning for mesh generation, and hybrid physics-data-driven paradigms are discussed to inspire future advancements in this evolving field.
中文: 本文综述了计算物理中处理非结构化网格数据的先进机器学习方法,重点介绍了图神经网络和物理信息神经网络等技术,这些方法既能突破传统数值技术的局限,又促进了机器学习与计算物理学的相互启发。
English: This review explores advanced machine learning methods for handling unstructured grid data in computational physics, highlighting techniques like graph neural networks and physics-informed neural networks that overcome traditional limitations while bridging insights between ML and physics.

Authors:Xuzhao Geng, Haozhao Wang, Jun Wang, Wei Liu, Ruixuan Li
Title: Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables
Abstract:
Retrieval-augmented generation (RAG) is a key technique for leveraging external knowledge and reducing hallucinations in large language models (LLMs). However, RAG still struggles to fully prevent hallucinated responses. To address this, it is essential to identify samples prone to hallucination or guide LLMs toward correct responses, which experts then annotate to develop high-quality datasets for refining LLMs. However, the growing scarcity of such datasets makes their creation challenging. This paper proposes using the vast amount of conversations from widespread LLM usage to build these datasets, training LLMs to avoid hallucination-prone questions while accurately responding to manageable ones. Given the impracticality of expert-annotating all conversation records, the paper introduces AL4RAG, which uses active learning to select the most suitable conversation samples for annotation, optimizing performance within an annotation budget. Additionally, recognizing that traditional active learning methods are not fully compatible with RAG due to unsuitable distance metrics, we develop a novel sample distance measurement for RAG active learning. Extensive experiments show that our method consistently outperforms baselines across multiple metrics.
Chinese: 检索增强生成(RAG)仍无法完全避免大语言模型(LLM)的幻觉问题,为此本文提出AL4RAG主动学习方法,通过优选对话样本进行标注,高效构建高质量数据集,训练LLM规避易产生幻觉的提问并准确回答可处理的问题。
English: Retrieval-augmented generation (RAG) still fails to fully prevent hallucinations in large language models (LLMs), so this paper introduces AL4RAG, an active learning approach that selects optimal conversation samples for annotation to efficiently build high-quality datasets and train LLMs to avoid hallucination-prone questions.

Authors:Min Hou, Chenxi Bai, Le Wu, Hao Liu, Kun Zhang, Kai Zhang, Richang Hong, Meng Wang
Title: MoLoRec: A Generalizable and Efficient Framework for LLM-Based Recommendation
Abstract:
Large Language Models (LLMs) have achieved remarkable success in recent years, owing to their impressive generalization capabilities and rich world knowledge. To capitalize on the potential of using LLMs as recommender systems, mainstream approaches typically focus on two paradigms. The first paradigm designs multi-domain or multi-task instruction data for generalizable recommendation, so as to align LLMs with general recommendation areas and deal with cold-start recommendation. The second paradigm enhances domain-specific recommendation tasks with parameter-efficient fine-tuning techniques, in order to improve models under the warm recommendation scenarios. While most previous works treat these two paradigms separately, we argue that they have complementary advantages, and combining them together would be helpful. To that end, in this paper, we propose a generalizable and efficient LLM-based recommendation framework MoLoRec. Our approach starts by parameter-efficient fine-tuning a domain-general module with general recommendation instruction data, to align LLM with recommendation knowledge. Then, given users' behavior of a specific domain, we construct a domain-specific instruction dataset and apply efficient fine-tuning to the pre-trained LLM. After that, we provide approaches to integrate the above domain-general part and domain-specific part with parameters mixture. Please note that, MoLoRec is efficient with plug and play, as the domain-general module is trained only once, and any domain-specific plug-in can be efficiently merged with only domain-specific fine-tuning. Extensive experiments on multiple datasets under both warm and cold-start recommendation scenarios validate the effectiveness and generality of the proposed MoLoRec.
中文: 本文提出MoLoRec框架,通过参数高效微调实现领域通用对齐与领域特定适应的协同融合,在热启动和冷启动推荐场景下均验证了其有效性和普适性。
English: This paper introduces MoLoRec, a novel LLM-based recommendation framework that synergistically combines parameter-efficient fine-tuning for domain-general alignment with domain-specific adaptation to enhance performance in both warm and cold-start scenarios.

Authors:Vy Vo, Lizhen Qu, Tao Feng, Yuncheng Hua, Xiaoxi Kang, Songhai Fan, Tim Dwyer, Lay-Ki Soon, Gholamreza Haffari
Title: ACCESS : A Benchmark for Abstract Causal Event Discovery and Reasoning
Abstract:
Identifying cause-and-effect relationships is critical to understanding real-world dynamics and ultimately causal reasoning. Existing methods for identifying event causality in NLP, including those based on Large Language Models (LLMs), exhibit difficulties in out-of-distribution settings due to the limited scale and heavy reliance on lexical cues within available benchmarks. Modern benchmarks, inspired by probabilistic causal inference, have attempted to construct causal graphs of events as a robust representation of causal knowledge, where \texttt{CRAB} \citep{romanou2023crab} is one such recent benchmark along this line. In this paper, we introduce \texttt{ACCESS}, a benchmark designed for discovery and reasoning over abstract causal events. Unlike existing resources, \texttt{ACCESS} focuses on causality of everyday life events on the abstraction level. We propose a pipeline for identifying abstractions for event generalizations from \texttt{GLUCOSE} \citep{mostafazadeh-etal-2020-glucose}, a large-scale dataset of implicit commonsense causal knowledge, from which we subsequently extract $1,4$K causal pairs. Our experiments highlight the ongoing challenges of using statistical methods and/or LLMs for automatic abstraction identification and causal discovery in NLP. Nonetheless, we demonstrate that the abstract causal knowledge provided in \texttt{ACCESS} can be leveraged for enhancing QA reasoning performance in LLMs.
中文摘要:本文介绍了ACCESS基准,专注于日常生活事件的抽象因果关系推理,通过从GLUCOSE数据提取因果对构建,既揭示了当前方法与LLM在因果发现中的局限,也展示了其提升问答推理能力的潜力。
English Summary: The paper introduces ACCESS, a benchmark for abstract causal event reasoning that addresses limitations in existing NLP methods by leveraging everyday event abstractions from GLUCOSE data, showing both challenges and potential benefits for LLM reasoning enhancement.

Authors:Yoshihiko Furuhashi, Junichi Yamagishi, Xin Wang, Huy H. Nguyen, Isao Echizen
Title: Exploring Active Data Selection Strategies for Continuous Training in Deepfake Detection
Abstract:
In deepfake detection, it is essential to maintain high performance by adjusting the parameters of the detector as new deepfake methods emerge. In this paper, we propose a method to automatically and actively select the small amount of additional data required for the continuous training of deepfake detection models in situations where deepfake detection models are regularly updated. The proposed method automatically selects new training data from a \textit{redundant} pool set containing a large number of images generated by new deepfake methods and real images, using the confidence score of the deepfake detection model as a metric. Experimental results show that the deepfake detection model, continuously trained with a small amount of additional data automatically selected and added to the original training set, significantly and efficiently improved the detection performance, achieving an EER of 2.5% with only 15% of the amount of data in the pool set.
Chinese: 本文提出一种自动主动选择少量新增数据的方法,用于持续训练深度伪造检测模型,仅使用池集中15%的数据便显著提升检测性能,实现2.5%的等错误率。
English: This paper introduces an automated method for actively selecting minimal additional data from a large pool to continuously train deepfake detection models, significantly enhancing detection performance with only 15% of the data and achieving a 2.5% EER.

Authors:Roohan Ahmed Khan, Valerii Serpiva, Demetros Aschalew, Aleksey Fedoseev, Dzmitry Tsetserukou
Title: AgilePilot: DRL-Based Drone Agent for Real-Time Motion Planning in Dynamic Environments by Leveraging Object Detection
Abstract:
Autonomous drone navigation in dynamic environments remains a critical challenge, especially when dealing with unpredictable scenarios including fast-moving objects with rapidly changing goal positions. While traditional planners and classical optimisation methods have been extensively used to address this dynamic problem, they often face real-time, unpredictable changes that ultimately leads to sub-optimal performance in terms of adaptiveness and real-time decision making. In this work, we propose a novel motion planner, AgilePilot, based on Deep Reinforcement Learning (DRL) that is trained in dynamic conditions, coupled with real-time Computer Vision (CV) for object detections during flight. The training-to-deployment framework bridges the Sim2Real gap, leveraging sophisticated reward structures that promotes both safety and agility depending upon environment conditions. The system can rapidly adapt to changing environments, while achieving a maximum speed of 3.0 m/s in real-world scenarios. In comparison, our approach outperforms classical algorithms such as Artificial Potential Field (APF) based motion planner by 3 times, both in performance and tracking accuracy of dynamic targets by using velocity predictions while exhibiting 90% success rate in 75 conducted experiments. This work highlights the effectiveness of DRL in tackling real-time dynamic navigation challenges, offering intelligent safety and agility.
中文: 本文提出基于深度强化学习的AgilePilot运动规划器,通过结合实时计算机视觉和速度预测,显著提升了无人机在动态环境中的导航能力,在适应性和跟踪精度方面均优于传统算法。
English: This paper introduces AgilePilot, a Deep Reinforcement Learning-based motion planner that enhances autonomous drone navigation in dynamic environments by integrating real-time computer vision and velocity predictions, achieving superior adaptability and tracking accuracy compared to classical methods.

Authors:Malaika Zafar, Roohan Ahmed Khan, Aleksey Fedoseev, Kumar Katyayan Jaiswal, Dzmitry Tsetserukou
Title: HetSwarm: Cooperative Navigation of Heterogeneous Swarm in Dynamic and Dense Environments through Impedance-based Guidance
Abstract:
With the growing demand for efficient logistics and warehouse management, unmanned aerial vehicles (UAVs) are emerging as a valuable complement to automated guided vehicles (AGVs). UAVs enhance efficiency by navigating dense environments and operating at varying altitudes. However, their limited flight time, battery life, and payload capacity necessitate a supporting ground station. To address these challenges, we propose HetSwarm, a heterogeneous multi-robot system that combines a UAV and a mobile ground robot for collaborative navigation in cluttered and dynamic conditions. Our approach employs an artificial potential field (APF)-based path planner for the UAV, allowing it to dynamically adjust its trajectory in real time. The ground robot follows this path while maintaining connectivity through impedance links, ensuring stable coordination. Additionally, the ground robot establishes temporal impedance links with low-height ground obstacles to avoid local collisions, as these obstacles do not interfere with the UAV's flight. Experimental validation of HetSwarm in diverse environmental conditions demonstrated a 90% success rate across 30 test cases. The ground robot exhibited an average deviation of 45 cm near obstacles, confirming effective collision avoidance. Extensive simulations in the Gym PyBullet environment further validated the robustness of our system for real-world applications, demonstrating its potential for dynamic, real-time task execution in cluttered environments.
中文: HetSwarm是一种结合无人机与地面机器人的异构多机器人系统,通过实时路径规划和阻抗协调在复杂环境中协同导航,实验验证其成功率高达90%。
English: HetSwarm is a heterogeneous multi-robot system combining a UAV and a ground robot that collaboratively navigate cluttered environments using real-time path planning and impedance-based coordination, achieving a 90% success rate in experiments.

Authors:Malaika Zafar, Roohan Ahmed Khan, Aleksey Fedoseev, Kumar Katyayan Jaiswal, Dzmitry Tsetserukou
Title: HetSwarm: Cooperative Navigation of Heterogeneous Swarm in Dynamic and Dense Environments through Impedance-based Guidance
Abstract:
With the growing demand for efficient logistics and warehouse management, unmanned aerial vehicles (UAVs) are emerging as a valuable complement to automated guided vehicles (AGVs). UAVs enhance efficiency by navigating dense environments and operating at varying altitudes. However, their limited flight time, battery life, and payload capacity necessitate a supporting ground station. To address these challenges, we propose HetSwarm, a heterogeneous multi-robot system that combines a UAV and a mobile ground robot for collaborative navigation in cluttered and dynamic conditions. Our approach employs an artificial potential field (APF)-based path planner for the UAV, allowing it to dynamically adjust its trajectory in real time. The ground robot follows this path while maintaining connectivity through impedance links, ensuring stable coordination. Additionally, the ground robot establishes temporal impedance links with low-height ground obstacles to avoid local collisions, as these obstacles do not interfere with the UAV's flight. Experimental validation of HetSwarm in diverse environmental conditions demonstrated a 90% success rate across 30 test cases. The ground robot exhibited an average deviation of 45 cm near obstacles, confirming effective collision avoidance. Extensive simulations in the Gym PyBullet environment further validated the robustness of our system for real-world applications, demonstrating its potential for dynamic, real-time task execution in cluttered environments.
中文: HetSwarm是一种结合无人机与地面机器人的异构多机器人系统,通过实时路径规划和阻抗协调在复杂环境中协同导航,实验验证其成功率高达90%。
English: HetSwarm is a heterogeneous multi-robot system combining a UAV and a ground robot that collaboratively navigate cluttered environments using real-time path planning and impedance-based coordination, achieving a 90% success rate in experiments.

Authors:Marcell Bartos, Alexandre Didier, Jerome Sieber, Johannes Köhler, Melanie N. Zeilinger
Title: Stochastic MPC with Online-optimized Policies and Closed-loop Guarantees
Abstract:
This paper proposes a stochastic model predictive control method for linear systems affected by additive Gaussian disturbances. Closed-loop satisfaction of probabilistic constraints and recursive feasibility of the underlying convex optimization problem is guaranteed. Optimization over feedback policies online increases performance and reduces conservatism compared to fixed-feedback approaches. The central mechanism is a finitely determined maximal admissible set for probabilistic constraints, together with the reconditioning of the predicted probabilistic constraints on the current knowledge at every time step. The proposed method's reduced conservatism and improved performance in terms of the achieved closed-loop cost is demonstrated in a numerical example.
本文提出了一种针对受高斯扰动线性系统的随机模型预测控制方法,通过在线优化反馈策略保证概率约束满足和优化可行性,有效提升了控制性能并降低了保守性。
This paper introduces a stochastic model predictive control approach for linear systems with Gaussian disturbances, ensuring constraint satisfaction and optimization feasibility while enhancing performance through online feedback policy adjustments.

Authors:Longtao Xiao, Haozhao Wang, Cheng Wang, Linfei Ji, Yifan Wang, Jieming Zhu, Zhenhua Dong, Rui Zhang, Ruixuan Li
Title: UNGER: Generative Recommendation with A Unified Code via Semantic and Collaborative Integration
Abstract:
With the rise of generative paradigms, generative recommendation has garnered increasing attention. The core component is the item code, generally derived by quantizing collaborative or semantic representations to serve as candidate items identifiers in the context. However, existing methods typically construct separate codes for each modality, leading to higher computational and storage costs and hindering the integration of their complementary strengths. Considering this limitation, we seek to integrate two different modalities into a unified code, fully unleashing the potential of complementary nature among modalities. Nevertheless, the integration remains challenging: the integrated embedding obtained by the common concatenation method would lead to underutilization of collaborative knowledge, thereby resulting in limited effectiveness. To address this, we propose a novel method, named UNGER, which integrates semantic and collaborative knowledge into a unified code for generative recommendation. Specifically, we propose to adaptively learn an integrated embedding through the joint optimization of cross-modality knowledge alignment and next item prediction tasks. Subsequently, to mitigate the information loss caused by the quantization process, we introduce an intra-modality knowledge distillation task, using the integrated embeddings as supervised signals to compensate. Extensive experiments on three widely used benchmarks demonstrate the superiority of our approach compared to existing methods.
中文:生成式推荐因多模态编码分离而面临挑战,为此我们提出UNGER方法,通过自适应学习和知识蒸馏将协同与语义知识融合为统一编码,从而提升推荐效果。
English: Generative recommendation faces challenges with separate modality codes, so we propose UNGER, a method that integrates collaborative and semantic knowledge into unified codes through adaptive learning and knowledge distillation to enhance performance.

Authors:Manh Luong, Khai Nguyen, Dinh Phung, Gholamreza Haffari, Lizhen Qu
Title: Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning
Abstract:
Teacher-forcing training for audio captioning usually leads to exposure bias due to training and inference mismatch. Prior works propose the contrastive method to deal with caption degeneration. However, the contrastive method ignores the temporal information when measuring similarity across acoustic and linguistic modalities, leading to inferior performance. In this work, we develop the temporal-similarity score by introducing the unbiased sliced Wasserstein RBF (USW-RBF) kernel equipped with rotary positional embedding to account for temporal information across modalities. In contrast to the conventional sliced Wasserstein RBF kernel, we can form an unbiased estimation of USW-RBF kernel via Monte Carlo estimation. Therefore, it is well-suited to stochastic gradient optimization algorithms, and its approximation error decreases at a parametric rate of $\mathcal{O}(L^{-1/2})$ with $L$ Monte Carlo samples. Additionally, we introduce an audio captioning framework based on the unbiased sliced Wasserstein kernel, incorporating stochastic decoding methods to mitigate caption degeneration during the generation process. We conduct extensive quantitative and qualitative experiments on two datasets, AudioCaps and Clotho, to illustrate the capability of generating high-quality audio captions. Experimental results show that our framework is able to increase caption length, lexical diversity, and text-to-audio self-retrieval accuracy.
中文摘要:本研究提出了一种结合旋转位置编码的无偏切片Wasserstein RBF核的时间相似度评分方法,解决了音频描述中跨模态时序信息忽略的问题,并通过随机解码框架显著提升了描述长度、词汇多样性和音频自检索准确率。
English Summary: This study introduces a temporal-similarity score using an unbiased sliced Wasserstein RBF kernel with rotary positional embedding to address temporal information neglect in audio captioning, and develops a framework that enhances caption length, diversity, and retrieval accuracy through stochastic decoding.

Authors:Zining Zhu, Liang Zhao, Kangheng Lin, Jinze Yang, En Yu, Chenglong Liu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang
Title: PerPO: Perceptual Preference Optimization via Discriminative Rewarding
Abstract:
This paper presents Perceptual Preference Optimization (PerPO), a perception alignment method aimed at addressing the visual discrimination challenges in generative pre-trained multimodal large language models (MLLMs). To align MLLMs with human visual perception process, PerPO employs discriminative rewarding to gather diverse negative samples, followed by listwise preference optimization to rank them.By utilizing the reward as a quantitative margin for ranking, our method effectively bridges generative preference optimization and discriminative empirical risk minimization. PerPO significantly enhances MLLMs' visual discrimination capabilities while maintaining their generative strengths, mitigates image-unconditional reward hacking, and ensures consistent performance across visual tasks. This work marks a crucial step towards more perceptually aligned and versatile MLLMs. We also hope that PerPO will encourage the community to rethink MLLM alignment strategies.
中文: 感知偏好优化(PerPO)是一种创新方法,通过判别式奖励和列表偏好优化增强多模态大语言模型的视觉辨别能力,使其更贴合人类感知并保持生成优势。
English: Perceptual Preference Optimization (PerPO) is a novel method that enhances multimodal large language models' visual discrimination through discriminative rewarding and listwise preference optimization, aligning them more closely with human perception while preserving generative capabilities.

Authors:Valentyn Melnychuk, Dennis Frauen, Jonas Schweisthal, Stefan Feuerriegel
Title: Orthogonal Representation Learning for Estimating Causal Quantities
Abstract:
Representation learning is widely used for estimating causal quantities (e.g., the conditional average treatment effect) from observational data. While existing representation learning methods have the benefit of allowing for end-to-end learning, they do not have favorable theoretical properties of Neyman-orthogonal learners, such as double robustness and quasi-oracle efficiency. Also, such representation learning methods often employ additional constraints, like balancing, which may even lead to inconsistent estimation. In this paper, we propose a novel class of Neyman-orthogonal learners for causal quantities defined at the representation level, which we call OR-learners. Our OR-learners have several practical advantages: they allow for consistent estimation of causal quantities based on any learned representation, while offering favorable theoretical properties including double robustness and quasi-oracle efficiency. In multiple experiments, we show that, under certain regularity conditions, our OR-learners improve existing representation learning methods and achieve state-of-the-art performance. To the best of our knowledge, our OR-learners are the first work to offer a unified framework of representation learning methods and Neyman-orthogonal learners for causal quantities estimation.
Chinese: 本文提出OR-learners,一种新型Neyman正交学习器,将表征学习与稳健因果估计相统一,在提升现有方法性能的同时,具备双重稳健性和准Oracle效率的理论优势。
English: This paper introduces OR-learners, a novel class of Neyman-orthogonal learners that unify representation learning with robust causal estimation, offering double robustness and quasi-oracle efficiency while improving upon existing methods in performance.

Authors:Muhan Lin, Shuyang Shi, Yue Guo, Vaishnav Tadiparthi, Behdad Chalaki, Ehsan Moradi Pari, Simon Stepputtis, Woojun Kim, Joseph Campbell, Katia Sycara
Title: Speaking the Language of Teamwork: LLM-Guided Credit Assignment in Multi-Agent Reinforcement Learning
Abstract:
Credit assignment, the process of attributing credit or blame to individual agents for their contributions to a team's success or failure, remains a fundamental challenge in multi-agent reinforcement learning (MARL), particularly in environments with sparse rewards. Commonly-used approaches such as value decomposition often lead to suboptimal policies in these settings, and designing dense reward functions that align with human intuition can be complex and labor-intensive. In this work, we propose a novel framework where a large language model (LLM) generates dense, agent-specific rewards based on a natural language description of the task and the overall team goal. By learning a potential-based reward function over multiple queries, our method reduces the impact of ranking errors while allowing the LLM to evaluate each agent's contribution to the overall task. Through extensive experiments, we demonstrate that our approach achieves faster convergence and higher policy returns compared to state-of-the-art MARL baselines.
Chinese: 本研究提出了一种新颖框架,利用大型语言模型根据任务描述生成密集的、针对特定智能体的奖励,通过改进信用分配来增强多智能体强化学习,并实现了更快的收敛和更高的策略回报。
English: This study introduces a novel framework using a large language model to generate dense, agent-specific rewards from task descriptions, enhancing multi-agent reinforcement learning by improving credit assignment and achieving superior convergence and policy returns.

Authors:Yunuo Chen, Junli Cao, Anil Kag, Vidit Goel, Sergei Korolev, Chenfanfu Jiang, Sergey Tulyakov, Jian Ren
Title: Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach
Abstract:
We present a novel video generation framework that integrates 3-dimensional geometry and dynamic awareness. To achieve this, we augment 2D videos with 3D point trajectories and align them in pixel space. The resulting 3D-aware video dataset, PointVid, is then used to fine-tune a latent diffusion model, enabling it to track 2D objects with 3D Cartesian coordinates. Building on this, we regularize the shape and motion of objects in the video to eliminate undesired artifacts, \eg, nonphysical deformation. Consequently, we enhance the quality of generated RGB videos and alleviate common issues like object morphing, which are prevalent in current video models due to a lack of shape awareness. With our 3D augmentation and regularization, our model is capable of handling contact-rich scenarios such as task-oriented videos. These videos involve complex interactions of solids, where 3D information is essential for perceiving deformation and contact. Furthermore, our model improves the overall quality of video generation by promoting the 3D consistency of moving objects and reducing abrupt changes in shape and motion.
中文: 本文提出了一种新颖的视频生成框架,通过将2D视频增强为3D点轨迹并微调潜在扩散模型,提升了三维几何和动态感知能力,有效减少了伪影并改善了复杂场景下的视频质量。
English: This paper introduces a novel video generation framework that enhances 3D geometry and dynamic awareness by augmenting 2D videos with 3D point trajectories and fine-tuning a latent diffusion model, effectively reducing artifacts and improving video quality in complex scenarios.

Authors:Paul Youssef, Zhixue Zhao, Daniel Braun, Jörg Schlötterer, Christin Seifert
Title: Position: Editing Large Language Models Poses Serious Safety Risks
Abstract:
Large Language Models (LLMs) contain large amounts of facts about the world. These facts can become outdated over time, which has led to the development of knowledge editing methods (KEs) that can change specific facts in LLMs with limited side effects. This position paper argues that editing LLMs poses serious safety risks that have been largely overlooked. First, we note the fact that KEs are widely available, computationally inexpensive, highly performant, and stealthy makes them an attractive tool for malicious actors. Second, we discuss malicious use cases of KEs, showing how KEs can be easily adapted for a variety of malicious purposes. Third, we highlight vulnerabilities in the AI ecosystem that allow unrestricted uploading and downloading of updated models without verification. Fourth, we argue that a lack of social and institutional awareness exacerbates this risk, and discuss the implications for different stakeholders. We call on the community to (i) research tamper-resistant models and countermeasures against malicious model editing, and (ii) actively engage in securing the AI ecosystem.
中文: 本立场文件警示,用于更新大语言模型中事实的知识编辑方法因其易获取、低成本及潜在恶意用途而构成严重安全风险,呼吁研究防篡改模型并加强AI生态系统安全。
English: This position paper warns that knowledge editing methods for updating facts in large language models pose serious safety risks due to their accessibility, low cost, and potential for malicious use, calling for research into tamper-resistant models and securing the AI ecosystem.

Authors:Qinsi Wang, Jinghan Ke, Masayoshi Tomizuka, Yiran Chen, Kurt Keutzer, Chenfeng Xu
Title: Dobi-SVD: Differentiable SVD for LLM Compression and Some New Perspectives
Abstract:
We provide a new LLM-compression solution via SVD, unlocking new possibilities for LLM compression beyond quantization and pruning. We point out that the optimal use of SVD lies in truncating activations, rather than merely using activations as an optimization distance. Building on this principle, we address three critical challenges in SVD-based LLM compression: including (1) How can we determine the optimal activation truncation position for each weight matrix in LLMs? (2) How can we efficiently reconstruct the weight matrices based on truncated activations? (3) How can we address the inherent "injection" nature that results in the information loss of the SVD? We propose Dobi-SVD, which establishes a new, principled approach to SVD-based LLM compression.
中文摘要:我们提出了一种基于奇异值分解的新LLM压缩方法,通过截断激活而非将其作为优化距离,并针对关键挑战开发了Dobi-SVD解决方案。
English Summary: We introduce a novel LLM compression method using SVD that focuses on truncating activations rather than using them as optimization metrics, addressing key challenges through our proposed Dobi-SVD framework.

Authors:Zeyu Wang, Ruotong Yu, Xiangyang Wang, Jiexin Ding, Jiankai Tang, Jun Fang, Zhe He, Zhuojun Li, Tobias Röddiger, Weiye Xu, Xiyuxing Zhang, huan-ang Gao, Nan Gao, Chun Yu, Yuanchun Shi, Yuntao Wang
Title: Computing with Smart Rings: A Systematic Literature Review
Abstract:
A smart ring is a wearable electronic device in the form of a ring that incorporates diverse sensors and computing technologies to perform a variety of functions. Designed for use with fingers, smart rings are capable of sensing more subtle and abundant hand movements, thus making them a good platform for interaction. Meanwhile, fingers are abundant with blood vessels and nerve endings and accustomed to wearing rings, providing an ideal site for continuous health monitoring through smart rings, which combine comfort with the ability to capture vital biometric data, making them suitable for all-day wear. We collected in total of 206 smart ring-related publications and conducted a systematic literature review. We provide a taxonomy regarding the sensing and feedback modalities, applications, and phenomena. We review and categorize these literatures into four main areas: (1) interaction - input, (2) interaction - output, (3) passive sensing - in body feature, (4) passive sensing - out body activity. This comprehensive review highlights the current advancements within the field of smart ring and identifies potential areas for future research.
中文: 智能戒指是一种利用传感器和计算技术实现交互与健康监测的可穿戴设备,通过对206篇文献的系统综述,将其应用划分为交互和被动感知两大领域。
English: Smart rings are wearable devices that utilize sensors and computing for interaction and health monitoring, with a systematic review of 206 publications categorizing their applications into interaction and passive sensing areas.

Authors:Huakun Luo, Haixu Wu, Hang Zhou, Lanxiang Xing, Yichen Di, Jianmin Wang, Mingsheng Long
Title: Transolver++: An Accurate Neural Solver for PDEs on Million-Scale Geometries
Abstract:
Although deep models have been widely explored in solving partial differential equations (PDEs), previous works are primarily limited to data only with up to tens of thousands of mesh points, far from the million-point scale required by industrial simulations that involve complex geometries. In the spirit of advancing neural PDE solvers to real industrial applications, we present Transolver++, a highly parallel and efficient neural solver that can accurately solve PDEs on million-scale geometries. Building upon previous advancements in solving PDEs by learning physical states via Transolver, Transolver++ is further equipped with an extremely optimized parallelism framework and a local adaptive mechanism to efficiently capture eidetic physical states from massive mesh points, successfully tackling the thorny challenges in computation and physics learning when scaling up input mesh size. Transolver++ increases the single-GPU input capacity to million-scale points for the first time and is capable of continuously scaling input size in linear complexity by increasing GPUs. Experimentally, Transolver++ yields 13% relative promotion across six standard PDE benchmarks and achieves over 20% performance gain in million-scale high-fidelity industrial simulations, whose sizes are 100$\times$ larger than previous benchmarks, covering car and 3D aircraft designs.
中文摘要:Transolver++是一种高度并行的神经PDE求解器,首次实现单GPU百万级网格点处理能力,通过线性扩展在标准基准测试和工业级仿真中均取得显著性能提升。
English Summary: Transolver++ is a highly parallel neural PDE solver that enables million-scale mesh point processing on single GPUs with linear scalability, achieving significant performance improvements in both standard benchmarks and large-scale industrial simulations.

Authors:Lavanya Ratnabala, Aleksey Fedoseev, Robinroy Peter, Dzmitry Tsetserukou
Title: MAGNNET: Multi-Agent Graph Neural Network-based Efficient Task Allocation for Autonomous Vehicles with Deep Reinforcement Learning
Abstract:
This paper addresses the challenge of decentralized task allocation within heterogeneous multi-agent systems operating under communication constraints. We introduce a novel framework that integrates graph neural networks (GNNs) with a centralized training and decentralized execution (CTDE) paradigm, further enhanced by a tailored Proximal Policy Optimization (PPO) algorithm for multi-agent deep reinforcement learning (MARL). Our approach enables unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) to dynamically allocate tasks efficiently without necessitating central coordination in a 3D grid environment. The framework minimizes total travel time while simultaneously avoiding conflicts in task assignments. For the cost calculation and routing, we employ reservation-based A* and R* path planners. Experimental results revealed that our method achieves a high 92.5% conflict-free success rate, with only a 7.49% performance gap compared to the centralized Hungarian method, while outperforming the heuristic decentralized baseline based on greedy approach. Additionally, the framework exhibits scalability with up to 20 agents with allocation processing of 2.8 s and robustness in responding to dynamically generated tasks, underscoring its potential for real-world applications in complex multi-agent scenarios.
中文: 本文提出了一种基于图神经网络和多智能体强化学习的去中心化任务分配框架,使无人机和无人车能在受限环境中高效协调任务分配,实现低冲突率且接近集中式方法的性能。
English: This paper presents a decentralized task allocation framework using graph neural networks and multi-agent reinforcement learning, enabling UAVs and UGVs to efficiently coordinate tasks with minimal conflicts and near-centralized performance in constrained environments.

Authors:Nan Gao, Yibin Liu, Xin Tang, Yanyan Liu, Chun Yu, Yun Huang, Yuntao Wang, Flora D. Salim, Xuhai Orson Xu, Jun Wei, Yuanchun Shi
Title: The Homework Wars: Exploring Emotions, Behaviours, and Conflicts in Parent-Child Homework Interactions
Abstract:
Parental involvement in homework is a crucial aspect of family education, but it often triggers emotional strain and conflicts. Despite growing concern over its impact on family well-being, prior research has lacked access to fine-grained, real-time dynamics of these interactions. To bridge this gap, we present a framework that leverages naturalistic parent-child interaction data and large language models (LLMs) to analyse homework conversations at scale. In a four-week in situ study with 78 Chinese families, we collected 475 hours of audio recordings and accompanying daily surveys, capturing 602 homework sessions in everyday home settings. Our LLM-based pipeline reliably extracted and categorised parental behaviours and conflict patterns from transcribed conversations, achieving high agreement with expert annotations. The analysis revealed significant emotional shifts in parents before and after homework, 18 recurring parental behaviours and seven common conflict types, with Knowledge Conflict being the most frequent. Notably, even well-intentioned behaviours were significantly positively correlated with specific conflicts. This work advances ubiquitous computing methods for studying complex family dynamics and offers empirical insights to enrich family education theory and inform more effective parenting strategies and interventions in the future.
中文: 本研究利用自然亲子互动数据和大语言模型分析家庭作业对话,揭示了78个中国家庭中显著的情绪变化、重复行为及冲突模式,为家庭教育理论和有效育儿策略提供了实证依据。
English: This study introduces a framework using naturalistic parent-child interaction data and large language models to analyze homework conversations, revealing significant emotional shifts, recurring behaviors, and conflict patterns in 78 Chinese families, with implications for family education theory and parenting strategies.

Authors:Yuxin Lin, Mengshi Qi, Liang Liu, Huadong Ma
Title: VLM-Assisted Continual learning for Visual Question Answering in Self-Driving
Abstract:
In this paper, we propose a novel approach for solving the Visual Question Answering (VQA) task in autonomous driving by integrating Vision-Language Models (VLMs) with continual learning. In autonomous driving, VQA plays a vital role in enabling the system to understand and reason about its surroundings. However, traditional models often struggle with catastrophic forgetting when sequentially exposed to new driving tasks, such as perception, prediction, and planning, each requiring different forms of knowledge. To address this challenge, we present a novel continual learning framework that combines VLMs with selective memory replay and knowledge distillation, reinforced by task-specific projection layer regularization. The knowledge distillation allows a previously trained model to act as a "teacher" to guide the model through subsequent tasks, minimizing forgetting. Meanwhile, task-specific projection layers calculate the loss based on the divergence of feature representations, ensuring continuity in learning and reducing the shift between tasks. Evaluated on the DriveLM dataset, our framework shows substantial performance improvements, with gains ranging from 21.40% to 32.28% across various metrics. These results highlight the effectiveness of combining continual learning with VLMs in enhancing the resilience and reliability of VQA systems in autonomous driving. We will release our source code.
Chinese: 本文提出了一种结合视觉语言模型与选择性记忆回放和知识蒸馏的持续学习框架,有效解决了自动驾驶视觉问答系统中的灾难性遗忘问题,在DriveLM数据集上实现了21.40%至32.28%的性能提升。
English: This paper introduces a continual learning framework that integrates Vision-Language Models with selective memory replay and knowledge distillation to prevent catastrophic forgetting in autonomous driving Visual Question Answering systems, achieving performance improvements of 21.40% to 32.28% on the DriveLM dataset.

Authors:Haixu Wu, Yuezhou Ma, Hang Zhou, Huikun Weng, Jianmin Wang, Mingsheng Long
Title: ProPINN: Demystifying Propagation Failures in Physics-Informed Neural Networks
Abstract:
Physics-informed neural networks (PINNs) have earned high expectations in solving partial differential equations (PDEs), but their optimization usually faces thorny challenges due to the unique derivative-dependent loss function. By analyzing the loss distribution, previous research observed the propagation failure phenomenon of PINNs, intuitively described as the correct supervision for model outputs cannot ''propagate'' from initial states or boundaries to the interior domain. Going beyond intuitive understanding, this paper provides a formal and in-depth study of propagation failure and its root cause. Based on a detailed comparison with classical finite element methods, we ascribe the failure to the conventional single-point-processing architecture of PINNs and further prove that propagation failure is essentially caused by the lower gradient correlation of PINN models on nearby collocation points. Compared to superficial loss maps, this new perspective provides a more precise quantitative criterion to identify where and why PINN fails. The theoretical finding also inspires us to present a new PINN architecture, named ProPINN, which can effectively unite the gradients of region points for better propagation. ProPINN can reliably resolve PINN failure modes and significantly surpass advanced Transformer-based models with 46% relative promotion.
中文摘要:本文揭示了物理信息神经网络(PINNs)传播失败的根本原因是相邻配置点间的梯度相关性不足,并提出新型ProPINN架构通过增强区域梯度协同有效解决该问题,相比先进Transformer模型实现46%的相对性能提升。
English Summary: This paper identifies that propagation failure in Physics-Informed Neural Networks (PINNs) stems from low gradient correlation between neighboring points and introduces ProPINN, a novel architecture that enhances gradient coordination to overcome this limitation, achieving a 46% performance improvement over advanced Transformer-based models.

Authors:Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao
Title: CollabLLM: From Passive Responders to Active Collaborators
Abstract:
Large Language Models are typically trained with next-turn rewards, limiting their ability to optimize for long-term interaction. As a result, they often respond passively to ambiguous or open-ended user requests, failing to help users reach their ultimate intents and leading to inefficient conversations. To address these limitations, we introduce CollabLLM, a novel and general training framework that enhances multiturn human-LLM collaboration. Its key innovation is a collaborative simulation that estimates the long-term contribution of responses using Multiturn-aware Rewards. By reinforcement fine-tuning these rewards, CollabLLM goes beyond responding to user requests, and actively uncovers user intent and offers insightful suggestions-a key step towards more human-centered AI. We also devise a multiturn interaction benchmark with three challenging tasks such as document creation. CollabLLM significantly outperforms our baselines with averages of 18.5% higher task performance and 46.3% improved interactivity by LLM judges. Finally, we conduct a large user study with 201 judges, where CollabLLM increases user satisfaction by 17.6% and reduces user spent time by 10.4%.
中文: CollabLLM是一种创新的训练框架,通过多轮感知奖励增强人机长期协作,主动揭示用户意图并提供深刻建议,显著提升了任务完成度和用户满意度。
English: CollabLLM is a novel training framework that enhances long-term human-LLM collaboration by using multiturn-aware rewards to actively uncover user intent and provide insightful suggestions, significantly improving task performance and user satisfaction.

Authors:Jiuyang Dong, Junjun Jiang, Kui Jiang, Jiahan Li, Yongbing Zhang
Title: Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning
Abstract:
Although multi-instance learning (MIL) has succeeded in pathological image classification, it faces the challenge of high inference costs due to processing numerous patches from gigapixel whole slide images (WSIs). To address this, we propose HDMIL, a hierarchical distillation multi-instance learning framework that achieves fast and accurate classification by eliminating irrelevant patches. HDMIL consists of two key components: the dynamic multi-instance network (DMIN) and the lightweight instance pre-screening network (LIPN). DMIN operates on high-resolution WSIs, while LIPN operates on the corresponding low-resolution counterparts. During training, DMIN are trained for WSI classification while generating attention-score-based masks that indicate irrelevant patches. These masks then guide the training of LIPN to predict the relevance of each low-resolution patch. During testing, LIPN first determines the useful regions within low-resolution WSIs, which indirectly enables us to eliminate irrelevant regions in high-resolution WSIs, thereby reducing inference time without causing performance degradation. In addition, we further design the first Chebyshev-polynomials-based Kolmogorov-Arnold classifier in computational pathology, which enhances the performance of HDMIL through learnable activation layers. Extensive experiments on three public datasets demonstrate that HDMIL outperforms previous state-of-the-art methods, e.g., achieving improvements of 3.13% in AUC while reducing inference time by 28.6% on the Camelyon16 dataset.
中文摘要:提出的HDMIL框架通过分层蒸馏机制剔除无关图像块,在降低病理图像分析计算成本的同时,实现了更快推理速度与更高分类精度的双重提升。
English Summary: The proposed HDMIL framework addresses high computational costs in pathological image analysis by using hierarchical distillation to eliminate irrelevant patches, achieving both faster inference and improved classification accuracy.

Authors:Ruta Binkyte, Ivaxi Sheth, Zhijing Jin, Mohammad Havaei, Bernhard Schölkopf, Mario Fritz
Title: Causality Is Key to Understand and Balance Multiple Goals in Trustworthy ML and Foundation Models
Abstract:
Ensuring trustworthiness in machine learning (ML) systems is crucial as they become increasingly embedded in high-stakes domains. This paper advocates for integrating causal methods into machine learning to navigate the trade-offs among key principles of trustworthy ML, including fairness, privacy, robustness, accuracy, and explainability. While these objectives should ideally be satisfied simultaneously, they are often addressed in isolation, leading to conflicts and suboptimal solutions. Drawing on existing applications of causality in ML that successfully align goals such as fairness and accuracy or privacy and robustness, this paper argues that a causal approach is essential for balancing multiple competing objectives in both trustworthy ML and foundation models. Beyond highlighting these trade-offs, we examine how causality can be practically integrated into ML and foundation models, offering solutions to enhance their reliability and interpretability. Finally, we discuss the challenges, limitations, and opportunities in adopting causal frameworks, paving the way for more accountable and ethically sound AI systems.
Chinese: 本文主张将因果方法融入机器学习,以平衡公平性、隐私和准确性等相互竞争的目标,强调因果框架对于提升机器学习系统及基础模型的可信度与可解释性至关重要。
English: This paper advocates for integrating causal methods into machine learning to balance competing objectives like fairness, privacy, and accuracy, arguing that causality is essential for enhancing the trustworthiness and interpretability of ML systems and foundation models.

Authors:Mohammad Rifqi Farhansyah, Iwan Darmawan, Adryan Kusumawardhana, Genta Indra Winata, Alham Fikri Aji, Derry Tanti Wijaya
Title: Do Language Models Understand Honorific Systems in Javanese?
Abstract:
The Javanese language features a complex system of honorifics that vary according to the social status of the speaker, listener, and referent. Despite its cultural and linguistic significance, there has been limited progress in developing a comprehensive corpus to capture these variations for natural language processing (NLP) tasks. In this paper, we present Unggah-Ungguh, a carefully curated dataset designed to encapsulate the nuances of Unggah-Ungguh Basa, the Javanese speech etiquette framework that dictates the choice of words and phrases based on social hierarchy and context. Using Unggah-Ungguh, we assess the ability of language models (LMs) to process various levels of Javanese honorifics through classification and machine translation tasks. To further evaluate cross-lingual LMs, we conduct machine translation experiments between Javanese (at specific honorific levels) and Indonesian. Additionally, we explore whether LMs can generate contextually appropriate Javanese honorifics in conversation tasks, where the honorific usage should align with the social role and contextual cues. Our findings indicate that current LMs struggle with most honorific levels, exhibitinga bias toward certain honorific tiers.
中文: 本文介绍了Unggah-Ungguh数据集以解决爪哇语敬语在自然语言处理中资源匮乏的问题,发现当前语言模型在多种任务中难以准确处理和生成符合语境的敬语层级。
English: The paper introduces the Unggah-Ungguh dataset to address the scarcity of resources for Javanese honorifics in NLP, revealing that current language models struggle with accurately processing and generating context-appropriate honorific levels across various tasks.

Authors:Lance Ying, Katherine M. Collins, Lionel Wong, Ilia Sucholutsky, Ryan Liu, Adrian Weller, Tianmin Shu, Thomas L. Griffiths, Joshua B. Tenenbaum
Title: On Benchmarking Human-Like Intelligence in Machines
Abstract:
Recent benchmark studies have claimed that AI has approached or even surpassed human-level performances on various cognitive tasks. However, this position paper argues that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities. We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically-invalid tasks. We support our claims by conducting a human evaluation study on ten existing AI benchmarks, suggesting significant biases and flaws in task and label designs. To address these limitations, we propose five concrete recommendations for developing future benchmarks that will enable more rigorous and meaningful evaluations of human-like cognitive capacities in AI with various implications for such AI applications.
中文摘要:该立场文件批评现有AI评估方法在衡量类人认知能力方面存在不足,指出其缺乏人工验证标签、忽视人类反应多样性等缺陷,并提出五项改进建议以建立更严谨的评估标准。
English Summary: This position paper critiques current AI evaluation methods as inadequate for measuring human-like cognition, highlighting flaws like unvalidated labels and unrealistic tasks, and proposes five recommendations for more rigorous benchmarks.

Authors:Yu Yan, Sheng Sun, Zixiang Tang, Teli Liu, Min Liu
Title: Collaborative Stance Detection via Small-Large Language Model Consistency Verification
Abstract:
Stance detection on social media aims to identify attitudes expressed in tweets towards specific targets. Current studies prioritize Large Language Models (LLMs) over Small Language Models (SLMs) due to the overwhelming performance improving provided by LLMs. However, heavily relying on LLMs for stance detection, regardless of the cost, is impractical for real-world social media monitoring systems that require vast data analysis. To this end, we propose \textbf{\underline{Co}}llaborative Stance Detection via Small-Large Language Model Consistency \textbf{\underline{Ver}}ification (\textbf{CoVer}) framework, which enhances LLM utilization via context-shared batch reasoning and logical verification between LLM and SLM. Specifically, instead of processing each text individually, CoVer processes texts batch-by-batch, obtaining stance predictions and corresponding explanations via LLM reasoning in a shared context. Then, to exclude the bias caused by context noises, CoVer introduces the SLM for logical consistency verification. Finally, texts that repeatedly exhibit low logical consistency are classified using consistency-weighted aggregation of prior LLM stance predictions. Our experiments show that CoVer outperforms state-of-the-art methods across multiple benchmarks in the zero-shot setting, achieving 0.54 LLM queries per tweet while significantly enhancing performance. Our CoVer offers a more practical solution for LLM deploying for social media stance detection.
中文: CoVer框架通过大型语言模型的批量推理与小型语言模型的逻辑验证相结合,以更低成本实现了更优的立场检测性能。
English: The CoVer framework enhances stance detection by combining batch reasoning with Large Language Models and logical verification using Small Language Models, achieving superior performance with reduced computational costs.

Authors:Joshua Kazdan, Abhay Puri, Rylan Schaeffer, Lisa Yu, Chris Cundy, Jason Stanley, Sanmi Koyejo, Krishnamurthy Dvijotham
Title: No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms
Abstract:
Leading language model (LM) providers like OpenAI and Anthropic allow customers to fine-tune frontier LMs for specific use cases. To prevent abuse, these providers apply filters to block fine-tuning on overtly harmful data. In this setting, we make three contributions: First, while past work has shown that safety alignment is "shallow", we correspondingly demonstrate that existing fine-tuning attacks are shallow -- attacks target only the first several tokens of the model response, and consequently can be blocked by generating the first several response tokens with an aligned model. Second, we conceptually illustrate how to make attacks deeper by introducing a new fine-tuning attack that trains models to first refuse harmful requests before answering them; this "refuse-then-comply" strategy bypasses shallow defenses and produces harmful responses that evade output filters. Third, we demonstrate the potency of our new fine-tuning attack by jailbreaking both open-source models equipped with defenses and production models, achieving attack success rates of 57% and 72% against GPT-4o and Claude Haiku, respectively. Our attack received a $2000 bug bounty from OpenAI and was acknowledged as a vulnerability by Anthropic. Our work undermines the notion that models are safe because they initially refuse harmful requests and broadens awareness of the scope of attacks that face production fine-tuning APIs.
中文: 本研究揭示现有微调攻击具有浅层性,可通过对齐模型防御,但提出"先拒绝后执行"策略能有效突破防护生成有害内容,对GPT-4o和Claude Haiku等主流模型的攻击成功率分别达57%和72%。
English: This study reveals that current fine-tuning attacks are shallow and can be blocked by aligned models, but introduces a "refuse-then-comply" strategy that successfully bypasses defenses to produce harmful responses, achieving high attack rates against major models like GPT-4o and Claude Haiku.

Authors:Ru Peng, Kexin Yang, Yawen Zeng, Junyang Lin, Dayiheng Liu, Junbo Zhao
Title: DataMan: Data Manager for Pre-training Large Language Models
Abstract:
The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we are inspired by ``reverse thinking'' -- prompting LLMs to self-identify which criteria benefit its performance. As its pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a Data Manager (DataMan) to learn quality ratings and domain recognition from pointwise rating, and use it to annotate a 447B token pre-training corpus with 14 quality ratings and domain type. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. The best-performing model, based on the Overall Score l=5 surpasses a model trained with 50% more data using uniform sampling. We continue pre-training with high-rated, domain-specific data annotated by DataMan to enhance domain-specific ICL performance and thus verify DataMan's domain mixing ability. Our findings emphasize the importance of quality ranking, the complementary nature of quality criteria, and their low correlation with perplexity, analyzing misalignment between PPL and ICL performance. We also thoroughly analyzed our pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.
Chinese: 本研究提出DataMan数据管理器,通过基于困惑度分析得出的14项质量指标筛选高质量预训练数据,实验证明其选出的数据在减少30%训练量的情况下,仍使模型在上下文学习、困惑度和指令遵循能力上显著超越基线模型。
English: This study introduces DataMan, a data manager trained to select high-quality pre-training data for large language models by applying 14 quality criteria derived from perplexity analysis, which significantly improves model performance even with less data compared to uniform sampling.

Authors:Heng Er Metilda Chee, Jiayin Wang, Zhiqiang Guo, Weizhi Ma, Qinglang Guo, Min Zhang
Title: U-Sticker: A Large-Scale Multi-Domain User Sticker Dataset for Retrieval and Personalization
Abstract:
Instant messaging with texts and stickers has become a widely adopted communication medium, enabling efficient expression of user semantics and emotions. With the increased use of stickers conveying information and feelings, sticker retrieval and recommendation has emerged as an important area of research. However, a major limitation in existing literature has been the lack of datasets capturing temporal and user-specific sticker interactions, which has hindered further progress in user modeling and sticker personalization. To address this, we introduce User-Sticker, a dataset that includes temporal and user anonymous ID across conversations. It is the largest publicly available sticker dataset to date, containing 22K unique users, 370K stickers, and 8.3M messages. The raw data was collected from a popular messaging platform from 67 conversations over 720 hours of crawling. All text and image data were carefully vetted for safety and privacy checks and modifications. Spanning 10 domains, the U-Sticker dataset captures rich temporal, multilingual, and cross-domain behaviors not previously available in other datasets. Extensive quantitative and qualitative experiments demonstrate U-Sticker's practical applications in user behavior modeling and personalized recommendation and highlight its potential to further research areas in personalized retrieval and conversational studies. U-Sticker dataset is publicly available.
中文:User-Sticker数据集填补了现有贴纸交互数据缺乏时间和用户特定信息的空白,作为最大的公开数据集,包含2.2万用户和830万条消息,有力推动了个性化贴纸推荐和用户行为建模的研究发展。
English: The User-Sticker dataset addresses the gap in temporal and user-specific sticker interaction data by providing the largest publicly available collection, featuring 22K users and 8.3M messages, to advance research in personalized sticker recommendation and user behavior modeling.

Authors:Weiming Hu, Haoyan Zhang, Cong Guo, Yu Feng, Renyang Guan, Zhendong Hua, Zihan Liu, Yue Guan, Minyi Guo, Jingwen Leng
Title: M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type
Abstract:
Large language models (LLMs) are one of the most important killer computer applications. The recent algorithmic advancement proposes a fine-grained group-wise quantization for LLMs, which treats a small set (e.g., 64) of values in a tensor as a compression unit. It effectively preserves the model accuracy without retraining, and has become the standard approach to efficiently deploy LLMs. On the other hand, there are works that propose various adaptive data types to better adapt to different distributions and further reduce the required bit length for LLMs. In this work, our detailed analysis unveils a key finding that while different tensors exhibit similar distributions, small groups can have markedly different distributions. As such, the group-level diversity requires a new level of adaptivity for which existing adaptive data types fail to provide. In this paper, we propose MANT, a mathematically adaptive numeric type, featuring a more flexible encoding paradigm with a wider range of data distribution and more efficient decodingcomputation fusion mechanism to address these challenges. Based on MANT, we develop a supporting framework to assign the appropriate data type for each group adaptively. Meanwhile, the dynamically generated Key-Value (KV) caches in LLMs introduce further complexity for real-time quantization. To tackle this, we propose an efficient real-time quantization mechanism. Besides, we implement a specific processing element (PE) to efficiently support MANT and incorporate a real-time quantization unit. By integrating these components into a systolic array, MANT unifies the group-wise weight and KV cache quantization and addresses the associated challenges. Our evaluation shows achieving, on average, 2.99x (up to 4.46x) speedup and 2.81x (up to 4.10x) energy reduction to the state-of-the-art LLM accelerator.
中文: 本文提出MANT,一种数学自适应数值类型,通过灵活的编码和高效的计算解决了现有量化方法的局限性,相比最先进的大语言模型加速器,实现了显著的加速和能耗降低。
English: This paper introduces MANT, a mathematically adaptive numeric type that addresses the limitations of existing quantization methods by offering flexible encoding and efficient computation for large language models, achieving significant speedup and energy reduction compared to state-of-the-art accelerators.

Authors:Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Zhaorui Zhang, Jinyang Liu, Xiaoyi Lu, Ken Raffenetti, Hui Zhou, Kai Zhao, Khalid Alharthi, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur
Title: ZCCL: Significantly Improving Collective Communication With Error-Bounded Lossy Compression
Abstract:
With the ever-increasing computing power of supercomputers and the growing scale of scientific applications, the efficiency of MPI collective communication turns out to be a critical bottleneck in large-scale distributed and parallel processing. The large message size in MPI collectives is particularly concerning because it can significantly degrade overall parallel performance. To address this issue, prior research simply applies off-the-shelf fixed-rate lossy compressors in the MPI collectives, leading to suboptimal performance, limited generalizability, and unbounded errors. In this paper, we propose a novel solution, called ZCCL, which leverages error-bounded lossy compression to significantly reduce the message size, resulting in a substantial reduction in communication costs. The key contributions are three-fold. (1) We develop two general, optimized lossy-compression-based frameworks for both types of MPI collectives (collective data movement as well as collective computation), based on their particular characteristics. Our framework not only reduces communication costs but also preserves data accuracy. (2) We customize fZ-light, an ultra-fast error-bounded lossy compressor, to meet the specific needs of collective communication. (3) We integrate ZCCL into multiple collectives, such as Allgather, Allreduce, Scatter, and Broadcast, and perform a comprehensive evaluation based on real-world scientific application datasets. Experiments show that our solution outperforms the original MPI collectives as well as multiple baselines by 1.9--8.9X.
Chinese: ZCCL采用误差有损压缩技术优化MPI集合通信,在保证数据精度的同时将通信性能提升1.9至8.9倍。
English: ZCCL introduces error-bounded lossy compression to optimize MPI collective communication, reducing message size and improving performance by 1.9–8.9 times while maintaining data accuracy.

Authors:Frederikus Hudi, Genta Indra Winata, Ruochen Zhang, Alham Fikri Aji
Title: TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning
Abstract:
Reasoning is a fundamental capability of large language models (LLMs), enabling them to comprehend, analyze, and solve complex problems. In this paper, we introduce TextGames, an innovative benchmark specifically crafted to assess LLMs through demanding text-based games that require advanced skills in pattern recognition, spatial awareness, arithmetic, and logical reasoning. Our analysis probes LLMs' performance in both single-turn and multi-turn reasoning, and their abilities in leveraging feedback to correct subsequent answers through self-reflection. Our findings reveal that, although LLMs exhibit proficiency in addressing most easy and medium-level problems, they face significant challenges with more difficult tasks. In contrast, humans are capable of solving all tasks when given sufficient time. Moreover, we observe that LLMs show improved performance in multi-turn predictions through self-reflection, yet they still struggle with sequencing, counting, and following complex rules consistently. Additionally, models optimized for reasoning outperform pre-trained LLMs that prioritize instruction following, highlighting the crucial role of reasoning skills in addressing highly complex problems.
中文:本文提出了TextGames基准,通过基于文本的游戏评估大型语言模型的推理能力,发现尽管LLM在简单任务上表现良好并能通过自我反思改进,但在复杂挑战如排序和规则遵循方面仍远逊于人类。
English: This paper introduces TextGames, a benchmark for evaluating large language models' reasoning skills through text-based games, revealing that while LLMs perform well on easier tasks and improve with self-reflection, they struggle with complex challenges like sequencing and rule-following compared to humans.

Authors:Rylan Schaeffer, Punit Singh Koura, Binh Tang, Ranjan Subramanian, Aaditya K Singh, Todor Mihaylov, Prajjwal Bhargava, Lovish Madaan, Niladri S. Chatterji, Vedanuj Goswami, Sergey Edunov, Dieuwke Hupkes, Sanmi Koyejo, Sharan Narang
Title: Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
Abstract:
The explosion of high-performing conversational language models (LMs) has spurred a shift from classic natural language processing (NLP) benchmarks to expensive, time-consuming and noisy human evaluations - yet the relationship between these two evaluation strategies remains hazy. In this paper, we conduct a large-scale study of four Chat Llama 2 models, comparing their performance on 160 standard NLP benchmarks (e.g., MMLU, ARC, BIG-Bench Hard) against extensive human preferences on more than 11k single-turn and 2k multi-turn dialogues from over 2k human annotators. Our findings are striking: most NLP benchmarks strongly correlate with human evaluations, suggesting that cheaper, automated metrics can serve as surprisingly reliable predictors of human preferences. Three human evaluations, such as adversarial dishonesty and safety, are anticorrelated with NLP benchmarks, while two are uncorrelated. Moreover, through overparameterized linear regressions, we show that NLP scores can accurately predict human evaluations across different model scales, offering a path to reduce costly human annotation without sacrificing rigor. Overall, our results affirm the continued value of classic benchmarks and illuminate how to harness them to anticipate real-world user satisfaction - pointing to how NLP benchmarks can be leveraged to meet evaluation needs of our new era of conversational AI.
中文: 大多数自然语言处理基准与人类评估高度相关,表明自动化指标能可靠预测人类偏好并减少昂贵的人工标注,但安全性和对抗性评估等方面存在负相关。
English: Most NLP benchmarks strongly correlate with human evaluations, showing automated metrics can reliably predict human preferences and reduce costly annotation, though some safety and adversarial aspects show anticorrelation.

Authors:Wenkai Yang, Shuming Ma, Yankai Lin, Furu Wei
Title: Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning
Abstract:
Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks. While current researches continue to explore the benefits of increasing test-time compute by extending the CoT lengths of Large Language Models (LLMs), we are concerned about a potential issue hidden behind the current pursuit of test-time scaling: Would excessively scaling the CoT length actually bring adverse effects to a model's reasoning performance? Our explorations on mathematical reasoning tasks reveal an unexpected finding that scaling with longer CoTs can indeed impair the reasoning performance of LLMs in certain domains. Moreover, we discover that there exists an optimal scaled length distribution that differs across different domains. Based on these insights, we propose a Thinking-Optimal Scaling strategy. Our method first uses a small set of seed data with varying response length distributions to teach the model to adopt different reasoning efforts for deep thinking. Then, the model selects its shortest correct response under different reasoning efforts on additional problems for self-improvement. Our self-improved models built upon Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across various math benchmarks, and achieve performance on par with QwQ-32B-Preview.
中文摘要:最新研究发现过长的思维链反而会损害大语言模型的推理性能,据此提出的思维最优扩展策略通过让模型自主选择最佳推理长度实现自我改进,在数学基准测试中达到领先水平。
English Summary: Recent research finds that excessively long Chain of Thoughts can impair LLM reasoning performance, leading to a proposed Thinking-Optimal Scaling strategy that enables models to self-improve by selecting optimal reasoning lengths, achieving state-of-the-art results on math benchmarks.

Authors:Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, Sanmi Koyejo
Title: How Do Large Language Monkeys Get Their Power (Laws)?
Abstract:
Recent research across mathematical problem solving, proof assistant programming and multimodal jailbreaking documents a striking finding: when (multimodal) language model tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- then the negative log of the average success rate scales a power law in the number of attempts. In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that on each problem, the failure rate should fall exponentially with the number of attempts. We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy tailed such that a small fraction of tasks with extremely low success probabilities collectively warp the aggregate success trend into a power law - even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power law scaling, and provides a simple method for forecasting the power law exponent with an order of magnitude lower relative error, or equivalently, ${\sim}2-4$ orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves with scaling inference compute and the development of scaling-predictable evaluations of (multimodal) language models.
Chinese: 最新研究表明,尽管单个任务在多次尝试下失败率呈指数级下降,但由于任务成功概率呈重尾分布,其中极少数极难任务扭曲了整体趋势,导致聚合性能遵循幂律规律。
English: Recent research reveals that while individual tasks exhibit exponential improvement in failure rates with increased attempts, the aggregate performance across tasks follows a power law due to a heavy-tailed distribution of success probabilities, where a small fraction of extremely difficult tasks distorts the overall trend.

Authors:Prafulla Kumar Choubey, Xiangyu Peng, Shilpa Bhagavath, Caiming Xiong, Shiva Kumar Pentyala, Chien-Sheng Wu
Title: Turning Conversations into Workflows: A Framework to Extract and Evaluate Dialog Workflows for Service AI Agents
Abstract:
Automated service agents require well-structured workflows to provide consistent and accurate responses to customer queries. However, these workflows are often undocumented, and their automatic extraction from conversations remains unexplored. In this work, we present a novel framework for extracting and evaluating dialog workflows from historical interactions. Our extraction process consists of two key stages: (1) a retrieval step to select relevant conversations based on key procedural elements, and (2) a structured workflow generation process using a question-answer-based chain-of-thought (QA-CoT) prompting. To comprehensively assess the quality of extracted workflows, we introduce an automated agent and customer bots simulation framework that measures their effectiveness in resolving customer issues. Extensive experiments on the ABCD and SynthABCD datasets demonstrate that our QA-CoT technique improves workflow extraction by 12.16\% in average macro accuracy over the baseline. Moreover, our evaluation method closely aligns with human assessments, providing a reliable and scalable framework for future research.
中文: 本文提出了一种通过两阶段问答链式思维提示从历史对话中提取工作流程的新框架,实现了12.16%的准确率提升,并开发出与人工评估高度契合的自动化评估系统。
English: This paper introduces a novel framework for extracting dialog workflows from historical conversations using a two-stage QA-CoT prompting method, demonstrating a 12.16% improvement in accuracy and presenting an automated evaluation system that aligns closely with human judgment.

Authors:Yao Zhang, Yuyi Mao, Hui Wang, Zhiwen Yu, Song Guo, Jun Zhang, Liang Wang, Bin Guo
Title: Orchestrating Joint Offloading and Scheduling for Low-Latency Edge SLAM
Abstract:
Visual Simultaneous Localization and Mapping (vSLAM) is a prevailing technology for many emerging robotic applications. Achieving real-time SLAM on mobile robotic systems with limited computational resources is challenging because the complexity of SLAM algorithms increases over time. This restriction can be lifted by offloading computations to edge servers, forming the emerging paradigm of edge-assisted SLAM. Nevertheless, the exogenous and stochastic input processes affect the dynamics of the edge-assisted SLAM system. Moreover, the requirements of clients on SLAM metrics change over time, exerting implicit and time-varying effects on the system. In this paper, we aim to push the limit beyond existing edge-assist SLAM by proposing a new architecture that can handle the input-driven processes and also satisfy clients' implicit and time-varying requirements. The key innovations of our work involve a regional feature prediction method for importance-aware local data processing, a configuration adaptation policy that integrates data compression/decompression and task offloading, and an input-dependent learning framework for task scheduling with constraint satisfaction. Extensive experiments prove that our architecture improves pose estimation accuracy and saves up to 47% of communication costs compared with a popular edge-assisted SLAM system, as well as effectively satisfies the clients' requirements.
中文摘要:本文提出了一种新型边缘辅助视觉SLAM架构,通过区域特征预测、自适应配置策略和输入相关学习框架,有效处理动态输入并满足客户端时变需求,在提高位姿精度的同时节省了47%的通信成本。
English Summary: The paper introduces a novel edge-assisted vSLAM architecture that addresses dynamic input processes and evolving client requirements through regional feature prediction, adaptive configuration policies, and input-dependent learning, achieving higher pose accuracy and 47% lower communication costs.

Authors:Ke Li, Fei Liu, Zhengkun Wang, Qingfu Zhang
Title: Destroy and Repair Using Hyper Graphs for Routing
Abstract:
Recent advancements in Neural Combinatorial Optimization (NCO) have shown promise in solving routing problems like the Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP) without handcrafted designs. Research in this domain has explored two primary categories of methods: iterative and non-iterative. While non-iterative methods struggle to generate near-optimal solutions directly, iterative methods simplify the task by learning local search steps. However, existing iterative methods are often limited by restricted neighborhood searches, leading to suboptimal results. To address this limitation, we propose a novel approach that extends the search to larger neighborhoods by learning a destroy-and-repair strategy. Specifically, we introduce a Destroy-and-Repair framework based on Hyper-Graphs (DRHG). This framework reduces consecutive intact edges to hyper-edges, allowing the model to pay more attention to the destroyed part and decrease the complexity of encoding all nodes. Experiments demonstrate that DRHG achieves stateof-the-art performance on TSP with up to 10,000 nodes and shows strong generalization to real-world TSPLib and CVRPLib problems.
Chinese: 近期神经组合优化研究提出了一种基于超图的破坏-修复框架(DRHG),通过将连续完整边简化为超边来扩展邻域搜索,在大规模路径问题上实现了最优性能,并展现出强大的泛化能力。
English: Recent Neural Combinatorial Optimization research has introduced a novel Destroy-and-Repair framework based on Hyper-Graphs (DRHG), which expands neighborhood searches by reducing intact edges to hyper-edges, achieving state-of-the-art performance on large-scale routing problems with strong generalization capabilities.

Authors:Kai Li, Fei Liu, Zhenkun Wang, Xialiang Tong, Xiongwei Han, Mingxuan Yuan, Qingfu Zhang
Title: ARS: Automatic Routing Solver with Large Language Models
Abstract:
Real-world Vehicle Routing Problems (VRPs) are characterized by a variety of practical constraints, making manual solver design both knowledge-intensive and time-consuming. Although there is increasing interest in automating the design of routing algorithms, existing research has explored only a limited array of VRP variants and fails to adequately address the complex and prevalent constraints encountered in real-world situations. To fill this gap, this paper introduces RoutBench, a benchmark of 1,000 VRP variants derived from 24 attributes, for evaluating the effectiveness of automatic routing solvers in addressing complex constraints. Along with RoutBench, we present the Automatic Routing Solver (ARS), which employs Large Language Model (LLM) agents to enhance a backbone algorithm framework by automatically generating constraint-aware heuristic code, based on problem descriptions and several representative constraints selected from a database. Our experiments show that ARS outperforms state-of-the-art LLM-based methods and commonly used solvers, automatically solving 91.67% of common VRPs and achieving at least a 30% improvement across all benchmarks.
中文摘要:本文提出了包含1000种变体的RoutBench基准测试和基于大语言模型的ARS自动求解器,该求解器能根据问题描述生成约束感知启发式代码,在实验中解决了91.67%的常见车辆路径问题,并在所有基准测试中实现至少30%的性能提升。
English Summary: This paper introduces RoutBench, a comprehensive benchmark of 1,000 VRP variants, and ARS, an automatic solver using LLM agents to generate constraint-aware heuristics, which significantly outperforms existing methods by solving 91.67% of common VRPs and improving performance by at least 30% across benchmarks.

Authors:Jianming Chang, Xin Zhou, Lulu Wang, David Lo, Bixin Li
Title: Bridging Bug Localization and Issue Fixing: A Hierarchical Localization Framework Leveraging Large Language Models
Abstract:
Automated issue fixing is a critical task in software debugging and has recently garnered significant attention from academia and industry. However, existing fixing techniques predominantly focus on the repair phase, often overlooking the importance of improving the preceding bug localization phase. As a foundational step in issue fixing, bug localization plays a pivotal role in determining the overall effectiveness of the entire process. To enhance the precision of issue fixing by accurately identifying bug locations in large-scale projects, this paper presents BugCerberus, the first hierarchical bug localization framework powered by three customized large language models. First, BugCerberus analyzes intermediate representations of bug-related programs at file, function, and statement levels and extracts bug-related contextual information from the representations. Second, BugCerberus designs three customized LLMs at each level using bug reports and contexts to learn the patterns of bugs. Finally, BugCerberus hierarchically searches for bug-related code elements through well-tuned models to localize bugs at three levels. With BugCerberus, we further investigate the impact of bug localization on the issue fixing. We evaluate BugCerberus on the widely-used benchmark SWE-bench-lite. The experimental results demonstrate that BugCerberus outperforms all baselines. Specifically, at the fine-grained statement level, BugCerberus surpasses the state-of-the-art in Top-N (N=1, 3, 5, 10) by 16.5%, 5.4%, 10.2%, and 23.1%, respectively. Moreover, in the issue fixing experiments, BugCerberus improves the fix rate of the existing issue fixing approach Agentless by 17.4% compared to the best baseline, highlighting the significant impact of enhanced bug localization on automated issue fixing.
中文:本文提出了BugCerterus,首个采用三个定制化大语言模型的分层缺陷定位框架,通过在文件、函数和语句级别精准识别缺陷来增强自动化问题修复能力,显著超越现有方法并将修复率提升17.4%。
English: This paper introduces BugCerterus, the first hierarchical bug localization framework using three customized large language models to enhance automated issue fixing by accurately identifying bugs at file, function, and statement levels, significantly outperforming existing methods and improving fix rates by 17.4%.

Authors:Cheng Li, Keyuan Zhou, Tong Liu, Yu Wang, Mingqiao Zhuang, Huan-ang Gao, Bu Jin, Hao Zhao
Title: AVD2: Accident Video Diffusion for Accident Video Description
Abstract:
Traffic accidents present complex challenges for autonomous driving, often featuring unpredictable scenarios that hinder accurate system interpretation and responses. Nonetheless, prevailing methodologies fall short in elucidating the causes of accidents and proposing preventive measures due to the paucity of training data specific to accident scenarios. In this work, we introduce AVD2 (Accident Video Diffusion for Accident Video Description), a novel framework that enhances accident scene understanding by generating accident videos that aligned with detailed natural language descriptions and reasoning, resulting in the contributed EMM-AU (Enhanced Multi-Modal Accident Video Understanding) dataset. Empirical results reveal that the integration of the EMM-AU dataset establishes state-of-the-art performance across both automated metrics and human evaluations, markedly advancing the domains of accident analysis and prevention. Project resources are available at https://an-answer-tree.github.io
中文: 本文提出AVD2框架,通过生成与自然语言描述匹配的事故视频来提升自动驾驶对交通事故的理解,贡献的EMM-AU数据集在事故分析和预防方面实现了最先进的性能。
English: This paper introduces AVD2, a framework that generates accident videos aligned with natural language descriptions to enhance autonomous driving's understanding of traffic accidents, contributing the EMM-AU dataset which achieves state-of-the-art performance in accident analysis and prevention.

Authors:Minghe Wang, Tobias Pfandzelter, Trever Schirmer, David Bermbach
Title: LLM4FaaS: No-Code Application Development using LLMs and FaaS
Abstract:
Large language models (LLMs) are powerful tools that can generate code from natural language descriptions. While this theoretically enables non-technical users to develop their own applications, they typically lack the expertise to execute, deploy, and operate generated code. This poses a barrier for such users to leverage the power of LLMs for application development. In this paper, we propose leveraging the high levels of abstraction of the Function-as-a-Service (FaaS) paradigm to handle code execution and operation for non-technical users. FaaS offers function deployment without handling the underlying infrastructure, enabling users to execute LLM-generated code without concern for its operation and without requiring any technical expertise. We propose LLM4FaaS, a novel no-code application development approach that combines LLMs and FaaS platforms to enable non-technical users to build and run their own applications using only natural language descriptions. Specifically, LLM4FaaS takes user prompts, uses LLMs to generate function code based on those prompts, and deploys these functions through a FaaS platform that handles the application's operation. LLM4FaaS also leverages the FaaS infrastructure abstractions to reduce the task complexity for the LLM, improving result accuracy. We evaluate LLM4FaaS with a proof-of-concept implementation based on GPT-4o and an open-source FaaS platform, using real prompts from non-technical users. Our evaluation based on these real user prompts demonstrates the feasibility of our approach and shows that LLM4FaaS can reliably build and deploy code in 71.47% of cases, up from 43.48% in a baseline without FaaS.
中文摘要:LLM4FaaS通过将大语言模型与函数即服务平台相结合,使非技术用户能够使用自然语言开发和部署应用程序,同时自动处理代码执行和基础设施管理。
English Summary: LLM4FaaS enables non-technical users to develop and deploy applications using natural language by combining large language models with Function-as-a-Service platforms, which handle code execution and infrastructure management automatically.

Authors:Chenyu Zhu, Yefeng Liu, Chenyang Lyu, Xue Yang, Guanhua Chen, Longyue Wang, Weihua Luo, Kaifu Zhang
Title: Towards Lightweight, Adaptive and Attribute-Aware Multi-Aspect Controllable Text Generation with Large Language Models
Abstract:
Multi-aspect controllable text generation aims to control text generation in attributes from multiple aspects, making it a complex but powerful task in natural language processing. Supervised fine-tuning methods are often employed for this task due to their simplicity and effectiveness. However, they still have some limitations: low rank adaptation (LoRA) only fine-tunes a few parameters and has suboptimal control effects, while full fine-tuning (FFT) requires significant computational resources and is susceptible to overfitting, particularly when data is limited. Moreover, existing works typically train multi-aspect controllable text generation models using only single-aspect annotated data, which results in discrepancies in data distribution; at the same time, accurately generating text with specific attributes is a challenge that requires strong attribute-aware capabilities. To address these limitations, we propose a lightweight, adaptive and attribute-aware framework for multi-aspect controllable text generation. Our framework can dynamically adjust model parameters according to different aspects of data to achieve controllable text generation, aiming to optimize performance across multiple aspects. Experimental results show that our framework outperforms other strong baselines, achieves state-of-the-art performance, adapts well to data discrepancies, and is more accurate in attribute perception.
Chinese: 本文提出了一种轻量级、自适应且属性感知的框架,通过动态调整模型参数实现多方面可控文本生成,在有效解决数据分布差异和增强属性感知精度的同时,取得了最先进的性能表现。
English: This paper introduces a lightweight, adaptive, and attribute-aware framework that dynamically adjusts model parameters for multi-aspect controllable text generation, achieving state-of-the-art performance by effectively addressing data distribution discrepancies and enhancing attribute perception accuracy.

Authors:Shuaiqun Pan, Yash J. Patel, Aneta Neumann, Frank Neumann, Thomas Bäck, Hao Wang
Title: Evolving Hard Maximum Cut Instances for Quantum Approximate Optimization Algorithms
Abstract:
Variational quantum algorithms, such as the Recursive Quantum Approximate Optimization Algorithm (RQAOA), have become increasingly popular, offering promising avenues for employing Noisy Intermediate-Scale Quantum devices to address challenging combinatorial optimization tasks like the maximum cut problem. In this study, we utilize an evolutionary algorithm equipped with a unique fitness function. This approach targets hard maximum cut instances within the latent space of a Graph Autoencoder, identifying those that pose significant challenges or are particularly tractable for RQAOA, in contrast to the classic Goemans and Williamson algorithm. Our findings not only delineate the distinct capabilities and limitations of each algorithm but also expand our understanding of RQAOA's operational limits. Furthermore, the diverse set of graphs we have generated serves as a crucial benchmarking asset, emphasizing the need for more advanced algorithms to tackle combinatorial optimization challenges. Additionally, our results pave the way for new avenues in graph generation research, offering exciting opportunities for future explorations.
中文: 本研究采用进化算法在图自编码器的潜在空间中识别对递归量子近似优化算法具有挑战性的最大割问题实例,揭示了该算法的运行边界,并为组合优化提供了重要的基准测试资源。
English: This study employs an evolutionary algorithm to identify challenging maximum cut instances for the Recursive Quantum Approximate Optimization Algorithm (RQAOA) within a Graph Autoencoder's latent space, revealing its operational boundaries and providing valuable benchmarking resources for combinatorial optimization.

Authors:Zeliang Zhang, Susan Liang, Daiki Shimada, Chenliang Xu
Title: Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives
Abstract:
While audio-visual learning equips models with a richer understanding of the real world by leveraging multiple sensory modalities, this integration also introduces new vulnerabilities to adversarial attacks. In this paper, we present a comprehensive study of the adversarial robustness of audio-visual models, considering both temporal and modality-specific vulnerabilities. We propose two powerful adversarial attacks: 1) a temporal invariance attack that exploits the inherent temporal redundancy across consecutive time segments and 2) a modality misalignment attack that introduces incongruence between the audio and visual modalities. These attacks are designed to thoroughly assess the robustness of audio-visual models against diverse threats. Furthermore, to defend against such attacks, we introduce a novel audio-visual adversarial training framework. This framework addresses key challenges in vanilla adversarial training by incorporating efficient adversarial perturbation crafting tailored to multi-modal data and an adversarial curriculum strategy. Extensive experiments in the Kinetics-Sounds dataset demonstrate that our proposed temporal and modality-based attacks in degrading model performance can achieve state-of-the-art performance, while our adversarial training defense largely improves the adversarial robustness as well as the adversarial training efficiency.
中文: 本研究探讨了视听模型对抗攻击的脆弱性,提出了时间和模态特定的攻击方法,并引入了一种新颖的对抗训练框架,显著提升了模型的鲁棒性和训练效率。
English: This study investigates the vulnerabilities of audio-visual models to adversarial attacks, introducing both temporal and modality-specific attacks while proposing a novel adversarial training framework that significantly enhances robustness and efficiency.

Authors:Zhiwen Ruan, Yixia Li, He Zhu, Longyue Wang, Weihua Luo, Kaifu Zhang, Yun Chen, Guanhua Chen
Title: LayAlign: Enhancing Multilingual Reasoning in Large Language Models via Layer-Wise Adaptive Fusion and Alignment Strategy
Abstract:
Despite being pretrained on multilingual corpora, large language models (LLMs) exhibit suboptimal performance on low-resource languages. Recent approaches have leveraged multilingual encoders alongside LLMs by introducing trainable parameters connecting the two models. However, these methods typically focus on the encoder's output, overlooking valuable information from other layers. We propose \aname (\mname), a framework that integrates representations from all encoder layers, coupled with the \attaname mechanism to enable layer-wise interaction between the LLM and the multilingual encoder. Extensive experiments on multilingual reasoning tasks, along with analyses of learned representations, show that our approach consistently outperforms existing baselines.
Chinese: 我们提出的\aname(\mname)框架通过整合编码器的所有层级并采用层级交互机制,有效提升了大型语言模型在低资源语言上的性能,在多语言推理任务中持续优于现有基准方法。
English: Our proposed framework, \aname (\mname), enhances LLM performance on low-resource languages by integrating all encoder layers with a layer-wise interaction mechanism, consistently outperforming existing methods in multilingual reasoning tasks.

Authors:Siyuan Huang, Zhiyuan Ma, Jintao Du, Changhua Meng, Weiqiang Wang, Jingwen Leng, Minyi Guo, Zhouhan Lin
Title: Gumbel Reranking: Differentiable End-to-End Reranker Optimization
Abstract:
RAG systems rely on rerankers to identify relevant documents. However, fine-tuning these models remains challenging due to the scarcity of annotated query-document pairs. Existing distillation-based approaches suffer from training-inference misalignment and fail to capture interdependencies among candidate documents. To overcome these limitations, we reframe the reranking process as an attention-mask problem and propose Gumbel Reranking, an end-to-end training framework for rerankers aimed at minimizing the training-inference gap. In our approach, reranker optimization is reformulated as learning a stochastic, document-wise Top-$k$ attention mask using the Gumbel Trick and Relaxed Top-$k$ Sampling. This formulation enables end-to-end optimization by minimizing the overall language loss. Experiments across various settings consistently demonstrate performance gains, including a 10.4\% improvement in recall on HotpotQA for distinguishing indirectly relevant documents.
Chinese: Gumbel 重排序是一种端到端框架,将重排序重构为注意力掩码问题,利用Gumbel技巧和松弛Top-k采样来减小训练与推理的差距,从而提升性能,如在HotpotQA上实现间接相关文档识别的召回率提高10.4%。
English: Gumbel Reranking is an end-to-end framework that reframes reranking as an attention-mask problem, using the Gumbel Trick and Relaxed Top-k Sampling to minimize training-inference gaps and improve performance, such as achieving a 10.4% recall boost on HotpotQA.

Authors:Hongye Cao, Yanming Wang, Sijia Jing, Ziyue Peng, Zhixin Bai, Zhe Cao, Meng Fang, Fan Feng, Boyan Wang, Jiaheng Liu, Tianpei Yang, Jing Huo, Yang Gao, Fanyu Meng, Xi Yang, Chao Deng, Junlan Feng
Title: SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks
Abstract:
With the rapid advancement of Large Language Models (LLMs), the safety of LLMs has been a critical concern requiring precise assessment. Current benchmarks primarily concentrate on single-turn dialogues or a single jailbreak attack method to assess the safety. Additionally, these benchmarks have not taken into account the LLM's capability of identifying and handling unsafe information in detail. To address these issues, we propose a fine-grained benchmark SafeDialBench for evaluating the safety of LLMs across various jailbreak attacks in multi-turn dialogues. Specifically, we design a two-tier hierarchical safety taxonomy that considers 6 safety dimensions and generates more than 4000 multi-turn dialogues in both Chinese and English under 22 dialogue scenarios. We employ 7 jailbreak attack strategies, such as reference attack and purpose reverse, to enhance the dataset quality for dialogue generation. Notably, we construct an innovative assessment framework of LLMs, measuring capabilities in detecting, and handling unsafe information and maintaining consistency when facing jailbreak attacks. Experimental results across 17 LLMs reveal that Yi-34B-Chat and GLM4-9B-Chat demonstrate superior safety performance, while Llama3.1-8B-Instruct and o3-mini exhibit safety vulnerabilities.
Chinese Summary: 本文提出SafeDialBench细粒度基准,通过多轮对话中的多种越狱攻击全面评估大语言模型安全性,实验发现17个测试模型间存在显著安全性能差异。
English Summary: This paper introduces SafeDialBench, a fine-grained benchmark designed to comprehensively evaluate the safety of Large Language Models in multi-turn dialogues across multiple jailbreak attacks, revealing significant performance variations among 17 tested models.

Authors:Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng
Title: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Abstract:
Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trainable Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.
中文: NSA提出了一种原生可训练的稀疏注意力机制,通过算法创新与硬件优化相结合,在保持或超越全注意力模型性能的同时,实现了高效的长上下文建模。
English: NSA introduces a natively trainable sparse attention mechanism that combines algorithmic and hardware optimizations to enable efficient long-context modeling while maintaining or exceeding full attention performance across various benchmarks.

Authors:Li Wang, Zheng Li, Xuhong Zhang, Shouling Ji, Shanqing Guo
Title: FaceSwapGuard: Safeguarding Facial Privacy from DeepFake Threats through Identity Obfuscation
Abstract:
DeepFakes pose a significant threat to our society. One representative DeepFake application is face-swapping, which replaces the identity in a facial image with that of a victim. Although existing methods partially mitigate these risks by degrading the quality of swapped images, they often fail to disrupt the identity transformation effectively. To fill this gap, we propose FaceSwapGuard (FSG), a novel black-box defense mechanism against deepfake face-swapping threats. Specifically, FSG introduces imperceptible perturbations to a user's facial image, disrupting the features extracted by identity encoders. When shared online, these perturbed images mislead face-swapping techniques, causing them to generate facial images with identities significantly different from the original user. Extensive experiments demonstrate the effectiveness of FSG against multiple face-swapping techniques, reducing the face match rate from 90\% (without defense) to below 10\%. Both qualitative and quantitative studies further confirm its ability to confuse human perception, highlighting its practical utility. Additionally, we investigate key factors that may influence FSG and evaluate its robustness against various adaptive adversaries.
中文摘要:FaceSwapGuard(FSG)是一种黑盒防御机制,通过对人脸图像施加难以察觉的干扰,有效破坏身份特征提取,将人脸替换成功率从90%降至10%以下,并能抵御各类适应性攻击。
English Summary: FaceSwapGuard (FSG) is a black-box defense mechanism that applies imperceptible perturbations to facial images, disrupting identity features and reducing face-swapping success rates from 90% to below 10% while maintaining robustness against adaptive attacks.

Authors:Ming Liu, Hao Chen, Jindong Wang, Wensheng Zhang
Title: On the robustness of multimodal language model towards distractions
Abstract:
Although vision-language models (VLMs) have achieved significant success in various applications such as visual question answering, their resilience to prompt variations remains an under-explored area. Understanding how distractions affect VLMs is crucial for improving their real-world applicability, as inputs could have noisy and irrelevant information in many practical scenarios. This paper aims to assess the robustness of VLMs against both visual and textual distractions in the context of science question answering. Built on the ScienceQA dataset, we developed a new benchmark that introduces distractions in both the visual and textual contexts to evaluate the reasoning capacity of VLMs amid these distractions. Our findings reveal that most-of-the-art VLMs, including GPT-4, are vulnerable to various types of distractions, experiencing noticeable degradation in reasoning capabilities when confronted with distractions. Notably, models such as InternVL2 demonstrate a higher degree of robustness to these distractions. We also found that models exhibit greater sensitivity to textual distractions than visual ones. Additionally, we explored various mitigation strategies, such as prompt engineering, to counteract the impact of distractions. While these strategies improved solution accuracy, our analysis shows that there remain significant opportunities for improvement.
中文: 本研究评估了视觉语言模型在科学问答中对视觉和文本干扰的鲁棒性,揭示了它们的脆弱性以及现有缓解策略的有限效果。
English: This study evaluates the robustness of vision-language models against visual and textual distractions in science question answering, revealing their vulnerability and the limited effectiveness of current mitigation strategies.

Authors:Shuhuai Ren, Shuming Ma, Xu Sun, Furu Wei
Title: Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
Abstract:
Next-Token Prediction (NTP) is a de facto approach for autoregressive (AR) video generation, but it suffers from suboptimal unidirectional dependencies and slow inference speed. In this work, we propose a semi-autoregressive (semi-AR) framework, called Next-Block Prediction (NBP), for video generation. By uniformly decomposing video content into equal-sized blocks (e.g., rows or frames), we shift the generation unit from individual tokens to blocks, allowing each token in the current block to simultaneously predict the corresponding token in the next block. Unlike traditional AR modeling, our framework employs bidirectional attention within each block, enabling tokens to capture more robust spatial dependencies. By predicting multiple tokens in parallel, NBP models significantly reduce the number of generation steps, leading to faster and more efficient inference. Our model achieves FVD scores of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla NTP model by an average of 4.4. Furthermore, thanks to the reduced number of inference steps, the NBP model generates 8.89 frames (128x128 resolution) per second, achieving an 11x speedup. We also explored model scales ranging from 700M to 3B parameters, observing significant improvements in generation quality, with FVD scores dropping from 103.3 to 55.3 on UCF101 and from 25.5 to 19.5 on K600, demonstrating the scalability of our approach.
中文: 提出的下一块预测框架通过将生成单元从单个标记转为块级预测并采用双向注意力机制,在视频生成质量上超越传统方法的同时实现了11倍的加速效果。
English: The proposed Next-Block Prediction (NBP) framework improves video generation by shifting from token-level to block-level prediction with bidirectional attention, achieving superior quality and an 11x speedup over traditional methods.

Authors:Mingkai Xu, Yongpeng Wu, Yuxuan Shi, Xiang-Gen Xia, Wenjun Zhang, Ping Zhang
Title: Learnable Residual-Based Latent Denoising in Semantic Communication
Abstract:
A latent denoising semantic communication (SemCom) framework is proposed for robust image transmission over noisy channels. By incorporating a learnable latent denoiser into the receiver, the received signals are preprocessed to effectively remove the channel noise and recover the semantic information, thereby enhancing the quality of the decoded images. Specifically, a latent denoising mapping is established by an iterative residual learning approach to improve the denoising efficiency while ensuring stable performance. Moreover, channel signal-to-noise ratio (SNR) is utilized to estimate and predict the latent similarity score (SS) for conditional denoising, where the number of denoising steps is adapted based on the predicted SS sequence, further reducing the communication latency. Finally, simulations demonstrate that the proposed framework can effectively and efficiently remove the channel noise at various levels and reconstruct visual-appealing images.
中文: 提出了一种潜在去噪语义通信框架,通过可学习的去噪器和基于信道条件的自适应去噪步骤,有效消除噪声并在各种噪声信道上重建视觉质量良好的图像。
English: A latent denoising semantic communication framework is proposed that uses a learnable denoiser and adaptive denoising steps based on channel conditions to effectively remove noise and reconstruct high-quality images over noisy channels.

Authors:Xin Zhou, Martin Weyssow, Ratnadira Widyasari, Ting Zhang, Junda He, Yunbo Lyu, Jianming Chang, Beiqi Zhang, Dan Huang, David Lo
Title: LessLeak-Bench: A First Investigation of Data Leakage in LLMs Across 83 Software Engineering Benchmarks
Abstract:
Large Language Models (LLMs) are widely utilized in software engineering (SE) tasks, such as code generation and automated program repair. However, their reliance on extensive and often undisclosed pre-training datasets raises significant concerns about data leakage, where the evaluation benchmark data is unintentionally ``seen'' by LLMs during the model's construction phase. The data leakage issue could largely undermine the validity of LLM-based research and evaluations. Despite the increasing use of LLMs in the SE community, there is no comprehensive study that assesses the extent of data leakage in SE benchmarks for LLMs yet. To address this gap, this paper presents the first large-scale analysis of data leakage in 83 SE benchmarks concerning LLMs. Our results show that in general, data leakage in SE benchmarks is minimal, with average leakage ratios of only 4.8\%, 2.8\%, and 0.7\% for Python, Java, and C/C++ benchmarks, respectively. However, some benchmarks exhibit relatively higher leakage ratios, which raises concerns about their bias in evaluation. For instance, QuixBugs and BigCloneBench have leakage ratios of 100.0\% and 55.7\%, respectively. Furthermore, we observe that data leakage has a substantial impact on LLM evaluation. We also identify key causes of high data leakage, such as the direct inclusion of benchmark data in pre-training datasets and the use of coding platforms like LeetCode for benchmark construction. To address the data leakage, we introduce \textbf{LessLeak-Bench}, a new benchmark that removes leaked samples from the 83 SE benchmarks, enabling more reliable LLM evaluations in future research. Our study enhances the understanding of data leakage in SE benchmarks and provides valuable insights for future research involving LLMs in SE.
中文: 本研究首次对83个软件工程基准中的LLM数据泄露进行大规模分析,发现平均泄露率较低但存在特定高泄露基准,并推出LessLeak-Bench基准以提升未来评估的可靠性。
English: This study conducts the first large-scale analysis of data leakage in 83 software engineering benchmarks for LLMs, revealing minimal average leakage but identifying specific high-leakage benchmarks and introducing LessLeak-Bench to ensure more reliable evaluations.

Authors:Felix Leeb, Zhijing Jin, Bernhard Schölkopf
Title: Causality can systematically address the monsters under the bench(marks)
Abstract:
Effective and reliable evaluation is essential for advancing empirical machine learning. However, the increasing accessibility of generalist models and the progress towards ever more complex, high-level tasks make systematic evaluation more challenging. Benchmarks are plagued by various biases, artifacts, or leakage, while models may behave unreliably due to poorly explored failure modes. Haphazard treatments and inconsistent formulations of such "monsters" can contribute to a duplication of efforts, a lack of trust in results, and unsupported inferences. In this position paper, we argue causality offers an ideal framework to systematically address these challenges. By making causal assumptions in an approach explicit, we can faithfully model phenomena, formulate testable hypotheses with explanatory power, and leverage principled tools for analysis. To make causal model design more accessible, we identify several useful Common Abstract Topologies (CATs) in causal graphs which help gain insight into the reasoning abilities in large language models. Through a series of case studies, we demonstrate how the precise yet pragmatic language of causality clarifies the strengths and limitations of a method and inspires new approaches for systematic progress.
中文: 本立场文件主张将因果关系作为系统性框架来解决机器学习中的评估难题,通过案例研究提出通用抽象拓扑结构来分析大语言模型的推理能力。
English: This position paper advocates for using causality as a systematic framework to address evaluation challenges in machine learning, proposing Common Abstract Topologies to analyze reasoning in large language models through case studies.

Authors:Miaomiao Li, Hao Chen, Yang Wang, Tingyuan Zhu, Weijia Zhang, Kaijie Zhu, Kam-Fai Wong, Jindong Wang
Title: Understanding and Mitigating the Bias Inheritance in LLM-based Data Augmentation on Downstream Tasks
Abstract:
Generating synthetic datasets via large language models (LLMs) themselves has emerged as a promising approach to improve LLM performance. However, LLMs inherently reflect biases present in their training data, leading to a critical challenge: when these models generate synthetic data for training, they may propagate and amplify their inherent biases that can significantly impact model fairness and robustness on downstream tasks--a phenomenon we term bias inheritance. This work presents the first systematic investigation in understanding, analyzing, and mitigating bias inheritance. We study this problem by fine-tuning LLMs with a combined dataset consisting of original and LLM-augmented data, where bias ratio represents the proportion of augmented data. Through systematic experiments across 10 classification and generation tasks, we analyze how 6 different types of biases manifest at varying bias ratios. Our results reveal that bias inheritance has nuanced effects on downstream tasks, influencing both classification tasks and generation tasks differently. Then, our analysis identifies three key misalignment factors: misalignment of values, group data, and data distributions. Based on these insights, we propose three mitigation strategies: token-based, mask-based, and loss-based approaches. Experiments demonstrate that these strategies also work differently on various tasks and bias, indicating the substantial challenges to fully mitigate bias inheritance. We hope this work can provide valuable insights to the research of LLM data augmentation.
中文摘要:本研究首次系统探讨了大语言模型中的偏见继承现象,即合成数据生成会传播并放大模型固有偏见,并提出了三种缓解策略,这些策略在不同任务中表现出差异化效果。
English Summary: This study systematically investigates bias inheritance in large language models, where synthetic data generation propagates and amplifies inherent biases, and proposes three mitigation strategies that show varying effectiveness across different tasks.

Authors:Zhuoxun Yang, Sheng Di, Longtao Zhang, Ruoyu Li, Ximiao Li, Jiajun Huang, Jinyang Liu, Franck Cappello, Kai Zhao
Title: IPComp: Interpolation Based Progressive Lossy Compression for Scientific Applications
Abstract:
Compression is a crucial solution for data reduction in modern scientific applications due to the exponential growth of data from simulations, experiments, and observations. Compression with progressive retrieval capability allows users to access coarse approximations of data quickly and then incrementally refine these approximations to higher fidelity. Existing progressive compression solutions suffer from low reduction ratios or high operation costs, effectively undermining the approach's benefits. In this paper, we propose the first-ever interpolation-based progressive lossy compression solution that has both high reduction ratios and low operation costs. The interpolation-based algorithm has been verified as one of the best for scientific data reduction, but previously no effort exists to make it support progressive retrieval. Our contributions are three-fold: (1) We thoroughly analyze the error characteristics of the interpolation algorithm and propose our solution IPComp with multi-level bitplane and predictive coding. (2) We derive optimized strategies toward minimum data retrieval under different fidelity levels indicated by users through error bounds and bitrates. (3) We evaluate the proposed solution using six real-world datasets from four diverse domains. Experimental results demonstrate our solution archives up to $487\%$ higher compression ratios and $698\%$ faster speed than other state-of-the-art progressive compressors, and reduces the data volume for retrieval by up to $83\%$ compared to baselines under the same error bound, and reduces the error by up to $99\%$ under the same bitrate.
Chinese: 本文提出了首个基于插值的渐进式有损压缩方法IPComp,该方案同时实现高压缩比与低运算成本,在压缩效率和速度上显著优于现有技术。
English: This paper introduces IPComp, the first interpolation-based progressive lossy compression method that achieves both high compression ratios and low operational costs, significantly outperforming existing solutions in efficiency and data reduction.

Authors:Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, Bhiksha Raj
Title: Masked Autoencoders Are Effective Tokenizers for Diffusion Models
Abstract:
Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the latent space from tokenizer for better learning and generation of diffusion models remain under-explored. Theoretically and empirically, we find that improved generation quality is closely tied to the latent distributions with better structure, such as the ones with fewer Gaussian Mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) leveraging mask modeling to learn semantically rich latent space while maintaining reconstruction fidelity. Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary, and a discriminative latent space from AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76x faster training and 31x higher inference throughput for 512x512 generation. Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models. Code and trained models are released.
Chinese: 最新研究表明,具有更少高斯混合模式和更强判别性特征的潜在空间结构对扩散模型性能至关重要,由此开发的MAETok自编码器在保持重建保真度的同时,实现了最先进的图像生成效果,大幅提升了训练和推理效率。
English: Recent research reveals that a well-structured latent space with fewer Gaussian Mixture modes and more discriminative features significantly enhances diffusion model performance, leading to the development of MAETok, an autoencoder that achieves state-of-the-art image generation with improved efficiency and quality.

Authors:Ming Liu, Hao Chen, Jindong Wang, Liwen Wang, Bhiksha Raj Ramakrishnan, Wensheng Zhang
Title: On Fairness of Unified Multimodal Large Language Model for Image Generation
Abstract:
Unified multimodal large language models (U-MLLMs) have demonstrated impressive performance in visual understanding and generation in an end-to-end pipeline. Compared with generation-only models (e.g., Stable Diffusion), U-MLLMs may raise new questions about bias in their outputs, which can be affected by their unified capabilities. This gap is particularly concerning given the under-explored risk of propagating harmful stereotypes. In this paper, we benchmark the latest U-MLLMs and find that most exhibit significant demographic biases, such as gender and race bias. To better understand and mitigate this issue, we propose a locate-then-fix strategy, where we audit and show how the individual model component is affected by bias. Our analysis shows that bias originates primarily from the language model. More interestingly, we observe a "partial alignment" phenomenon in U-MLLMs, where understanding bias appears minimal, but generation bias remains substantial. Thus, we propose a novel balanced preference model to balance the demographic distribution with synthetic data. Experiments demonstrate that our approach reduces demographic bias while preserving semantic fidelity. We hope our findings underscore the need for more holistic interpretation and debiasing strategies of U-MLLMs in the future.
中文: 统一多模态大语言模型存在显著的人口统计偏见,主要源于其语言模块,本文提出一种新颖的平衡偏好模型,能在保持语义保真度的同时有效减轻此类偏见。
English: Unified multimodal large language models (U-MLLMs) exhibit significant demographic biases, primarily originating from their language components, and a novel balanced preference model is proposed to mitigate these biases while maintaining semantic fidelity.

Authors:Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, Xin Liu
Title: ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs
Abstract:
Scaling long-context ability is essential for Large Language Models (LLMs). To amortize the memory consumption across multiple devices in long-context training, inter-data partitioning (a.k.a. Data Parallelism) and intra-data partitioning (a.k.a. Context Parallelism) are commonly used. Current training frameworks predominantly treat the two techniques as orthogonal, and establish static communication groups to organize the devices as a static mesh (e.g., a 2D mesh). However, the sequences for LLM training typically vary in lengths, no matter for texts, multi-modalities or reinforcement learning. The mismatch between data heterogeneity and static mesh causes redundant communication and imbalanced computation, degrading the training efficiency. In this work, we introduce ByteScale, an efficient, flexible, and scalable LLM training framework for large-scale mixed training of long and short sequences. The core of ByteScale is a novel parallelism strategy, namely Hybrid Data Parallelism (HDP), which unifies the inter- and intra-data partitioning with a dynamic mesh design. In particular, we build a communication optimizer, which eliminates the redundant communication for short sequences by data-aware sharding and dynamic communication, and further compresses the communication cost for long sequences by selective offloading. Besides, we also develop a balance scheduler to mitigate the imbalanced computation by parallelism-aware data assignment. We evaluate ByteScale with the model sizes ranging from 7B to 141B, context lengths from 256K to 2048K, on a production cluster with more than 12,000 GPUs. Experiment results show that ByteScale outperforms the state-of-the-art training system by up to 7.89x.
Chinese: ByteScale提出混合数据并行策略,通过动态网格设计和优化通信,有效处理长短序列混合训练,在万级GPU集群上实现高达7.89倍的训练加速。
English: ByteScale introduces Hybrid Data Parallelism, a dynamic framework that optimizes communication and balances computation for efficient large-scale LLM training with varying sequence lengths, achieving up to 7.89x speedup.

Authors:Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, Xin Liu
Title: ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs
Abstract:
Scaling long-context ability is essential for Large Language Models (LLMs). To amortize the memory consumption across multiple devices in long-context training, inter-data partitioning (a.k.a. Data Parallelism) and intra-data partitioning (a.k.a. Context Parallelism) are commonly used. Current training frameworks predominantly treat the two techniques as orthogonal, and establish static communication groups to organize the devices as a static mesh (e.g., a 2D mesh). However, the sequences for LLM training typically vary in lengths, no matter for texts, multi-modalities or reinforcement learning. The mismatch between data heterogeneity and static mesh causes redundant communication and imbalanced computation, degrading the training efficiency. In this work, we introduce ByteScale, an efficient, flexible, and scalable LLM training framework for large-scale mixed training of long and short sequences. The core of ByteScale is a novel parallelism strategy, namely Hybrid Data Parallelism (HDP), which unifies the inter- and intra-data partitioning with a dynamic mesh design. In particular, we build a communication optimizer, which eliminates the redundant communication for short sequences by data-aware sharding and dynamic communication, and further compresses the communication cost for long sequences by selective offloading. Besides, we also develop a balance scheduler to mitigate the imbalanced computation by parallelism-aware data assignment. We evaluate ByteScale with the model sizes ranging from 7B to 141B, context lengths from 256K to 2048K, on a production cluster with more than 12,000 GPUs. Experiment results show that ByteScale outperforms the state-of-the-art training system by up to 7.89x.
Chinese: ByteScale提出混合数据并行策略,通过动态网格设计和优化通信,有效处理长短序列混合训练,在万级GPU集群上实现高达7.89倍的训练加速。
English: ByteScale introduces Hybrid Data Parallelism, a dynamic framework that optimizes communication and balances computation for efficient large-scale LLM training with varying sequence lengths, achieving up to 7.89x speedup.

Authors:Theofanis P. Raptis, Andrea Passarella, Marco Conti
Title: Distributed Data Access in Industrial Edge Networks
Abstract:
Wireless edge networks in smart industrial environments increasingly operate using advanced sensors and autonomous machines interacting with each other and generating huge amounts of data. Those huge amounts of data are bound to make data management (e.g., for processing, storing, computing) a big challenge. Current data management approaches, relying primarily on centralized data storage, might not be able to cope with the scalability and real time requirements of Industry 4.0 environments, while distributed solutions are increasingly being explored. In this paper, we introduce the problem of distributed data access in multi-hop wireless industrial edge deployments, whereby a set of consumer nodes needs to access data stored in a set of data cache nodes, satisfying the industrial data access delay requirements and at the same time maximizing the network lifetime. We prove that the introduced problem is computationally intractable and, after formulating the objective function, we design a two-step algorithm in order to address it. We use an open testbed with real devices for conducting an experimental investigation on the performance of the algorithm. Then, we provide two online improvements, so that the data distribution can dynamically change before the first node in the network runs out of energy. We compare the performance of the methods via simulations for different numbers of network nodes and data consumers, and we show significant lifetime prolongation and increased energy efficiency when employing the method which is using only decentralized low-power wireless communication instead of the method which is using also centralized local area wireless communication.
中文摘要:本文针对工业无线边缘网络中的分布式数据访问难题,提出一种两阶段算法,在满足延迟要求的同时最大化网络寿命,并通过实验和仿真验证了该纯分布式低功耗方案相比混合通信方案具有显著更优的能耗表现。
English Summary: This paper addresses the challenge of distributed data access in industrial wireless edge networks by proposing a two-step algorithm that maximizes network lifetime while meeting delay requirements, demonstrating through experiments and simulations its superior energy efficiency over centralized approaches.

Authors:Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, Bin Cui
Title: Training-free and Adaptive Sparse Attention for Efficient Long Video Generation
Abstract:
Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency, primarily due to the computational demands of attention mechanisms. For instance, generating an 8-second 720p video (110K tokens) with HunyuanVideo takes about 600 PFLOPs, with around 500 PFLOPs consumed by attention computations. To address this issue, we propose AdaSpa, the first Dynamic Pattern and Online Precise Search sparse attention method. Firstly, to realize the Dynamic Pattern, we introduce a blockified pattern to efficiently capture the hierarchical sparsity inherent in DiTs. This is based on our observation that sparse characteristics of DiTs exhibit hierarchical and blockified structures between and within different modalities. This blockified approach significantly reduces the complexity of attention computation while maintaining high fidelity in the generated videos. Secondly, to enable Online Precise Search, we propose the Fused LSE-Cached Search with Head-adaptive Hierarchical Block Sparse Attention. This method is motivated by our finding that DiTs' sparse pattern and LSE vary w.r.t. inputs, layers, and heads, but remain invariant across denoising steps. By leveraging this invariance across denoising steps, it adapts to the dynamic nature of DiTs and allows for precise, real-time identification of sparse indices with minimal overhead. AdaSpa is implemented as an adaptive, plug-and-play solution and can be integrated seamlessly with existing DiTs, requiring neither additional fine-tuning nor a dataset-dependent profiling. Extensive experiments validate that AdaSpa delivers substantial acceleration across various models while preserving video quality, establishing itself as a robust and scalable approach to efficient video generation.
中文摘要:AdaSpa提出了一种动态稀疏注意力方法,通过利用层次化稀疏模式和在线搜索技术,显著降低了扩散变换器在视频生成中的计算延迟,在保持视频质量的同时实现了高效加速。
English Summary: AdaSpa introduces a dynamic sparse attention method that reduces computational latency in Diffusion Transformers for video generation by leveraging hierarchical sparsity patterns and online search techniques, achieving significant acceleration without compromising video quality.

Authors:Yuqian Chen, Leo Zekelman, Yui Lo, Suheyla Cetin-Karayumak, Tengfei Xue, Yogesh Rathi, Nikos Makris, Fan Zhang, Weidong Cai, Lauren J. O'Donnell
Title: TractCloud-FOV: Deep Learning-based Robust Tractography Parcellation in Diffusion MRI with Incomplete Field of View
Abstract:
Tractography parcellation classifies streamlines reconstructed from diffusion MRI into anatomically defined fiber tracts for clinical and research applications. However, clinical scans often have incomplete fields of view (FOV) where brain regions are partially imaged, leading to partial or truncated fiber tracts. To address this challenge, we introduce TractCloud-FOV, a deep learning framework that robustly parcellates tractography under conditions of incomplete FOV. We propose a novel training strategy, FOV-Cut Augmentation (FOV-CA), in which we synthetically cut tractograms to simulate a spectrum of real-world inferior FOV cutoff scenarios. This data augmentation approach enriches the training set with realistic truncated streamlines, enabling the model to achieve superior generalization. We evaluate the proposed TractCloud-FOV on both synthetically cut tractography and two real-life datasets with incomplete FOV. TractCloud-FOV significantly outperforms several state-of-the-art methods on all testing datasets in terms of streamline classification accuracy, generalization ability, tract anatomical depiction, and computational efficiency. Overall, TractCloud-FOV achieves efficient and consistent tractography parcellation in diffusion MRI with incomplete FOV.
Chinese: TractCloud-FOV 是一种深度学习框架,通过视野切割增强技术,有效处理视野不完整的扩散MRI纤维追踪分割,在准确性和效率上均优于现有方法。
English: TractCloud-FOV is a deep learning framework that effectively parcellates tractography in diffusion MRI with incomplete fields of view, using FOV-Cut Augmentation to enhance generalization and outperform existing methods in accuracy and efficiency.

Authors:Shaheer Mohamed, Tharindu Fernando, Sridha Sridharan, Peyman Moghadam, Clinton Fookes
Title: Spectral-Enhanced Transformers: Leveraging Large-Scale Pretrained Models for Hyperspectral Object Tracking
Abstract:
Hyperspectral object tracking using snapshot mosaic cameras is emerging as it provides enhanced spectral information alongside spatial data, contributing to a more comprehensive understanding of material properties. Using transformers, which have consistently outperformed convolutional neural networks (CNNs) in learning better feature representations, would be expected to be effective for Hyperspectral object tracking. However, training large transformers necessitates extensive datasets and prolonged training periods. This is particularly critical for complex tasks like object tracking, and the scarcity of large datasets in the hyperspectral domain acts as a bottleneck in achieving the full potential of powerful transformer models. This paper proposes an effective methodology that adapts large pretrained transformer-based foundation models for hyperspectral object tracking. We propose an adaptive, learnable spatial-spectral token fusion module that can be extended to any transformer-based backbone for learning inherent spatial-spectral features in hyperspectral data. Furthermore, our model incorporates a cross-modality training pipeline that facilitates effective learning across hyperspectral datasets collected with different sensor modalities. This enables the extraction of complementary knowledge from additional modalities, whether or not they are present during testing. Our proposed model also achieves superior performance with minimal training iterations.
中文摘要:本文提出了一种利用预训练变换器的自适应方法,通过空间-光谱令牌融合模块和跨模态训练解决高光谱目标跟踪中的数据稀缺问题,以最少的训练迭代实现卓越性能。
English Summary: This paper introduces an adaptive method using pretrained transformers with a spatial-spectral token fusion module and cross-modality training to overcome data scarcity in hyperspectral object tracking, achieving high performance with minimal training.

Authors:Jake Poznanski, Aman Rangapur, Jon Borchardt, Jason Dunkelberger, Regan Huff, Daniel Lin, Aman Rangapur, Christopher Wilhelm, Kyle Lo, Luca Soldaini
Title: olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Abstract:
PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when attempting to extract and faithfully represent the underlying content for language model use. Traditional open source tools often produce lower quality extractions compared to vision language models (VLMs), but reliance on the best VLMs can be prohibitively costly (e.g., over 6,240 USD per million PDF pages for GPT-4o) or infeasible if the PDFs cannot be sent to proprietary APIs. We present olmOCR, an open-source toolkit for processing PDFs into clean, linearized plain text in natural reading order while preserving structured content like sections, tables, lists, equations, and more. Our toolkit runs a fine-tuned 7B vision language model (VLM) trained on olmOCR-mix-0225, a sample of 260,000 pages from over 100,000 crawled PDFs with diverse properties, including graphics, handwritten text and poor quality scans. olmOCR is optimized for large-scale batch processing, able to scale flexibly to different hardware setups and can convert a million PDF pages for only 176 USD. To aid comparison with existing systems, we also introduce olmOCR-Bench, a curated set of 1,400 PDFs capturing many content types that remain challenging even for the best tools and VLMs, including formulas, tables, tiny fonts, old scans, and more. We find olmOCR outperforms even top VLMs including GPT-4o, Gemini Flash 2 and Qwen-2.5-VL. We openly release all components of olmOCR: our fine-tuned VLM model, training code and data, an efficient inference pipeline that supports vLLM and SGLang backends, and benchmark olmOCR-Bench.
中文: PDF文档虽为语言模型提供海量训练数据,但其多样格式给内容提取带来挑战;开源工具olmOCR通过微调视觉语言模型,以低成本将PDF高效转换为结构化文本,其性能甚至超越顶尖商业模型。
English: PDF documents offer vast training data for language models but present extraction challenges due to diverse formats, which the open-source olmOCR toolkit addresses by using a fine-tuned vision language model to efficiently convert PDFs into structured text at low cost while outperforming leading proprietary models.

Authors:Samuele Sabella, Chiara Boldrini, Lorenzo Valerio, Andrea Passarella, Marco Conti
Title: The Built-In Robustness of Decentralized Federated Averaging to Bad Data
Abstract:
Decentralized federated learning (DFL) enables devices to collaboratively train models over complex network topologies without relying on a central controller. In this setting, local data remains private, but its quality and quantity can vary significantly across nodes. The extent to which a fully decentralized system is vulnerable to poor-quality or corrupted data remains unclear, but several factors could contribute to potential risks. Without a central authority, there can be no unified mechanism to detect or correct errors, and each node operates with a localized view of the data distribution, making it difficult for the node to assess whether its perspective aligns with the true distribution. Moreover, models trained on low-quality data can propagate through the network, amplifying errors. To explore the impact of low-quality data on DFL, we simulate two scenarios with degraded data quality -- one where the corrupted data is evenly distributed in a subset of nodes and one where it is concentrated on a single node -- using a decentralized implementation of FedAvg. Our results reveal that averaging-based decentralized learning is remarkably robust to localized bad data, even when the corrupted data resides in the most influential nodes of the network. Counterintuitively, this robustness is further enhanced when the corrupted data is concentrated on a single node, regardless of its centrality in the communication network topology. This phenomenon is explained by the averaging process, which ensures that no single node -- however central -- can disproportionately influence the overall learning process.
中文: 去中心化联邦学习对局部低质量数据表现出强大鲁棒性,其平均机制能防止任何单一节点过度影响模型,即使损坏数据集中在关键节点上也是如此。
English: Decentralized federated learning demonstrates strong resilience to localized poor-quality data, as the averaging mechanism prevents any single node from disproportionately affecting the model, even when corrupted data is concentrated on influential nodes.

Authors:Ruixuan Huang, Xunguang Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
Title: GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods
Abstract:
Despite the growing interest in jailbreak methods as an effective red-teaming tool for building safe and responsible large language models (LLMs), flawed evaluation system designs have led to significant discrepancies in their effectiveness assessments. We conduct a systematic measurement study based on 37 jailbreak studies since 2022, focusing on both the methods and the evaluation systems they employ. We find that existing evaluation systems lack case-specific criteria, resulting in misleading conclusions about their effectiveness and safety implications. This paper advocates a shift to a more nuanced, case-by-case evaluation paradigm. We introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset, detailed case-by-case evaluation guidelines and an evaluation system integrated with these guidelines -- GuidedEval. Experiments demonstrate that GuidedBench offers more accurate measurements of jailbreak performance, enabling meaningful comparisons across methods and uncovering new insights overlooked in previous evaluations. GuidedEval reduces inter-evaluator variance by at least 76.03\%. Furthermore, we observe that incorporating guidelines can enhance the effectiveness of jailbreak methods themselves, offering new insights into both attack strategies and evaluation paradigms.
中文摘要:本研究批评了现有大语言模型越狱评估系统的缺陷,提出采用个案化指南的GuidedBench新基准,显著提升评估准确性并降低评估者差异,同时发现指南还能增强越狱方法本身的有效性。
English Summary: This study critiques current jailbreak evaluation systems for LLMs, proposing GuidedBench with case-specific guidelines to improve accuracy and reduce evaluator variance, while also revealing that guidelines can enhance jailbreak effectiveness.

Authors:Hang Jiang, Tal August, Luca Soldaini, Kyle Lo, Maria Antoniak
Title: Automatic Detection of Research Values from Scientific Abstracts Across Computer Science Subfields
Abstract:
The field of Computer science (CS) has rapidly evolved over the past few decades, providing computational tools and methodologies to various fields and forming new interdisciplinary communities. This growth in CS has significantly impacted institutional practices and relevant research communities. Therefore, it is crucial to explore what specific research values, known as basic and fundamental beliefs that guide or motivate research attitudes or actions, CS-related research communities promote. Prior research has manually analyzed research values from a small sample of machine learning papers. No prior work has studied the automatic detection of research values in CS from large-scale scientific texts across different research subfields. This paper introduces a detailed annotation scheme featuring ten research values that guide CS-related research. Based on the scheme, we build value classifiers to scale up the analysis and present a systematic study over 226,600 paper abstracts from 32 CS-related subfields and 86 popular publishing venues over ten years.
中文摘要:本文提出了一套包含十个研究价值的详细标注方案,通过构建价值分类器对来自32个计算机子领域的22.66万篇论文摘要进行了十年期系统分析,填补了大规模自动检测研究价值的空白。
English Summary: This paper introduces a comprehensive annotation framework for ten research values in computer science and develops classifiers to analyze their prevalence across 226,600 abstracts from diverse subfields over a decade, addressing the gap in automated large-scale value detection.

Authors:Theofanis P. Raptis, Andrea Passarella, Marco Conti
Title: Energy Efficient Network Path Reconfiguration for Industrial Field Data
Abstract:
Energy efficiency and reliability are vital design requirements of recent industrial networking solutions. Increased energy consumption, poor data access rates and unpredictable end-to-end data access latencies are catastrophic when transferring high volumes of critical industrial data in strict temporal deadlines. These requirements might become impossible to meet later on, due to node failures, or excessive degradation of the performance of wireless links. In this paper, we focus on maintaining the network functionality required by the industrial, best effort, low-latency applications after such events, by sacrificing latency guarantees to improve energy consumption and reliability. We avoid continuously recomputing the network configuration centrally, by designing an energy efficient, local and distributed path reconfiguration method. Specifically, given the operational parameters required by the applications, our method locally reconfigures the data distribution paths, when a network node fails. Additionally, our method also regulates the return to an operational state of nodes that have been offline in the past. We compare the performance of our method through simulations to the performance of other state of the art protocols and we demonstrate performance gains in terms of energy consumption, data delivery success rate, and in some cases, end-to-end data access latency. We conclude by providing some emerging key insights which can lead to further performance improvements.
中文摘要:本文提出一种分布式路径重配置方法,通过牺牲延迟保证来提升能耗效率与可靠性,从而维持工业网络的运行功能,仿真结果表明该方法在能耗和数据传输成功率方面优于现有协议。
English Summary: This paper introduces a distributed path reconfiguration method that maintains industrial network functionality by prioritizing energy efficiency and reliability over latency guarantees, demonstrating performance gains through simulations compared to existing protocols.

Authors:Bin Feng, Shulan Ruan, Mingzheng Yang, Dongxuan Han, Huijie Liu, Kai Zhang, Qi Liu
Title: SentiFormer: Metadata Enhanced Transformer for Image Sentiment Analysis
Abstract:
As more and more internet users post images online to express their daily emotions, image sentiment analysis has attracted increasing attention. Recently, researchers generally tend to design different neural networks to extract visual features from images for sentiment analysis. Despite the significant progress, metadata, the data (e.g., text descriptions and keyword tags) for describing the image, has not been sufficiently explored in this task. In this paper, we propose a novel Metadata Enhanced Transformer for sentiment analysis (SentiFormer) to fuse multiple metadata and the corresponding image into a unified framework. Specifically, we first obtain multiple metadata of the image and unify the representations of diverse data. To adaptively learn the appropriate weights for each metadata, we then design an adaptive relevance learning module to highlight more effective information while suppressing weaker ones. Moreover, we further develop a cross-modal fusion module to fuse the adaptively learned representations and make the final prediction. Extensive experiments on three publicly available datasets demonstrate the superiority and rationality of our proposed method.
中文: 本文提出SentiFormer,一种通过自适应相关性学习和跨模态融合将多种元数据与图像整合的元数据增强Transformer,在公开数据集上验证了其在情感分析中的优越性。
English: This paper introduces SentiFormer, a Metadata Enhanced Transformer that integrates multiple metadata with images using adaptive relevance learning and cross-modal fusion to improve sentiment analysis, demonstrating superior performance on public datasets.

Authors:Loreto Pescosolido, Andrea Passarella, Marco Conti
Title: Optimal Popularity-based Transmission Range Selection for D2D-supported Content Delivery
Abstract:
Considering device-to-device (D2D) wireless links as a virtual extension of 5G (and beyond) cellular networks to deliver popular contents has been proposed as an interesting approach to reduce energy consumption, congestion, and bandwidth usage at the network edge. In the scenario of multiple users in a region independently requesting some popular content, there is a major potential for energy consumption reduction exploiting D2D communications. In this scenario, we consider the problem of selecting the maximum allowed transmission range (or equivalently the maximum transmit power) for the D2D links that support the content delivery process. We show that, for a given maximum allowed D2D energy consumption, a considerable reduction of the cellular infrastructure energy consumption can be achieved by selecting the maximum D2D transmission range as a function of content class parameters such as popularity and delay-tolerance, compared to a uniform selection across different content classes. Specifically, we provide an analytical model that can be used to estimate the energy consumption (for small delay tolerance) and thus to set the optimal transmission range. We validate the model via simulations and study the energy gain that our approach allows to obtain. Our results show that the proposed approach to the maximum D2D transmission range selection allows a reduction of the overall energy consumption in the range of 30% to 55%, compared to a selection of the maximum D2D transmission range oblivious to popularity and delay tolerance.
中文: 研究表明,基于内容流行度和延迟容忍度优化D2D最大传输范围,相比统一范围选择可降低30%至55%的总能耗,该结论通过仿真验证的分析模型得出。
English: The study demonstrates that optimizing the maximum D2D transmission range based on content popularity and delay tolerance can reduce overall energy consumption by 30% to 55% compared to uniform range selection, using an analytical model validated through simulations.

Authors:Yicheng Lang, Kehan Guo, Yue Huang, Yujun Zhou, Haomin Zhuang, Tianyu Yang, Yao Su, Xiangliang Zhang
Title: Beyond Single-Value Metrics: Evaluating and Enhancing LLM Unlearning with Cognitive Diagnosis
Abstract:
Due to the widespread use of LLMs and the rising critical ethical and safety concerns, LLM unlearning methods have been developed to remove harmful knowledge and undesirable capabilities. In this context, evaluations are mostly based on single-value metrics such as QA accuracy. However, these metrics often fail to capture the nuanced retention of harmful knowledge components, making it difficult to assess the true effectiveness of unlearning. To address this issue, we propose UNCD (UNlearning evaluation via Cognitive Diagnosis), a novel framework that leverages Cognitive Diagnosis Modeling for fine-grained evaluation of LLM unlearning. Our dedicated benchmark, UNCD-Cyber, provides a detailed assessment of the removal of dangerous capabilities. Moreover, we introduce UNCD-Agent, which refines unlearning by diagnosing knowledge remnants and generating targeted unlearning data. Extensive experiments across eight unlearning methods and two base models demonstrate that UNCD not only enhances evaluation but also effectively facilitates the removal of harmful LLM abilities.
Chinese: 针对大模型遗忘评估中单一指标无法捕捉有害知识残留的问题,提出UNCD框架,通过认知诊断实现细粒度评估并生成针对性遗忘数据,有效提升有害能力的消除效果。
English: To address the limitations of single-value metrics in evaluating LLM unlearning, the UNCD framework is introduced, employing cognitive diagnosis for fine-grained assessment and targeted data generation to effectively remove harmful knowledge.

Authors:Yatin Dandi, Luca Pesce, Lenka Zdeborová, Florent Krzakala
Title: The Computational Advantage of Depth: Learning High-Dimensional Hierarchical Functions with Gradient Descent
Abstract:
Understanding the advantages of deep neural networks trained by gradient descent (GD) compared to shallow models remains an open theoretical challenge. In this paper, we introduce a class of target functions (single and multi-index Gaussian hierarchical targets) that incorporate a hierarchy of latent subspace dimensionalities. This framework enables us to analytically study the learning dynamics and generalization performance of deep networks compared to shallow ones in the high-dimensional limit. Specifically, our main theorem shows that feature learning with GD successively reduces the effective dimensionality, transforming a high-dimensional problem into a sequence of lower-dimensional ones. This enables learning the target function with drastically less samples than with shallow networks. While the results are proven in a controlled training setting, we also discuss more common training procedures and argue that they learn through the same mechanisms.
中文总结:通过梯度下降训练的深度网络能够逐步降低问题维度来学习分层目标函数,相比浅层网络大幅减少了所需样本量。
English Summary: Deep networks trained by gradient descent learn hierarchical targets by progressively reducing dimensionality, requiring far fewer samples than shallow networks for effective learning.

Authors:Yilong Chen, Junyuan Shang, Zhenyu Zhang, Yanxi Xie, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
Title: Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking
Abstract:
Large language models (LLMs) face inherent performance bottlenecks under parameter constraints, particularly in processing critical tokens that demand complex reasoning. Empirical analysis reveals challenging tokens induce abrupt gradient spikes across layers, exposing architectural stress points in standard Transformers. Building on this insight, we propose Inner Thinking Transformer (ITT), which reimagines layer computations as implicit thinking steps. ITT dynamically allocates computation through Adaptive Token Routing, iteratively refines representations via Residual Thinking Connections, and distinguishes reasoning phases using Thinking Step Encoding. ITT enables deeper processing of critical tokens without parameter expansion. Evaluations across 162M-466M parameter models show ITT achieves 96.5\% performance of a 466M Transformer using only 162M parameters, reduces training data by 43.2\%, and outperforms Transformer/Loop variants in 11 benchmarks. By enabling elastic computation allocation during inference, ITT balances performance and efficiency through architecture-aware optimization of implicit thinking pathways.
中文: 内思变换器(ITT)通过自适应令牌路由动态分配计算资源,并利用残差思维连接迭代优化关键令牌的表征,在减少参数和训练数据的同时实现接近大型模型的性能表现。
English: The Inner Thinking Transformer (ITT) addresses performance bottlenecks in large language models by dynamically allocating computation to critical tokens and refining representations through implicit thinking steps, achieving near-equivalent performance with fewer parameters and training data.

Authors:Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Title: Demystifying Multilingual Chain-of-Thought in Process Reward Modeling
Abstract:
Large language models (LLMs) are designed to perform a wide range of tasks. To improve their ability to solve complex problems requiring multi-step reasoning, recent research leverages process reward modeling to provide fine-grained feedback at each step of the reasoning process for reinforcement learning (RL), but it predominantly focuses on English. In this paper, we tackle the critical challenge of extending process reward models (PRMs) to multilingual settings. To achieve this, we train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Through comprehensive evaluations on two widely used reasoning benchmarks across 11 languages, we demonstrate that multilingual PRMs not only improve average accuracy but also reduce early-stage reasoning errors. Furthermore, our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data, while also uncovering the benefits arising from more candidate responses and trainable parameters. This work opens promising avenues for robust multilingual applications in complex, multi-step reasoning tasks. In addition, we release the code to foster research along this line.
中文: 本文通过在多语言数据集上训练过程奖励模型,将其应用于大型语言模型以提升多语言复杂推理能力,结果表明该方法不仅能提高平均准确率、减少早期推理错误,还揭示了训练语言数量与数据规模对性能的影响。
English: This paper introduces multilingual process reward models (PRMs) trained on translated datasets to enhance complex reasoning in large language models across multiple languages, showing improved accuracy and reduced errors while highlighting the impact of training data scale and model parameters.

Authors:Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Title: Demystifying Multilingual Chain-of-Thought in Process Reward Modeling
Abstract:
Large language models (LLMs) are designed to perform a wide range of tasks. To improve their ability to solve complex problems requiring multi-step reasoning, recent research leverages process reward modeling to provide fine-grained feedback at each step of the reasoning process for reinforcement learning (RL), but it predominantly focuses on English. In this paper, we tackle the critical challenge of extending process reward models (PRMs) to multilingual settings. To achieve this, we train multilingual PRMs on a dataset spanning seven languages, which is translated from English. Through comprehensive evaluations on two widely used reasoning benchmarks across 11 languages, we demonstrate that multilingual PRMs not only improve average accuracy but also reduce early-stage reasoning errors. Furthermore, our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data, while also uncovering the benefits arising from more candidate responses and trainable parameters. This work opens promising avenues for robust multilingual applications in complex, multi-step reasoning tasks. In addition, we release the code to foster research along this line.
中文: 本文通过在多语言数据集上训练过程奖励模型,将其应用于大型语言模型以提升多语言复杂推理能力,结果表明该方法不仅能提高平均准确率、减少早期推理错误,还揭示了训练语言数量与数据规模对性能的影响。
English: This paper introduces multilingual process reward models (PRMs) trained on translated datasets to enhance complex reasoning in large language models across multiple languages, showing improved accuracy and reduced errors while highlighting the impact of training data scale and model parameters.

Authors:Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, Linfeng Zhang
Title: Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?
Abstract:
Multimodal large language models (MLLMs) have shown remarkable performance for cross-modal understanding and generation, yet still suffer from severe inference costs. Recently, abundant works have been proposed to solve this problem with token pruning, which identifies the redundant tokens in MLLMs and then prunes them to reduce the computation and KV storage costs, leading to significant acceleration without training. While these methods claim efficiency gains, critical questions about their fundamental design and evaluation remain unanswered: Why do many existing approaches underperform even compared to naive random token selection? Are attention-based scoring sufficient for reliably identifying redundant tokens? Is language information really helpful during token pruning? What makes a good trade-off between token importance and duplication? Are current evaluation protocols comprehensive and unbiased? The ignorance of previous research on these problems hinders the long-term development of token pruning. In this paper, we answer these questions one by one, providing insights into the design of future token pruning methods.
中文摘要:本文针对多模态大语言模型中现有令牌剪枝方法在设计和评估上的根本问题提出批判,旨在通过逐一解答这些核心疑问,为未来高效模型的令牌剪枝技术发展提供理论指导。
English Summary: This abstract critiques current token pruning methods in multimodal large language models for their unresolved design and evaluation issues, proposing to address these fundamental questions to guide future efficient model development.

Authors:Aili Chen, Chengyu Du, Jiangjie Chen, Jinghan Xu, Yikai Zhang, Siyu Yuan, Zulong Chen, Liangyue Li, Yanghua Xiao
Title: DEEPER Insight into Your User: Directed Persona Refinement for Dynamic Persona Modeling
Abstract:
To advance personalized applications such as recommendation systems and user behavior prediction, recent research increasingly adopts large language models (LLMs) for human -readable persona modeling. In dynamic real -world scenarios, effective persona modeling necessitates leveraging streaming behavior data to continually optimize user personas. However, existing methods -whether regenerating personas or incrementally extending them with new behaviors -often fail to achieve sustained improvements in persona quality or future behavior prediction accuracy. To address this, we propose DEEPER, a novel approach for dynamic persona modeling that enables continual persona optimization. Specifically, we enhance the model's direction -search capability through an iterative reinforcement learning framework, allowing it to automatically identify effective update directions and optimize personas using discrepancies between user behaviors and model predictions. Extensive experiments on dynamic persona modeling involving 4800 users across 10 domains highlight the superior persona optimization capabilities of DEEPER, delivering an impressive 32.2% average reduction in user behavior prediction error over four update rounds -outperforming the best baseline by a remarkable 22.92%.
中文摘要:研究者提出DEEPER方法,通过强化学习框架利用行为与预测差异持续优化动态用户画像,在跨领域实验中使行为预测误差平均降低32.2%,显著优于现有基线模型。
English Summary: Researchers introduce DEEPER, a reinforcement learning-based method for dynamic persona modeling that continuously optimizes user personas by leveraging behavior-prediction discrepancies, achieving a 32.2% reduction in prediction error across multiple domains.

Authors:Alexander Wettig, Kyle Lo, Sewon Min, Hannaneh Hajishirzi, Danqi Chen, Luca Soldaini
Title: Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
Abstract:
Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web. The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods will implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.
中文: 本文提出WebOrganizer框架,通过主题和格式对网络数据进行领域划分,利用大模型蒸馏标注实现领域混合策略,证明该方法能有效提升下游任务性能,并与基于质量的数据筛选方法形成互补。
English: This paper introduces WebOrganizer, a framework that organizes web data into domains by topic and format using distilled LLM annotations, showing that strategic domain mixing enhances model performance and complements quality-based data curation methods.

Authors:Tharindu Fernando, Darshana Priyasad, Sridha Sridharan, Arun Ross, Clinton Fookes
Title: Face Deepfakes -- A Comprehensive Review
Abstract:
In recent years, remarkable advancements in deep-fake generation technology have led to unprecedented leaps in its realism and capabilities. Despite these advances, we observe a notable lack of structured and deep analysis deepfake technology. The principal aim of this survey is to contribute a thorough theoretical analysis of state-of-the-art face deepfake generation and detection methods. Furthermore, we provide a coherent and systematic evaluation of the implications of deepfakes on face biometric recognition approaches. In addition, we outline key applications of face deepfake technology, elucidating both positive and negative applications of the technology, provide a detailed discussion regarding the gaps in existing research, and propose key research directions for further investigation.
Chinese: 本综述对先进的人脸深度伪造生成与检测方法进行了全面的理论分析和系统评估,同时概述了其应用场景、研究空白及未来研究方向。
English: This survey provides a comprehensive theoretical analysis and systematic evaluation of advanced face deepfake generation and detection methods, while also outlining their applications, research gaps, and future directions.

Authors:Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, Eiko Yoneki
Title: ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
Abstract:
Recent developments in large language models (LLMs) have demonstrated their remarkable proficiency in a range of tasks. Compared to in-house homogeneous GPU clusters, deploying LLMs in cloud environments with diverse types of GPUs is crucial for addressing the GPU shortage problem and being more cost-effective. However, the diversity of network environments and various GPU types on the cloud bring difficulties to achieving high-performance serving. In this work, we propose ThunderServe, a high-performance and cost-efficient LLM serving system for heterogeneous cloud environments. We introduce a novel scheduling algorithm, which optimizes the deployment plan of LLM serving to accommodate the heterogeneous resource and network bandwidth conditions in cloud environments. Furthermore, we propose a lightweight re-scheduling mechanism, designed to adapt to fluctuating online conditions (e.g., node failures, workload shifts) without the need for costly restarts of ongoing services. Empirical results in both heterogeneous cloud and homogeneous in-house environments reveal that ThunderServe delivers up to a 2.1$\times$ and on average a $1.7\times$ increase in throughput and achieves up to a 2.5$\times$ and on average a $1.5\times$ reduction in latency deadlines compared with state-of-the-art systems given the same price budget, suggesting opting for cloud services provides a more cost-efficient solution.
中文: ThunderServe 是一种专为异构云环境设计的高性能大语言模型服务系统,通过创新的调度算法和轻量级重调度机制,在相同预算下显著提升了吞吐量并降低了延迟,优于现有最优系统。
English: ThunderServe is a high-performance LLM serving system designed for heterogeneous cloud environments, utilizing a novel scheduling algorithm and lightweight re-scheduling to significantly improve throughput and reduce latency compared to state-of-the-art systems within the same budget.

Authors:Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen-tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Shuchang Zhou, Wei Wang, Yanghua Xiao
Title: CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
Abstract:
Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this paper, we present CoSER, a collection of a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as conversation setups, character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively.
中文: 角色扮演语言代理在模拟既定角色时面临真实数据集和细致评估方法缺失的挑战,而CoSER框架通过高质量数据集、开源模型和评估协议解决了这一问题,使模型在多个基准测试中达到或超越了GPT-4o的先进性能。
English: Role-playing language agents face challenges in simulating established characters due to the lack of authentic datasets and nuanced evaluation methods, but the CoSER framework addresses this with a high-quality dataset, open models, and an evaluation protocol that enables advanced performance matching or surpassing GPT-4o on multiple benchmarks.

Authors:Yue Huang, Chujie Gao, Yujun Zhou, Kehan Guo, Xiangqi Wang, Or Cohen-Sasson, Max Lamparth, Xiangliang Zhang
Title: Position: We Need An Adaptive Interpretation of Helpful, Honest, and Harmless Principles
Abstract:
The Helpful, Honest, and Harmless (HHH) principle is a foundational framework for aligning AI systems with human values. However, existing interpretations of the HHH principle often overlook contextual variability and conflicting requirements across applications. In this paper, we argue for an adaptive interpretation of the HHH principle and propose a reference framework for its adaptation to diverse scenarios. We first examine the principle's foundational significance and identify ambiguities and conflicts through case studies of its dimensions. To address these challenges, we introduce the concept of priority order, which provides a structured approach for balancing trade-offs among helpfulness, honesty, and harmlessness. Further, we explore the interrelationships between these dimensions, demonstrating how harmlessness and helpfulness can be jointly enhanced and analyzing their interdependencies in high-risk evaluations. Building on these insights, we propose a reference framework that integrates context definition, value prioritization, risk assessment, and benchmarking standards to guide the adaptive application of the HHH principle. This work offers practical insights for improving AI alignment, ensuring that HHH principles remain both ethically grounded and operationally effective in real-world AI deployment.
中文摘要:本文主张对人工智能对齐中的有益、诚实和无害(HHH)原则进行适应性解读,提出了一个整合情境定义、价值排序和风险评估的结构化框架,以解决原则模糊性并提升其在不同场景中的实际应用效能。
English Summary: This paper advocates for an adaptive interpretation of the Helpful, Honest, and Harmless (HHH) principle in AI alignment, proposing a structured framework that incorporates context definition, value prioritization, and risk assessment to address ambiguities and enhance operational effectiveness across diverse scenarios.

Authors:Jingheng Ye, Shen Wang, Deqing Zou, Yibo Yan, Kun Wang, Hai-Tao Zheng, Ruitong Liu, Zenglin Xu, Irwin King, Philip S. Yu, Qingsong Wen
Title: Position: LLMs Can be Good Tutors in English Education
Abstract:
While recent efforts have begun integrating large language models (LLMs) into English education, they often rely on traditional approaches to learning tasks without fully embracing educational methodologies, thus lacking adaptability to language learning. To address this gap, we argue that LLMs have the potential to serve as effective tutors in English Education. Specifically, LLMs can play three critical roles: (1) as data enhancers, improving the creation of learning materials or serving as student simulations; (2) as task predictors, serving as learner assessment or optimizing learning pathway; and (3) as agents, enabling personalized and inclusive education. We encourage interdisciplinary research to explore these roles, fostering innovation while addressing challenges and risks, ultimately advancing English Education through the thoughtful integration of LLMs.
中文: 大语言模型可作为数据增强器、任务预测器和个性化代理,通过三种关键角色提升英语教育的适应性与创新性。
English: Large language models can enhance English education by serving as data enhancers, task predictors, and personalized agents to create adaptive learning experiences.

Authors:Tuan Truong, Chau Nguyen, Huy Nguyen, Minh Le, Trung Le, Nhat Ho
Title: RepLoRA: Reparameterizing Low-Rank Adaptation via the Perspective of Mixture of Experts
Abstract:
Low-rank Adaptation (LoRA) has emerged as a powerful method for fine-tuning large-scale foundation models. Despite its popularity, the theoretical understanding of LoRA has remained limited. This paper presents a theoretical analysis of LoRA by examining its connection to the Mixture of Experts models. Under this framework, we show that simple reparameterizations of the LoRA matrices can notably accelerate the low-rank matrix estimation process. In particular, we prove that reparameterization can reduce the data needed to achieve a desired estimation error from an exponential to a polynomial scale. Motivated by this insight, we propose Reparameterized Low-Rank Adaptation (RepLoRA), which incorporates lightweight MLPs to reparameterize the LoRA matrices. Extensive experiments across multiple domains demonstrate that RepLoRA consistently outperforms vanilla LoRA. Notably, with limited data, RepLoRA surpasses LoRA by a margin of up to 40.0% and achieves LoRA's performance with only 30.0% of the training data, highlighting both the theoretical and empirical robustness of our PEFT method.
中文: 本文从理论上将LoRA与混合专家模型联系起来,提出RepLoRA方法,通过重参数化显著降低数据需求,并在性能上超越标准LoRA。
English: This paper theoretically links LoRA to Mixture of Experts models, proposing RepLoRA which uses reparameterization to significantly reduce data requirements and improve performance over standard LoRA.

Authors:Nghiem T. Diep, Huy Nguyen, Chau Nguyen, Minh Le, Duy M. H. Nguyen, Daniel Sonntag, Mathias Niepert, Nhat Ho
Title: On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation
Abstract:
The LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-expert models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.
Chinese: 本文通过理论分析将LLaMA-Adapter中的零初始化注意力与专家混合模型联系起来,证明非线性提示优于线性提示,且两者在性能和适应性上都超越了传统注意力机制。
English: This paper provides a theoretical analysis linking zero-initialized attention in LLaMA-Adapter to mixture-of-expert models, demonstrating that non-linear prompts outperform linear ones and both surpass vanilla attention in performance and adaptability.

Authors:Pau Colomer, Christian Deppe, Holger Boche, Andreas Winter
Title: Rate-reliability tradeoff for deterministic identification
Abstract:
We investigate deterministic identification over arbitrary memoryless channels under the constraint that the error probabilities of first and second kind are exponentially small in the block length $\mathbf{n}$, controlled by reliability exponents $\mathbf{E_1,E_2 \geq 0}$. In contrast to the regime of slowly vanishing errors, where the identifiable message length scales linearithmically as $\mathbf{Θ(n\log n)}$, here we find that for positive exponents linear scaling is restored, now with a rate that is a function of the reliability exponents. We give upper and lower bounds on the ensuing rate-reliability function in terms of (the logarithm of) the packing and covering numbers of the channel output set, which for small error exponents $\mathbf{E_1,E_2>0}$ can be expanded in leading order as the product of the Minkowski dimension of a certain parametrisation the channel output set and $\mathbf{\log\min\{E_1,E_2\}}$. These allow us to recover the previously observed slightly superlinear identification rates, and offer a different perspective for understanding them in more traditional information theory terms. We also show that even if only one of the two errors is required to be exponentially small, the linearithmic scaling is lost. We further illustrate our results with a discussion of the case of dimension zero, and extend them to classical-quantum channels and quantum channels with tensor product input restriction.
中文: 本研究证明,在无记忆信道上,当错误概率随块长度呈指数下降时,确定性识别的消息长度可实现线性缩放,与缓慢消失错误下的线性对数缩放形成对比,并基于打包和覆盖数给出了速率-可靠性函数的上下界。
English: This study demonstrates that deterministic identification over memoryless channels achieves linear scaling in message length when error probabilities decrease exponentially with block length, contrasting with the linearithmic scaling observed for slowly vanishing errors, and provides bounds on the rate-reliability function based on packing and covering numbers.

Authors:Jie Ren, Yuhang Zhang, Dongrui Liu, Xiaopeng Zhang, Qi Tian
Title: Refining Alignment Framework for Diffusion Models with Intermediate-Step Preference Ranking
Abstract:
Direct preference optimization (DPO) has shown success in aligning diffusion models with human preference. Previous approaches typically assume a consistent preference label between final generations and noisy samples at intermediate steps, and directly apply DPO to these noisy samples for fine-tuning. However, we theoretically identify inherent issues in this assumption and its impacts on the effectiveness of preference alignment. We first demonstrate the inherent issues from two perspectives: gradient direction and preference order, and then propose a Tailored Preference Optimization (TailorPO) framework for aligning diffusion models with human preference, underpinned by some theoretical insights. Our approach directly ranks intermediate noisy samples based on their step-wise reward, and effectively resolves the gradient direction issues through a simple yet efficient design. Additionally, we incorporate the gradient guidance of diffusion models into preference alignment to further enhance the optimization effectiveness. Experimental results demonstrate that our method significantly improves the model's ability to generate aesthetically pleasing and human-preferred images.
中文: 本文提出TailorPO框架,通过基于逐步奖励对中间噪声样本排序并结合梯度引导,解决了扩散模型与人类偏好对齐的理论缺陷,显著提升了图像生成的美学质量和人类偏好度。
English: This paper introduces TailorPO, a novel framework that addresses theoretical flaws in aligning diffusion models with human preferences by ranking intermediate noisy samples based on step-wise rewards and incorporating gradient guidance, significantly improving image generation quality.

Authors:Vinay Kumar, Claudio Cicconetti, Marco Conti, Andrea Passarella
Title: Quantum Internet: Technologies, Protocols, and Research Challenges
Abstract:
As the field of the quantum internet advances, a comprehensive guide to navigate its complexities has become increasingly crucial. While quantum computing shares foundational principles with the quantum internet, distinguishing between the two is essential for further development and deeper understanding. This work systematically introduces the quantum internet by discussing its importance, core components, operational mechanisms, anticipated timeline for viability, key contributors, major challenges, and future directions. Additionally, it presents the fundamental concepts of quantum mechanics that underpin the technology, offering a clear and targeted overview intended for researchers and industry professionals and laying the groundwork for future innovations and research in the field.
中文: 本文系统介绍了量子互联网的重要性、核心要素、运行机制、发展时间表及未来方向,并阐释了支撑该技术的量子力学基础,为研究人员和行业专家提供了清晰的概述。
English: This work provides a systematic guide to the quantum internet, outlining its significance, components, mechanisms, timeline, challenges, and future directions, while explaining the underlying quantum principles for researchers and professionals.

Authors:Emanuele Troiani, Hugo Cui, Yatin Dandi, Florent Krzakala, Lenka Zdeborová
Title: Fundamental limits of learning in sequence multi-index models and deep attention networks: High-dimensional asymptotics and sharp thresholds
Abstract:
In this manuscript, we study the learning of deep attention neural networks, defined as the composition of multiple self-attention layers, with tied and low-rank weights. We first establish a mapping of such models to sequence multi-index models, a generalization of the widely studied multi-index model to sequential covariates, for which we establish a number of general results. In the context of Bayesian-optimal learning, in the limit of large dimension $D$ and commensurably large number of samples $N$, we derive a sharp asymptotic characterization of the optimal performance as well as the performance of the best-known polynomial-time algorithm for this setting --namely approximate message-passing--, and characterize sharp thresholds on the minimal sample complexity required for better-than-random prediction performance. Our analysis uncovers, in particular, how the different layers are learned sequentially. Finally, we discuss how this sequential learning can also be observed in a realistic setup.
中文: 本研究探讨了具有绑定和低秩权重的深度注意力神经网络,将其映射为序列多指标模型,并在贝叶斯最优学习中推导出渐近性能极限,揭示了层级顺序学习模式。
English: This research investigates deep attention neural networks with tied and low-rank weights, mapping them to sequence multi-index models and deriving asymptotic performance limits in Bayesian-optimal learning, revealing sequential layer learning patterns.

Authors:Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, Eiko Yoneki
Title: Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs
Abstract:
Recent advancements in Large Language Models (LLMs) have led to increasingly diverse requests, accompanied with varying resource (compute and memory) demands to serve them. However, this in turn degrades the cost-efficiency of LLM serving as common practices primarily rely on homogeneous GPU resources. In response to this problem, this work conducts a thorough study about serving LLMs over heterogeneous GPU resources on cloud platforms. The rationale is that different GPU types exhibit distinct compute and memory characteristics, aligning well with the divergent resource demands of diverse requests. Particularly, through comprehensive benchmarking, we discover that the cost-efficiency of LLM serving can be substantially optimized by meticulously determining GPU composition, deployment configurations, and workload assignments. Subsequently, we design a scheduling algorithm via mixed-integer linear programming, aiming at deducing the most cost-efficient serving plan under the constraints of price budget and real-time GPU availability. Remarkably, our approach effectively outperforms homogeneous and heterogeneous baselines under a wide array of scenarios, covering diverse workload traces, varying GPU availablilities, and multi-model serving. This casts new light on more accessible and efficient LLM serving over heterogeneous cloud resources.
中文摘要:本研究通过异构GPU资源配置和混合整数线性规划调度算法,证明了在满足多样化工作负载需求的同时,能够显著提升大语言模型服务的成本效益。
English Summary: This study demonstrates that optimizing Large Language Model serving through heterogeneous GPU resource allocation and a mixed-integer linear programming scheduler significantly enhances cost-efficiency while accommodating diverse workload demands.

Authors:Nishant Balepur, Alexa Siu, Nedim Lipka, Franck Dernoncourt, Tong Sun, Jordan Boyd-Graber, Puneet Mathur
Title: MODS: Moderating a Mixture of Document Speakers to Summarize Debatable Queries in Document Collections
Abstract:
Query-focused summarization (QFS) gives a summary of documents to answer a query. Past QFS work assumes queries have one answer, ignoring debatable ones (Is law school worth it?). We introduce Debatable QFS (DQFS), a task to create summaries that answer debatable queries via documents with opposing perspectives; summaries must comprehensively cover all sources and balance perspectives, favoring no side. These goals elude LLM QFS systems, which: 1) lack structured content plans, failing to guide LLMs to write balanced summaries, and 2) use the same query to retrieve contexts across documents, failing to cover all perspectives specific to each document's content. To overcome this, we design MODS, a multi-LLM framework mirroring human panel discussions. MODS treats documents as individual Speaker LLMs and has a Moderator LLM that picks speakers to respond to tailored queries for planned topics. Speakers use tailored queries to retrieve relevant contexts from their documents and supply perspectives, which are tracked in a rich outline, yielding a content plan to guide the final summary. Experiments on ConflictingQA with controversial web queries and DebateQFS, our new dataset of debate queries from Debatepedia, show MODS beats SOTA by 38-59% in topic paragraph coverage and balance, based on new citation metrics. Users also find MODS's summaries to be readable and more balanced.
中文摘要:可辩论查询聚焦摘要(DQFS)任务旨在通过整合对立视角的文档来回答具有争议性的查询,其MODS框架模拟人类讨论,在覆盖率和平衡性上显著优于现有系统。
English Summary: Debatable Query-Focused Summarization (DQFS) is introduced to address queries with multiple perspectives by generating balanced summaries from opposing documents, using the MODS framework that outperforms state-of-the-art systems in coverage and balance.

Authors:Fanqi Yan, Huy Nguyen, Pedram Akbarian, Nhat Ho, Alessandro Rinaldo
Title: Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective
Abstract:
At the core of the popular Transformer architecture is the self-attention mechanism, which dynamically assigns softmax weights to each input token so that the model can focus on the most salient information. However, the softmax structure slows down the attention computation due to its row-wise nature, and it inherently introduces competition among tokens: as the weight assigned to one token increases, the weights of others decrease. This competitive dynamic may narrow the focus of self-attention to a limited set of features, potentially overlooking other informative characteristics. Recent experimental studies have shown that using the element-wise sigmoid function helps eliminate token competition and reduce the computational overhead. Despite these promising empirical results, a rigorous comparison between sigmoid and softmax self-attention mechanisms remains absent in the literature. This paper closes this gap by theoretically demonstrating that sigmoid self-attention is more sample-efficient than its softmax counterpart. Toward that goal, we represent the self-attention matrix as a mixture of experts and show that ``experts'' in sigmoid self-attention require significantly less data to achieve the same approximation error as those in softmax self-attention.
中文: 本文从理论上证明,sigmoid自注意力机制比softmax更具样本效率,因为它消除了令牌间的竞争,并需要更少的数据达到相同的近似误差。
English: This paper theoretically demonstrates that sigmoid self-attention is more sample-efficient than softmax self-attention, as it eliminates token competition and requires less data to achieve the same approximation error.

Authors:Youran Zhou, Mohamed Reda Bouadjenek, Sunil Aryal
Title: Developing robust methods to handle missing data in real-world applications effectively
Abstract:
Missing data is a pervasive challenge spanning diverse data types, including tabular, sensor data, time-series, images and so on. Its origins are multifaceted, resulting in various missing mechanisms. Prior research in this field has predominantly revolved around the assumption of the Missing Completely At Random (MCAR) mechanism. However, Missing At Random (MAR) and Missing Not At Random (MNAR) mechanisms, though equally prevalent, have often remained underexplored despite their significant influence. This PhD project presents a comprehensive research agenda designed to investigate the implications of diverse missing data mechanisms. The principal aim is to devise robust methodologies capable of effectively handling missing data while accommodating the unique characteristics of MCAR, MAR, and MNAR mechanisms. By addressing these gaps, this research contributes to an enriched understanding of the challenges posed by missing data across various industries and data modalities. It seeks to provide practical solutions that enable the effective management of missing data, empowering researchers and practitioners to leverage incomplete datasets confidently.
中文: 该博士研究针对未被充分探索但普遍存在的MAR和MNAR缺失数据机制,开发能够处理各行业多种缺失数据类型的稳健方法,使研究人员能够放心使用不完整数据集。
English: This PhD research addresses the underexplored yet prevalent MAR and MNAR missing data mechanisms by developing robust methodologies to handle diverse missing data types across industries, enabling confident use of incomplete datasets.

Authors:Linshan Wu, Jiaxin Zhuang, Yanning Zhou, Sunan He, Jiabo Ma, Luyang Luo, Xi Wang, Xuefeng Ni, Xiaoling Zhong, Mingxiang Wu, Yinghua Zhao, Xiaohui Duan, Varut Vardhanabhuti, Pranav Rajpurkar, Hao Chen
Title: FreeTumor: Large-Scale Generative Tumor Synthesis in Computed Tomography Images for Improving Tumor Recognition
Abstract:
Tumor is a leading cause of death worldwide, with an estimated 10 million deaths attributed to tumor-related diseases every year. AI-driven tumor recognition unlocks new possibilities for more precise and intelligent tumor screening and diagnosis. However, the progress is heavily hampered by the scarcity of annotated datasets, which demands extensive annotation efforts by radiologists. To tackle this challenge, we introduce FreeTumor, an innovative Generative AI (GAI) framework to enable large-scale tumor synthesis for mitigating data scarcity. Specifically, FreeTumor effectively leverages a combination of limited labeled data and large-scale unlabeled data for tumor synthesis training. Unleashing the power of large-scale data, FreeTumor is capable of synthesizing a large number of realistic tumors on images for augmenting training datasets. To this end, we create the largest training dataset for tumor synthesis and recognition by curating 161,310 publicly available Computed Tomography (CT) volumes from 33 sources, with only 2.3% containing annotated tumors. To validate the fidelity of synthetic tumors, we engaged 13 board-certified radiologists in a Visual Turing Test to discern between synthetic and real tumors. Rigorous clinician evaluation validates the high quality of our synthetic tumors, as they achieved only 51.1% sensitivity and 60.8% accuracy in distinguishing our synthetic tumors from real ones. Through high-quality tumor synthesis, FreeTumor scales up the recognition training datasets by over 40 times, showcasing a notable superiority over state-of-the-art AI methods including various synthesis methods and foundation models. These findings indicate promising prospects of FreeTumor in clinical applications, potentially advancing tumor treatments and improving the survival rates of patients.
中文:AI驱动的肿瘤诊断面临数据稀缺,但FreeTumor框架通过有限标注和大量未标注CT数据合成逼真肿瘤,将训练数据集扩大40倍以上,并在临床评估中展现出高度真实性。
English: AI-driven tumor diagnosis faces data scarcity, but the FreeTumor framework synthesizes realistic tumors using limited labeled and extensive unlabeled CT data, scaling training datasets over 40 times and demonstrating high fidelity in clinician evaluations.

Authors:Tahsin Alamgir Kheya, Mohamed Reda Bouadjenek, Sunil Aryal
Title: Unmasking Gender Bias in Recommendation Systems and Enhancing Category-Aware Fairness
Abstract:
Recommendation systems are now an integral part of our daily lives. We rely on them for tasks such as discovering new movies, finding friends on social media, and connecting job seekers with relevant opportunities. Given their vital role, we must ensure these recommendations are free from societal stereotypes. Therefore, evaluating and addressing such biases in recommendation systems is crucial. Previous work evaluating the fairness of recommended items fails to capture certain nuances as they mainly focus on comparing performance metrics for different sensitive groups. In this paper, we introduce a set of comprehensive metrics for quantifying gender bias in recommendations. Specifically, we show the importance of evaluating fairness on a more granular level, which can be achieved using our metrics to capture gender bias using categories of recommended items like genres for movies. Furthermore, we show that employing a category-aware fairness metric as a regularization term along with the main recommendation loss during training can help effectively minimize bias in the models' output. We experiment on three real-world datasets, using five baseline models alongside two popular fairness-aware models, to show the effectiveness of our metrics in evaluating gender bias. Our metrics help provide an enhanced insight into bias in recommended items compared to previous metrics. Additionally, our results demonstrate how incorporating our regularization term significantly improves the fairness in recommendations for different categories without substantial degradation in overall recommendation performance.
中文摘要:本文提出了一套通过分析项目类别来评估推荐系统中性别偏见的综合指标,并证明在训练过程中将这些指标作为正则化项可有效减少偏见,同时保持推荐性能。
English Summary: This paper introduces comprehensive metrics for evaluating gender bias in recommendation systems by analyzing item categories, and demonstrates that incorporating these metrics as regularization during training effectively reduces bias while maintaining performance.

Authors:Alice Natalina Caragliano, Filippo Ruffini, Carlo Greco, Edy Ippolito, Michele Fiore, Claudia Tacconi, Lorenzo Nibid, Giuseppe Perrone, Sara Ramella, Paolo Soda, Valerio Guarrasi
Title: Doctor-in-the-Loop: An Explainable, Multi-View Deep Learning Framework for Predicting Pathological Response in Non-Small Cell Lung Cancer
Abstract:
Non-small cell lung cancer (NSCLC) remains a major global health challenge, with high post-surgical recurrence rates underscoring the need for accurate pathological response predictions to guide personalized treatments. Although artificial intelligence models show promise in this domain, their clinical adoption is limited by the lack of medically grounded guidance during training, often resulting in non-explainable intrinsic predictions. To address this, we propose Doctor-in-the-Loop, a novel framework that integrates expert-driven domain knowledge with explainable artificial intelligence techniques, directing the model toward clinically relevant anatomical regions and improving both interpretability and trustworthiness. Our approach employs a gradual multi-view strategy, progressively refining the model's focus from broad contextual features to finer, lesion-specific details. By incorporating domain insights at every stage, we enhance predictive accuracy while ensuring that the model's decision-making process aligns more closely with clinical reasoning. Evaluated on a dataset of NSCLC patients, Doctor-in-the-Loop delivers promising predictive performance and provides transparent, justifiable outputs, representing a significant step toward clinically explainable artificial intelligence in oncology.
中文: Doctor-in-the-Loop框架通过将医学专业知识与可解释人工智能相结合,引导模型关注临床相关特征,从而提升非小细胞肺癌治疗的预测准确性和决策透明度。
English: The Doctor-in-the-Loop framework integrates medical expertise with explainable AI to enhance predictive accuracy and transparency in NSCLC treatment by guiding models toward clinically relevant features.

Authors:Yuxiang Guo, Yuren Mao, Zhonghao Hu, Lu Chen, Yunjun Gao
Title: Snoopy: Effective and Efficient Semantic Join Discovery via Proxy Columns
Abstract:
Semantic join discovery, which aims to find columns in a table repository with high semantic joinabilities to a query column, is crucial for dataset discovery. Existing methods can be divided into two categories: cell-level methods and column-level methods. However, neither of them ensures both effectiveness and efficiency simultaneously. Cell-level methods, which compute the joinability by counting cell matches between columns, enjoy ideal effectiveness but suffer poor efficiency. In contrast, column-level methods, which determine joinability only by computing the similarity of column embeddings, enjoy proper efficiency but suffer poor effectiveness due to the issues occurring in their column embeddings: (i) semantics-joinability-gap, (ii) size limit, and (iii) permutation sensitivity. To address these issues, this paper proposes to compute column embeddings via proxy columns; furthermore, a novel column-level semantic join discovery framework, Snoopy, is presented, leveraging proxy-column-based embeddings to bridge effectiveness and efficiency. Specifically, the proposed column embeddings are derived from the implicit column-to-proxy-column relationships, which are captured by the lightweight approximate-graph-matching-based column projection.To acquire good proxy columns for guiding the column projection, we introduce a rank-aware contrastive learning paradigm. Extensive experiments on four real-world datasets demonstrate that Snoopy outperforms SOTA column-level methods by 16% in Recall@25 and 10% in NDCG@25, and achieves superior efficiency--being at least 5 orders of magnitude faster than cell-level solutions, and 3.5x faster than existing column-level methods.
中文摘要:本文提出了一种新颖的列级语义连接发现框架Snoopy,通过基于代理列的嵌入方法解决了现有方法的局限性,在保持高召回率和NDCG指标的同时,实现了比单元级方法快五个数量级、比现有列级方法快3.5倍的卓越效率。
English Summary: This paper introduces Snoopy, a novel column-level semantic join discovery framework that uses proxy-column-based embeddings to overcome the limitations of existing methods, achieving both high effectiveness and superior efficiency by being significantly faster than cell-level approaches and outperforming state-of-the-art column-level methods in key metrics.

Authors:Javier Conde, Miguel González, Pedro Reviriego, Zhen Gao, Shanshan Liu, Fabrizio Lombardi
Title: Speed and Conversational Large Language Models: Not All Is About Tokens per Second
Abstract:
The speed of open-weights large language models (LLMs) and its dependency on the task at hand, when run on GPUs, is studied to present a comparative analysis of the speed of the most popular open LLMs.
中文: 本研究分析了主流开源大语言模型在GPU上的运行速度及其随任务类型的变化情况。
English: This study analyzes the speed of popular open-weights large language models on GPUs and how it varies with different tasks.

Authors:Javier Conde, Pedro Reviriego, Joaquín Salvachúa, Gonzalo Martínez, José Alberto Hernández, Fabrizio Lombardi
Title: Understanding the Impact of Artificial Intelligence in Academic Writing: Metadata to the Rescue
Abstract:
This column advocates for including artificial intelligence (AI)-specific metadata on those academic papers that are written with the help of AI in an attempt to analyze the use of such tools for disseminating research.
中文: 本专栏主张在人工智能辅助撰写的学术论文中添加特定元数据,以分析此类工具在传播研究中的应用情况。
English: This column proposes adding AI-specific metadata to academic papers assisted by AI to analyze how such tools are used in research dissemination.

Authors:Javier Conde, Gonzalo Martínez, Pedro Reviriego, Zhen Gao, Shanshan Liu, Fabrizio Lombardi
Title: Can ChatGPT Learn to Count Letters?
Abstract:
Large language models (LLMs) struggle on simple tasks such as counting the number of occurrences of a letter in a word. In this paper, we investigate if ChatGPT can learn to count letters and propose an efficient solution.
Chinese: 本文探讨了ChatGPT是否能够学习计算单词中字母的出现次数,并提出了一种高效的解决方案。
English: This paper explores whether ChatGPT can learn to count letter occurrences in words and presents an effective method to address this challenge.

Authors:Wenhao Hu, Wenhao Chai, Shengyu Hao, Xiaotong Cui, Xuexiang Wen, Jenq-Neng Hwang, Gaoang Wang
Title: Pointmap Association and Piecewise-Plane Constraint for Consistent and Compact 3D Gaussian Segmentation Field
Abstract:
Achieving a consistent and compact 3D segmentation field is crucial for maintaining semantic coherence across views and accurately representing scene structures. Previous 3D scene segmentation methods rely on video segmentation models to address inconsistencies across views, but the absence of spatial information often leads to object misassociation when object temporarily disappear and reappear. Furthermore, in the process of 3D scene reconstruction, segmentation and optimization are often treated as separate tasks. As a result, optimization typically lacks awareness of semantic category information, which can result in floaters with ambiguous segmentation. To address these challenges, we introduce CCGS, a method designed to achieve both view consistent 2D segmentation and a compact 3D Gaussian segmentation field. CCGS incorporates pointmap association and a piecewise-plane constraint. First, we establish pixel correspondence between adjacent images by minimizing the Euclidean distance between their pointmaps. We then redefine object mask overlap accordingly. The Hungarian algorithm is employed to optimize mask association by minimizing the total matching cost, while allowing for partial matches. To further enhance compactness, the piecewise-plane constraint restricts point displacement within local planes during optimization, thereby preserving structural integrity. Experimental results on ScanNet and Replica datasets demonstrate that CCGS outperforms existing methods in both 2D panoptic segmentation and 3D Gaussian segmentation.
Chinese: CCGS通过引入点图关联和分段平面约束,实现了视图一致的2D分割和紧凑的3D高斯分割场,在多个数据集上超越了现有方法。
English: CCGS introduces pointmap association and a piecewise-plane constraint to achieve consistent 2D segmentation and compact 3D Gaussian segmentation, outperforming existing methods on benchmark datasets.

Authors:Shenzhi Yang, Junbo Zhao, Shouqing Yang, Yixuan Li, Dingyu Yang, Xiaofang Zhang, Haobo Wang
Title: Category-free Out-of-Distribution Node Detection with Feature Resonance
Abstract:
Detecting out-of-distribution (OOD) nodes in the graph-based machine-learning field is challenging, particularly when in-distribution (ID) node multi-category labels are unavailable. Thus, we focus on feature space rather than label space and find that, ideally, during the optimization of known ID samples, unknown ID samples undergo more significant representation changes than OOD samples, even if the model is trained to fit random targets, which we called the Feature Resonance phenomenon. The rationale behind it is that even without gold labels, the local manifold may still exhibit smooth resonance. Based on this, we further develop a novel graph OOD framework, dubbed Resonance-based Separation and Learning (RSL), which comprises two core modules: (i) a more practical micro-level proxy of feature resonance that measures the movement of feature vectors in one training step. (ii) integrate with synthetic OOD nodes strategy to train an effective OOD classifier. Theoretically, we derive an error bound showing the superior separability of OOD nodes during the resonance period. Empirically, RSL achieves state-of-the-art performance, reducing the FPR95 metric by an average of 18.51% across five real-world datasets.
中文摘要:本研究提出特征共振现象,即模型训练期间分布内节点比分布外节点表现出更大的表征变化,基于此开发的共振分离学习框架在分布外检测中实现了最优性能。
English Summary: The study introduces a Feature Resonance phenomenon where in-distribution nodes show greater representation changes than out-of-distribution nodes during model training, leading to the Resonance-based Separation and Learning framework that achieves state-of-the-art OOD detection performance.

Authors:Weiming Liu, Chaochao Chen, Jiahe Xu, Xinting Liao, Fan Wang, Xiaolin Zheng, Zhihui Fu, Ruiguang Pei, Jun Wang
Title: Joint Similarity Item Exploration and Overlapped User Guidance for Multi-Modal Cross-Domain Recommendation
Abstract:
Cross-Domain Recommendation (CDR) has been widely investigated for solving long-standing data sparsity problem via knowledge sharing across domains. In this paper, we focus on the Multi-Modal Cross-Domain Recommendation (MMCDR) problem where different items have multi-modal information while few users are overlapped across domains. MMCDR is particularly challenging in two aspects: fully exploiting diverse multi-modal information within each domain and leveraging useful knowledge transfer across domains. However, previous methods fail to cluster items with similar characteristics while filtering out inherit noises within different modalities, hurdling the model performance. What is worse, conventional CDR models primarily rely on overlapped users for domain adaptation, making them ill-equipped to handle scenarios where the majority of users are non-overlapped. To fill this gap, we propose Joint Similarity Item Exploration and Overlapped User Guidance (SIEOUG) for solving the MMCDR problem. SIEOUG first proposes similarity item exploration module, which not only obtains pair-wise and group-wise item-item graph knowledge, but also reduces irrelevant noise for multi-modal modeling. Then SIEOUG proposes user-item collaborative filtering module to aggregate user/item embeddings with the attention mechanism for collaborative filtering. Finally SIEOUG proposes overlapped user guidance module with optimal user matching for knowledge sharing across domains. Our empirical study on Amazon dataset with several different tasks demonstrates that SIEOUG significantly outperforms the state-of-the-art models under the MMCDR setting.
中文: 本文提出SIEOUG模型,通过相似项目探索模块有效聚类多模态项目并过滤噪声,结合重叠用户引导模块实现跨领域知识迁移,在亚马逊数据集上的实验表明该模型在多模态跨领域推荐任务中显著优于现有方法。
English: This paper introduces SIEOUG, a novel model that addresses the Multi-Modal Cross-Domain Recommendation challenge by effectively clustering similar items while filtering noise and leveraging overlapped user guidance for cross-domain knowledge transfer, demonstrating superior performance on Amazon datasets.

Authors:Yue Zhou, Yi Chang, Yuan Wu
Title: Mixup Model Merge: Enhancing Model Merging Performance through Randomized Linear Interpolation
Abstract:
Model merging aims to integrate multiple task-specific models into a unified model that inherits the capabilities of the task-specific models, without additional training. Existing model merging methods often lack consideration of the varying contribution ratios of different task-specific models to the final merged model. In this paper, we propose Mixup Model Merge (M3), a simple yet effective method inspired by the randomized linear interpolation strategy from the Mixup data augmentation technique. M3 performs randomized linear interpolation in parameter space between two task-specific LLMs, where interpolation coefficients are sampled from a Beta distribution to explore diverse contribution ratios. This controllable randomness allows M3 to outperform standard equal-ratio merging by discovering better contribution ratio combinations. Extensive experiments show that M3 significantly (1) improves merged LLM performance across tasks, (2) enhances out-of-distribution and adversarial robustness, (3) outperforms the positive effects of the sparsification method DARE on model merging and can be further combined with DARE to achieve superior results, and (4) balances exploration efficiency and diversity in contribution ratios by tuning the Beta distribution's shape parameters. The code is provided in the supplementary materials.
中文: 本文提出Mixup模型合并方法(M3),通过采用Beta分布采样的随机线性插值来优化任务特定模型在合并中的贡献比例,显著提升了性能、鲁棒性和效率,优于传统等比例合并方法。
English: This paper introduces Mixup Model Merge (M3), a method that uses randomized linear interpolation with Beta-distributed coefficients to optimize the contribution ratios of task-specific models during merging, significantly enhancing performance, robustness, and efficiency over standard techniques.

Authors:Siyuan Wang, Enda Zhao, Zhongyu Wei, Xiang Ren
Title: Stepwise Informativeness Search for Efficient and Effective LLM Reasoning
Abstract:
Advances in Large Language Models (LLMs) have significantly improved multi-step reasoning through generating free-text rationales. However, recent studies show that LLMs tend to lose focus over the middle of long contexts. This raises concerns that as reasoning progresses, LLMs may overlook information in earlier steps when decoding subsequent steps, leading to generate unreliable and redundant rationales. To address this, we propose guiding LLMs to generate more accurate and concise step-by-step rationales by (1) proactively referencing information from underutilized prior steps, and (2) minimizing redundant information between new and existing steps. We introduce stepwise informativeness search, an inference-time tree search framework incorporating two selection heuristics: grounding-guided selection which prioritizes steps paying higher attention over underutilized steps; and novelty-guided selection which encourages steps with novel conclusions. During rationale generation, we use a self-grounding strategy that prompts LLMs to explicitly reference relevant prior steps to provide premises before deduction at each step. Experimental results on four reasoning datasets demonstrate that our approach improves reasoning accuracy by generating higher-quality rationales with reduced errors and redundancy.
Chinese: 我们的方法通过引导大语言模型主动参考未被充分利用的先前步骤并减少冗余信息,生成更准确简洁的逐步推理依据,从而提高了推理准确性。
English: Our method enhances multi-step reasoning in LLMs by guiding them to reference underutilized prior steps and minimize redundancy, resulting in more accurate and concise rationales with improved reasoning accuracy.

Authors:Renjie Wei, Songqiang Xu, Linfeng Zhong, Zebin Yang, Qingyu Guo, Yuan Wang, Runsheng Wang, Meng Li
Title: LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and Hardware Co-design
Abstract:
State space models (SSMs) like Mamba have recently attracted much attention. Compared to Transformer-based large language models (LLMs), Mamba achieves linear computation complexity with the sequence length and demonstrates superior performance. However, Mamba is hard to accelerate due to the scattered activation outliers and the complex computation dependency, rendering existing LLM accelerators inefficient. In this paper, we propose LightMamba that co-designs the quantization algorithm and FPGA accelerator architecture for efficient Mamba inference. We first propose an FPGA-friendly post-training quantization algorithm that features rotation-assisted quantization and power-of-two SSM quantization to reduce the majority of computation to 4-bit. We further design an FPGA accelerator that partially unrolls the Mamba computation to balance the efficiency and hardware costs. Through computation reordering as well as fine-grained tiling and fusion, the hardware utilization and memory efficiency of the accelerator get drastically improved. We implement LightMamba on Xilinx Versal VCK190 FPGA and achieve 4.65x to 6.06x higher energy efficiency over the GPU baseline. When evaluated on Alveo U280 FPGA, LightMamba reaches 93 tokens/s, which is 1.43x that of the GPU baseline.
中文: LightMamba通过协同设计量化算法和FPGA加速器架构,显著提升了Mamba的推理效率,在能耗和速度上均优于GPU基准。
English: LightMamba introduces a co-designed quantization algorithm and FPGA accelerator to enhance Mamba's inference efficiency, achieving significant energy savings and speed improvements over GPUs.

Authors:Minjie Hong, Yan Xia, Zehan Wang, Jieming Zhu, Ye Wang, Sihang Cai, Xiaoda Yang, Quanyu Dai, Zhenhua Dong, Zhimeng Zhang, Zhou Zhao
Title: EAGER-LLM: Enhancing Large Language Models as Recommenders through Exogenous Behavior-Semantic Integration
Abstract:
Large language models (LLMs) are increasingly leveraged as foundational backbones in the development of advanced recommender systems, offering enhanced capabilities through their extensive knowledge and reasoning. Existing llm-based recommender systems (RSs) often face challenges due to the significant differences between the linguistic semantics of pre-trained LLMs and the collaborative semantics essential for RSs. These systems use pre-trained linguistic semantics but learn collaborative semantics from scratch via the llm-Backbone. However, LLMs are not designed for recommendations, leading to inefficient collaborative learning, weak result correlations, and poor integration of traditional RS features. To address these challenges, we propose EAGER-LLM, a decoder-only llm-based generative recommendation framework that integrates endogenous and exogenous behavioral and semantic information in a non-intrusive manner. Specifically, we propose 1)dual-source knowledge-rich item indices that integrates indexing sequences for exogenous signals, enabling efficient link-wide processing; 2)non-invasive multiscale alignment reconstruction tasks guide the model toward a deeper understanding of both collaborative and semantic signals; 3)an annealing adapter designed to finely balance the model's recommendation performance with its comprehension capabilities. We demonstrate EAGER-LLM's effectiveness through rigorous testing on three public benchmarks.
中文摘要:大型语言模型越来越多地应用于推荐系统,但在整合协同语义方面面临挑战,因此提出了EAGER-LLM框架,通过创新的索引机制、多尺度对齐任务和退火适配器,有效融合行为与语义信息。
English Summary: Large language models are increasingly used in recommender systems but face challenges in integrating collaborative semantics, leading to the development of EAGER-LLM, a framework that efficiently combines behavioral and semantic information through innovative indexing, alignment tasks, and an annealing adapter.

Authors:Yupeng Chang, Chenlu Guo, Yi Chang, Yuan Wu
Title: LoRA-MGPO: Mitigating Double Descent in Low-Rank Adaptation via Momentum-Guided Perturbation Optimization
Abstract:
Parameter-efficient fine-tuning (PEFT), particularly Low-Rank Adaptation (LoRA), adapts large language models (LLMs) by training only a small fraction of parameters. However, as the rank of the low-rank matrices used for adaptation increases, LoRA often exhibits an unstable "double descent" phenomenon, characterized by transient divergence in the training loss, which delays convergence and impairs generalization by causing instability due to the attraction to sharp local minima. To address this, we introduce LoRA-MGPO, a framework that incorporates Momentum-Guided Perturbation Optimization (MGPO). MGPO stabilizes training dynamics by mitigating the double descent phenomenon and guiding weight perturbations using momentum vectors from the optimizer's state, thus avoiding dual gradient computations. Additionally, an adaptive normalization scheme scales the magnitude of perturbations based on an exponential moving average (EMA) of gradient norms, further enhancing stability. While EMA controls the magnitude of the perturbations, MGPO guides their direction, ensuring a more stable optimization trajectory. Experiments on a suite of natural language understanding and generation benchmarks show that LoRA-MGPO consistently achieves superior performance over LoRA and other PEFT methods. The analysis indicates that LoRA-MGPO leads to smoother loss curves, faster convergence, and improved generalization by stabilizing the training process and mitigating the attraction to sharp minima.
中文: 提出的LoRA-MGPO框架通过动量引导扰动优化技术增强LoRA,在保持效率的同时有效稳定训练过程并提升语言任务性能。
English: The proposed LoRA-MGPO framework enhances LoRA by integrating Momentum-Guided Perturbation Optimization, which stabilizes training and improves performance on language tasks without sacrificing efficiency.

Authors:Chenlu Guo, Yuan Wu, Yi Chang
Title: NLoRA: Nyström-Initiated Low-Rank Adaptation for Large Language Models
Abstract:
Parameter-efficient fine-tuning (PEFT) is essential for adapting large language models (LLMs), with low-rank adaptation (LoRA) being the most popular approach. However, LoRA suffers from slow convergence, and some recent LoRA variants, such as PiSSA, primarily rely on Singular Value Decomposition (SVD) for initialization, leading to expensive computation. To mitigate these problems, we use the Nyström method, which follows a three-matrix manipulation. We first introduce StructuredLoRA (SLoRA), which investigates adding a small intermediate matrix between the low-rank matrices A and B. Secondly, we propose NyströmLoRA (NLoRA), which leverages Nyström-based initialization for SLoRA to improve its effectiveness and efficiency. Finally, we propose IntermediateTune (IntTune), which explores fine-tuning exclusively on the intermediate matrix of NLoRA to further boost LLM efficiency. We evaluate our methods on five natural language generation (NLG) tasks and eight natural language understanding (NLU) tasks. On GSM8K, SLoRA and NLoRA achieve accuracies of 56.48% and 57.70%, surpassing LoRA by 33.52% and 36.41%, with only 3.67 million additional trainable parameters. IntTune improves average NLG performance over LoRA by 7.45% while using only 1.25% of its parameters. These results demonstrate the efficiency and effectiveness of our approach in enhancing model performance with minimal parameter overhead.
中文: 本文提出了三种新颖的参数高效微调方法——SLoRA、NLoRA和IntTune,通过结构化矩阵优化和基于Nyström的初始化技术,在自然语言生成与理解任务中显著提升大语言模型性能,同时大幅降低计算开销和参数使用量。
English: This paper introduces three novel parameter-efficient fine-tuning methods—SLoRA, NLoRA, and IntTune—that leverage structured matrix optimization and Nyström-based initialization to significantly enhance LLM performance on NLG and NLU tasks while minimizing computational overhead and parameter usage.

Authors:Yuxing Cheng, Yi Chang, Yuan Wu
Title: A Survey on Data Contamination for Large Language Models
Abstract:
Recent advancements in Large Language Models (LLMs) have demonstrated significant progress in various areas, such as text generation and code synthesis. However, the reliability of performance evaluation has come under scrutiny due to data contamination-the unintended overlap between training and test datasets. This overlap has the potential to artificially inflate model performance, as LLMs are typically trained on extensive datasets scraped from publicly available sources. These datasets often inadvertently overlap with the benchmarks used for evaluation, leading to an overestimation of the models' true generalization capabilities. In this paper, we first examine the definition and impacts of data contamination. Secondly, we review methods for contamination-free evaluation, focusing on three strategies: data updating-based methods, data rewriting-based methods, and prevention-based methods. Specifically, we highlight dynamic benchmarks and LLM-driven evaluation methods. Finally, we categorize contamination detecting methods based on model information dependency: white-Box, gray-Box, and black-Box detection approaches. Our survey highlights the requirements for more rigorous evaluation protocols and proposes future directions for addressing data contamination challenges.
中文: 本文研究大语言模型评估中的数据污染问题,综述无污染评估方法,并提出检测方案以保障性能测量的可靠性。
English: This paper examines data contamination in LLM evaluation, reviews contamination-free assessment methods, and proposes detection approaches to ensure reliable performance measurement.

Authors:Beatrice Savoldi, Alan Ramponi, Matteo Negri, Luisa Bentivogli
Title: Translation in the Hands of Many:Centering Lay Users in Machine Translation Interactions
Abstract:
Converging societal and technical factors have transformed language technologies into user-facing applications used by the general public across languages. Machine Translation (MT) has become a global tool, with cross-lingual services now also supported by dialogue systems powered by multilingual Large Language Models (LLMs). Widespread accessibility has extended MT's reach to a vast base of lay users, many with little to no expertise in the languages or the technology itself. And yet, the understanding of MT consumed by such a diverse group of users -- their needs, experiences, and interactions with multilingual systems -- remains limited. In our position paper, we first trace the evolution of MT user profiles, focusing on non-experts and how their engagement with technology may shift with the rise of LLMs. Building on an interdisciplinary body of work, we identify three factors -- usability, trust, and literacy -- that are central to shaping user interactions and must be addressed to align MT with user needs. By examining these dimensions, we provide insights to guide the progress of more user-centered MT.
中文: 机器翻译在非专业用户中的广泛使用,要求我们深入了解其需求、体验及交互方式,重点关注可用性、信任度和认知水平,以开发更以用户为中心的系统。
English: The widespread adoption of machine translation by non-expert users necessitates a deeper understanding of their needs, experiences, and interactions, focusing on usability, trust, and literacy to develop more user-centered systems.

Authors:Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Shengjie Ma, Aofan Liu, Hui Xiong, Jian Guo
Title: LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data
Abstract:
Despite the growing development of long-context large language models (LLMs), data-centric approaches relying on synthetic data have been hindered by issues related to faithfulness, which limit their effectiveness in enhancing model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation caused by lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, thus mitigating the need for costly verification processes. We open-source two synthesized datasets, LongFaith-SFT and LongFaith-PO, which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that models fine-tuned on these datasets significantly improve performance. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, showcasing its broad applicability in developing long-context LLMs.
Chinese: LongFaith提出了一种通过整合真实依据和基于引用的提示来合成忠实长上下文推理数据集的流程,无需昂贵验证即可提升模型在推理和问答等任务上的表现。
English: LongFaith introduces a pipeline for creating faithful long-context reasoning datasets by integrating ground truth and citation-based prompts, which enhances model performance on tasks like reasoning and QA without costly verification.

Authors:Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Yunhua Zhou, Xipeng Qiu
Title: Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
Abstract:
The advent of test-time scaling in large language models (LLMs), exemplified by OpenAI's o1 series, has advanced reasoning capabilities by scaling computational resource allocation during inference. While successors like QwQ, Deepseek-R1 (R1) and LIMO replicate these advancements, whether these models truly possess test-time scaling capabilities remains underexplored. This study found that longer CoTs of these o1-like models do not consistently enhance accuracy; in fact, correct solutions are often shorter than incorrect ones for the same questions. Further investigation shows this phenomenon is closely related to models' self-revision capabilities - longer CoTs contain more self-revisions, which often lead to performance degradation. We then compare sequential and parallel scaling strategies on QwQ, R1 and LIMO, finding that parallel scaling achieves better coverage and scalability. Based on these insights, we propose Shortest Majority Vote, a method that combines parallel scaling strategies with CoT length characteristics, significantly improving models' test-time scalability compared to conventional majority voting approaches.
中文摘要:本研究发现类o1大语言模型中更长的思维链会因过多自我修正而降低准确性,并提出结合并行扩展策略的最短多数投票法,显著提升了测试时扩展能力。
English Summary: This study reveals that extended Chain-of-Thought reasoning in o1-like LLMs often reduces accuracy due to excessive self-revisions, and proposes a parallel scaling strategy with Shortest Majority Vote to significantly enhance test-time scalability.

Authors:Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran
Title: Small Models Struggle to Learn from Strong Reasoners
Abstract:
Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models ($\leq$3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone. These findings highlight the limitations of direct strong model distillation and underscore the importance of adapting reasoning complexity for effective reasoning capability transfer.
中文: 小模型在与其学习能力匹配的简短推理链上表现更佳,而混合蒸馏方法通过结合不同复杂度的示例,相比单一训练数据显著提升了小模型的推理性能。
English: Small models perform better with simpler reasoning chains matching their capacity, and Mix Distillation, which combines varied complexity examples, significantly enhances their reasoning compared to using only long or short chains.

Authors:Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran
Title: SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
Abstract:
Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 12 state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning advance. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies-ZeroThink, LessThink, and MoreThink-can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs. To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.
中文摘要:新兴大型推理模型通过长思维链增强推理能力但存在安全隐患,本研究通过评估12个先进模型、提出解码策略并首创SafeChain安全训练数据集,在提升模型安全性的同时保持了推理性能。
English Summary: Emerging large reasoning models (LRMs) enhance reasoning through long chain-of-thought processes but pose safety risks, which this study addresses by evaluating 12 LRMs, proposing decoding strategies, and introducing SafeChain—a safety training dataset that improves safety without compromising reasoning performance.

Authors:Zhijun Li, Kuizhi Liu, Minghui Xu, Xiangyu Wang, Yinbin Miao, Jianfeng Ma, Xiuzhen Cheng
Title: Trinity: A Scalable and Forward-Secure DSSE for Spatio-Temporal Range Query
Abstract:
Cloud-based outsourced Location-based services have profound impacts on various aspects of people's lives but bring security concerns. Existing spatio-temporal data secure retrieval schemes have significant shortcomings regarding dynamic updates, either compromising privacy through leakage during updates (forward insecurity) or incurring excessively high update costs that hinder practical application. Under these circumstances, we first propose a basic filter-based spatio-temporal range query scheme \TrinityI that supports low-cost dynamic updates and automatic expansion. Furthermore, to improve security, reduce storage cost, and false positives, we propose a forward secure and verifiable scheme \TrinityII that simultaneously minimizes storage overhead. A formal security analysis proves that \TrinityI and \TrinityII are Indistinguishable under Selective Chosen-Plaintext Attack (IND-SCPA). Finally, extensive experiments demonstrate that our design \TrinityII significantly reduces storage requirements by 80\%, enables data retrieval at the 1 million-record level in just 0.01 seconds, and achieves 10 $\times$ update efficiency than state-of-art.
Chinese: 作者提出了两种安全的时空范围查询方案 \TrinityI 和 \TrinityII,通过支持低成本动态更新、前向安全性和可验证性,显著降低了存储开销并提升了查询与更新效率。
English: The authors introduce two secure spatio-temporal range query schemes, \TrinityI and \TrinityII, which address dynamic update challenges by offering low-cost updates, forward security, and verifiability while significantly reducing storage and improving efficiency.

Authors:Yingli Shen, Wen Lai, Shuo Wang, Xueren Zhang, Kangyang Luo, Alexander Fraser, Maosong Sun
Title: DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection
Abstract:
The rapid development of multilingual large language models (LLMs) highlights the need for high-quality, diverse, and clean multilingual datasets. In this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus built using newly extracted Common Crawl data and existing multilingual datasets. DCAD-2000 includes over 2,282 languages, 46.72TB of data, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. To overcome the limitations of current data cleaning methods, which rely on manual heuristic thresholds, we propose reframing data cleaning as an anomaly detection task. This dynamic filtering approach significantly enhances data quality by identifying and removing noisy or anomalous content. We evaluate the quality of DCAD-2000 on the FineTask benchmark, demonstrating substantial improvements in multilingual dataset quality and task performance.
中文: 本文提出DCAD-2000大规模多语言语料库,通过将数据清洗重构为异常检测任务,显著提升了涵盖2000多种语言的数据集质量和任务表现。
English: The paper introduces DCAD-2000, a large-scale multilingual corpus that reframes data cleaning as an anomaly detection task to enhance dataset quality and performance across over 2,000 languages.

Authors:Jingnan Gao, Weizhe Liu, Weixuan Sun, Senbo Wang, Xibin Song, Taizhang Shang, Shenzhou Chen, Hongdong Li, Xiaokang Yang, Yichao Yan, Pan Ji
Title: MARS: Mesh AutoRegressive Model for 3D Shape Detailization
Abstract:
State-of-the-art methods for mesh detailization predominantly utilize Generative Adversarial Networks (GANs) to generate detailed meshes from coarse ones. These methods typically learn a specific style code for each category or similar categories without enforcing geometry supervision across different Levels of Detail (LODs). Consequently, such methods often fail to generalize across a broader range of categories and cannot ensure shape consistency throughout the detailization process. In this paper, we introduce MARS, a novel approach for 3D shape detailization. Our method capitalizes on a novel multi-LOD, multi-category mesh representation to learn shape-consistent mesh representations in latent space across different LODs. We further propose a mesh autoregressive model capable of generating such latent representations through next-LOD token prediction. This approach significantly enhances the realism of the generated shapes. Extensive experiments conducted on the challenging 3D Shape Detailization benchmark demonstrate that our proposed MARS model achieves state-of-the-art performance, surpassing existing methods in both qualitative and quantitative assessments. Notably, the model's capability to generate fine-grained details while preserving the overall shape integrity is particularly commendable.
中文: MARS方法通过创新的多细节层次、多类别网格表示和自回归模型,能在不同细节层级间生成具有更高真实感和形状一致性的三维模型,显著优于现有方法。
English: The MARS method introduces a novel multi-LOD, multi-category mesh representation and an autoregressive model to generate detailed 3D shapes with enhanced realism and shape consistency across different levels of detail, outperforming existing approaches.

Authors:Hao Jiang, Cheng Jin, Huangjing Lin, Yanning Zhou, Xi Wang, Jiabo Ma, Li Ding, Jun Hou, Runsheng Liu, Zhizhong Chai, Luyang Luo, Huijuan Shi, Yinling Qian, Qiong Wang, Changzhong Li, Anjia Han, Ronald Cheong Kin Chan, Hao Chen
Title: Generalizable Cervical Cancer Screening via Large-scale Pretraining and Test-Time Adaptation
Abstract:
Cervical cancer is a leading malignancy in female reproductive system. While AI-assisted cytology offers a cost-effective and non-invasive screening solution, current systems struggle with generalizability in complex clinical scenarios. To address this issue, we introduced Smart-CCS, a generalizable Cervical Cancer Screening paradigm based on pretraining and adaptation to create robust and generalizable screening systems. To develop and validate Smart-CCS, we first curated a large-scale, multi-center dataset named CCS-127K, which comprises a total of 127,471 cervical cytology whole-slide images collected from 48 medical centers. By leveraging large-scale self-supervised pretraining, our CCS models are equipped with strong generalization capability, potentially generalizing across diverse scenarios. Then, we incorporated test-time adaptation to specifically optimize the trained CCS model for complex clinical settings, which adapts and refines predictions, improving real-world applicability. We conducted large-scale system evaluation among various cohorts. In retrospective cohorts, Smart-CCS achieved an overall area under the curve (AUC) value of 0.965 and sensitivity of 0.913 for cancer screening on 11 internal test datasets. In external testing, system performance maintained high at 0.950 AUC across 6 independent test datasets. In prospective cohorts, our Smart-CCS achieved AUCs of 0.947, 0.924, and 0.986 in three prospective centers, respectively. Moreover, the system demonstrated superior sensitivity in diagnosing cervical cancer, confirming the accuracy of our cancer screening results by using histology findings for validation. Interpretability analysis with cell and slide predictions further indicated that the system's decision-making aligns with clinical practice. Smart-CCS represents a significant advancement in cancer screening across diverse clinical contexts.
中文: Smart-CCS通过大规模预训练和测试时自适应技术,建立了一个泛化性强的宫颈癌筛查系统,在多个临床数据集中展现出优异的准确性和适用性。
English: Smart-CCS introduces a robust cervical cancer screening paradigm using large-scale pretraining and test-time adaptation, demonstrating high accuracy and generalization across multiple clinical datasets.

Authors:Jiacheng Xu, Bo Pang, Jin Qu, Hiroaki Hayashi, Caiming Xiong, Yingbo Zhou
Title: CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification
Abstract:
Software testing is a critical aspect of software development, yet generating test cases remains a routine task for engineers. This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases under specific conditions. Spanning from simple assertion completions to writing test cases that cover specific code blocks across multiple files, these tasks are based on 12 python repositories, analyzing 845 problems with context lengths ranging from 4k to 128k tokens. Utilizing code testing frameworks, we propose a method to construct retrieval contexts using coverage information. While models exhibit comparable performance with short contexts, notable differences emerge with 16k contexts. Notably, models like GPT-4o and Claude 3.5 can effectively leverage relevant snippets; however, all models score below 35\% on the complex Task III, even with the oracle context provided, underscoring the benchmark's significance and the potential for model improvement. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
中文: 本文提出CLOVER基准,用于评估AI模型在不同复杂度下生成和补全测试用例的能力,结果显示尽管GPT-4o等模型在短上下文表现良好,但所有模型在复杂任务中均表现不佳,凸显了模型改进的巨大空间。
English: This paper introduces CLOVER, a benchmark for evaluating AI models' ability to generate and complete test cases across various complexity levels, revealing that while models like GPT-4o perform well with short contexts, all struggle with complex tasks, highlighting significant room for improvement.

Authors:Yuchen Zhuang, Jingfeng Yang, Haoming Jiang, Xin Liu, Kewei Cheng, Sanket Lokegaonkar, Yifan Gao, Qing Ping, Tianyi Liu, Binxuan Huang, Zheng Li, Zhengyang Wang, Pei Chen, Ruijie Wang, Rongzhi Zhang, Nasser Zalmout, Priyanka Nigam, Bing Yin, Chao Zhang
Title: Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
Abstract:
Due to the scarcity of agent-oriented pre-training data, LLM-based autonomous agents typically rely on complex prompting or extensive fine-tuning, which often fails to introduce new capabilities while preserving strong generalizability. We introduce Hephaestus-Forge, the first large-scale pre-training corpus designed to enhance the fundamental capabilities of LLM agents in API function calling, intrinsic reasoning and planning, and adapting to environmental feedback. Hephaestus-Forge comprises 103B agent-specific data encompassing 76,537 APIs, including both tool documentation to introduce knowledge of API functions and function calling trajectories to strengthen intrinsic reasoning. To explore effective training protocols, we investigate scaling laws to identify the optimal recipe in data mixing ratios. By continual pre-training on Hephaestus-Forge, Hephaestus outperforms small- to medium-scale open-source LLMs and rivals commercial LLMs on three agent benchmarks, demonstrating the effectiveness of our pre-training corpus in enhancing fundamental agentic capabilities and generalization of LLMs to new tasks or environments.
中文: Hephaestus-Forge是首个专为增强LLM智能体核心能力设计的大规模预训练语料库,通过优化数据混合比例显著提升了智能体在API调用、推理规划和环境适应方面的基准表现。
English: Hephaestus-Forge is a pioneering large-scale pre-training corpus designed to enhance LLM agents' capabilities in API function calling, reasoning, and environmental adaptation, significantly improving performance on agent benchmarks through optimized data composition.

Authors:Luca Della Libera, Francesco Paissan, Cem Subakan, Mirco Ravanelli
Title: FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks
Abstract:
Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples, code and checkpoints are available at https://lucadellalib.github.io/focalcodec-web/.
中文: FocalCodec是一种基于焦点调制的高效低码率语音编解码器,仅使用单一二进制码本在0.16-0.65 kbps码率下实现优越的语音处理性能,同时有效保留语义和声学信息。
English: FocalCodec is an efficient low-bitrate speech codec using focal modulation and a single binary codebook, achieving competitive performance in speech tasks at 0.16-0.65 kbps while preserving both semantic and acoustic information.

Authors:Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, Andrea Fanelli
Title: XAttnMark: Learning Robust Audio Watermarking with Cross-Attention
Abstract:
The rapid proliferation of generative audio synthesis and editing technologies has raised significant concerns about copyright infringement, data provenance, and the spread of misinformation through deepfake audio. Watermarking offers a proactive solution by embedding imperceptible, identifiable, and traceable marks into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to achieve both robust detection and accurate attribution simultaneously. This paper introduces Cross-Attention Robust Audio Watermark (XAttnMark), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned temporal-frequency masking loss that captures fine-grained auditory masking effects, enhancing watermark imperceptibility. Our approach achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing with strong editing strength. The project webpage is available at https://liuyixin-louis.github.io/xattnmark/.
中文: 本文提出XAttnMark新型音频水印方法,通过共享参数、交叉注意力机制和心理声学损失,在应对各类音频变换时实现了卓越的鲁棒性和不可感知性,突破了现有技术的局限。
English: This paper introduces XAttnMark, a novel audio watermarking method that overcomes limitations of existing approaches by integrating shared parameters, cross-attention mechanisms, and psychoacoustic loss to achieve superior robustness and imperceptibility against audio transformations.

Authors:Wali Ullah Khan, Chandan Kumar Sheemar, Eva Lagunas, Symeon Chatzinotas
Title: Beyond Diagonal RIS: A New Frontier for 6G Internet of Things Networks
Abstract:
Reconfigurable intelligent surface (RIS) technology has emerged as a promising enabler for next-generation wireless networks, offering a paradigm shift from passive environments to programmable radio wave propagation. Despite the potential of diagonal RIS (D-RIS), its limited wave manipulation capability restricts performance gains. In this paper, we investigate the burgeoning concept of beyond-diagonal RIS (BD-RIS), which incorporates non-diagonal elements in its scattering matrix to deliver more fine-grained control of electromagnetic wavefronts. We begin by discussing the limitations of traditional D-RIS and introduce key BD-RIS architectures with different operating modes. We then highlight the features that make BD-RIS particularly advantageous for 6G IoT applications, including advanced beamforming, enhanced interference mitigation, and flexible coverage. A case study on BD-RIS-assisted vehicle-to-vehicle (V2V) communication in an underlay cellular network demonstrates considerable improvements in spectral efficiency when compared to D-RIS and conventional systems. Lastly, we present current challenges such as hardware design complexity, channel estimation, and non-ideal hardware effects, and propose future research directions involving AI-driven optimization, joint communication and sensing, and physical layer security. Our findings illustrate the transformative potential of BD-RIS in shaping high-performance, scalable, and reliable 6G IoT networks.
Chinese: 超对角智能表面(BD-RIS)通过非对角散射矩阵实现电磁波前精细调控,在提升频谱效率方面展现显著优势,同时为6G物联网应用面临的关键挑战提供了创新解决方案。
English: Beyond-diagonal RIS (BD-RIS) enhances wireless network performance by enabling fine-grained electromagnetic wave control, offering significant gains in spectral efficiency and addressing key challenges for future 6G IoT applications.

Authors:Martin Mundt, Anaelia Ovalle, Felix Friedrich, A Pranav, Subarnaduti Paul, Manuel Brack, Kristian Kersting, William Agnew
Title: The Cake that is Intelligence and Who Gets to Bake it: An AI Analogy and its Implications for Participation
Abstract:
In a widely popular analogy by Turing Award Laureate Yann LeCun, machine intelligence has been compared to cake - where unsupervised learning forms the base, supervised learning adds the icing, and reinforcement learning is the cherry on top. We expand this 'cake that is intelligence' analogy from a simple structural metaphor to the full life-cycle of AI systems, extending it to sourcing of ingredients (data), conception of recipes (instructions), the baking process (training), and the tasting and selling of the cake (evaluation and distribution). Leveraging our re-conceptualization, we describe each step's entailed social ramifications and how they are bounded by statistical assumptions within machine learning. Whereas these technical foundations and social impacts are deeply intertwined, they are often studied in isolation, creating barriers that restrict meaningful participation. Our re-conceptualization paves the way to bridge this gap by mapping where technical foundations interact with social outcomes, highlighting opportunities for cross-disciplinary dialogue. Finally, we conclude with actionable recommendations at each stage of the metaphorical AI cake's life-cycle, empowering prospective AI practitioners, users, and researchers, with increased awareness and ability to engage in broader AI discourse.
Chinese: 本文将Yann LeCun的“AI蛋糕”类比扩展至人工智能系统全生命周期,涵盖从数据采集到成果分发的各个环节,揭示技术基础与社会影响的紧密联系,并提出具体建议以促进跨学科对话和参与。
English: This paper extends Yann LeCun's "AI cake" analogy to encompass the entire lifecycle of AI systems—from data sourcing to distribution—highlighting the intertwined technical and social implications and offering actionable recommendations to foster cross-disciplinary collaboration.

Authors:Walid El Maouaki, Nouhaila Innan, Alberto Marchisio, Taoufik Said, Mohamed Bennai, Muhammad Shafique
Title: QFAL: Quantum Federated Adversarial Learning
Abstract:
Quantum federated learning (QFL) merges the privacy advantages of federated systems with the computational potential of quantum neural networks (QNNs), yet its vulnerability to adversarial attacks remains poorly understood. This work pioneers the integration of adversarial training into QFL, proposing a robust framework, quantum federated adversarial learning (QFAL), where clients collaboratively defend against perturbations by combining local adversarial example generation with federated averaging (FedAvg). We systematically evaluate the interplay between three critical factors: client count (5, 10, 15), adversarial training coverage (0-100%), and adversarial attack perturbation strength (epsilon = 0.01-0.5), using the MNIST dataset. Our experimental results show that while fewer clients often yield higher clean-data accuracy, larger federations can more effectively balance accuracy and robustness when partially adversarially trained. Notably, even limited adversarial coverage (e.g., 20%-50%) can significantly improve resilience to moderate perturbations, though at the cost of reduced baseline performance. Conversely, full adversarial training (100%) may regain high clean accuracy but is vulnerable under stronger attacks. These findings underscore an inherent trade-off between robust and standard objectives, which is further complicated by quantum-specific factors. We conclude that a carefully chosen combination of client count and adversarial coverage is critical for mitigating adversarial vulnerabilities in QFL. Moreover, we highlight opportunities for future research, including adaptive adversarial training schedules, more diverse quantum encoding schemes, and personalized defense strategies to further enhance the robustness-accuracy trade-off in real-world quantum federated environments.
中文摘要:本研究提出量子联邦对抗学习(QFAL)框架,将对抗训练融入量子联邦学习以增强抗攻击鲁棒性,揭示了客户端数量和对抗训练覆盖率影响下精度与稳健性之间的关键权衡关系。
English Summary: This study introduces Quantum Federated Adversarial Learning (QFAL), a framework that integrates adversarial training into quantum federated learning to enhance robustness against attacks, revealing critical trade-offs between accuracy and resilience influenced by client numbers and adversarial coverage.

Authors:Zhixian Zhao, Xinfa Zhu, Xinsheng Wang, Shuiyuan Wang, Xuelong Geng, Wenjie Tian, Lei Xie
Title: Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought
Abstract:
Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signal, performing audio analysis and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C$^2$SER employs a CoT approach, processing SER in a step-by-step manner while leveraging speech content and speaking styles to improve recognition. To further enhance stability, C$^2$SER introduces self-distillation from explicit CoT to implicit CoT, mitigating error accumulation and boosting recognition accuracy. Extensive experiments show that C$^2$SER outperforms existing popular ALMs, such as Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. We release the training code, checkpoints, and test sets to facilitate further research.
中文:C$^2$SER是一种新型音频语言模型,通过整合Whisper和Emotion2Vec-S编码器实现上下文感知,并采用思维链方法结合自蒸馏技术,有效减少幻觉现象,在语音情感识别任务中比现有模型表现出更高的稳定性和准确性。
English: C$^2$SER is a novel audio language model that enhances speech emotion recognition stability and accuracy by integrating contextual perception through Whisper and Emotion2Vec-S encoders, and employing a chain of thought approach with self-distillation to reduce hallucinations and improve performance over existing models.

Authors:Zhenyu Tao, Wei Xu, Xiaohu You
Title: Provable Performance Bounds for Digital Twin-driven Deep Reinforcement Learning in Wireless Networks: A Novel Digital-Twin Bisimulation Metric
Abstract:
Digital twin (DT)-driven deep reinforcement learning (DRL) has emerged as a promising paradigm for wireless network optimization, offering safe and efficient training environment for policy exploration. However, in theory existing methods cannot always guarantee real-world performance of DT-trained policies before actual deployment, due to the absence of a universal metric for assessing DT's ability to support reliable DRL training transferrable to physical networks. In this paper, we propose the DT bisimulation metric (DT-BSM), a novel metric based on the Wasserstein distance, to quantify the discrepancy between Markov decision processes (MDPs) in both the DT and the corresponding real-world wireless network environment. We prove that for any DT-trained policy, the sub-optimality of its performance (regret) in the real-world deployment is bounded by a weighted sum of the DT-BSM and its sub-optimality within the MDP in the DT. Then, a modified DT-BSM based on the total variation distance is also introduced to avoid the prohibitive calculation complexity of Wasserstein distance for large-scale wireless network scenarios. Further, to tackle the challenge of obtaining accurate transition probabilities of the MDP in real world for the DT-BSM calculation, we propose an empirical DT-BSM method based on statistical sampling. We prove that the empirical DT-BSM always converges to the desired theoretical one, and quantitatively establish the relationship between the required sample size and the target level of approximation accuracy. Numerical experiments validate this first theoretical finding on the provable and calculable performance bounds for DT-driven DRL.
中文: 本文提出数字孪生双模拟度量(DT-BSM)来量化数字孪生与真实无线环境之间的差异,为数字孪生训练策略在真实部署中的性能提供了理论边界,并给出了实用的经验计算方法。
English: This paper introduces the DT bisimulation metric (DT-BSM) to quantify the discrepancy between digital twin and real-world wireless environments, providing a theoretical bound for the performance of DT-trained policies in real deployments and offering a practical empirical method for its calculation.

Authors:Tianyi Zhuang, Chuqiao Kuang, Xiaoguang Li, Yihua Teng, Jihao Wu, Yasheng Wang, Lifeng Shang
Title: DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities
Abstract:
We present DocPuzzle, a rigorously constructed benchmark for evaluating long-context reasoning capabilities in large language models (LLMs). This benchmark comprises 100 expert-level QA problems requiring multi-step reasoning over long real-world documents. To ensure the task quality and complexity, we implement a human-AI collaborative annotation-validation pipeline. DocPuzzle introduces an innovative evaluation framework that mitigates guessing bias through checklist-guided process analysis, establishing new standards for assessing reasoning capacities in LLMs. Our evaluation results show that: 1)Advanced slow-thinking reasoning models like o1-preview(69.7%) and DeepSeek-R1(66.3%) significantly outperform best general instruct models like Claude 3.5 Sonnet(57.7%); 2)Distilled reasoning models like DeepSeek-R1-Distill-Qwen-32B(41.3%) falls far behind the teacher model, suggesting challenges to maintain the generalization of reasoning capabilities relying solely on distillation.
Chinese: DocPuzzle 是一个评估大语言模型长文本推理能力的基准,包含 100 个专家级问答问题,采用创新的评估框架减少猜测偏差,结果显示慢思考模型优于通用模型,且蒸馏方法难以保持推理能力的泛化性。
English: DocPuzzle is a benchmark for evaluating long-context reasoning in LLMs, featuring 100 expert-level QA problems and an innovative evaluation framework that reduces guessing bias, with results showing slow-thinking models outperform general ones and distillation struggles to maintain reasoning capabilities.

Authors:Zhenheng Tang, Xiang Liu, Qian Wang, Peijie Dong, Bingsheng He, Xiaowen Chu, Bo Li
Title: The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve?
Abstract:
Motivated by reducing the computational and storage costs of LLMs, model compression and KV cache compression have attracted much attention from researchers. However, current methods predominantly emphasize maintaining the performance of compressed LLMs, as measured by perplexity or simple accuracy on tasks of common sense knowledge QA and basic arithmetic reasoning. In this blog, we present a brief review of recent advancements in LLMs related to retrieval-augmented generation, multi-step reasoning, external tools, and computational expressivity, all of which substantially enhance LLM performance. Then, we propose a lottery LLM hypothesis suggesting that for a given LLM and task, there exists a smaller lottery LLM capable of producing the same performance as the original LLM with the assistance of multi-step reasoning and external tools. Based on the review of current progress in LLMs, we discuss and summarize the essential capabilities that the lottery LLM and KV cache compression must possess, which are currently overlooked in existing methods.
中文摘要:研究者们致力于通过压缩降低大语言模型成本,但忽视了高级任务所需的关键能力,提出彩票大语言模型假说,即借助多步推理和外部工具,较小模型可达到原模型同等性能。
English Summary: Researchers focus on reducing LLM costs through compression but overlook essential capabilities needed for advanced tasks, proposing a lottery LLM hypothesis that smaller models can match original performance with reasoning and tool support.

Authors:Yinan Deng, Bicheng Yao, Yihang Tang, Yi Yang, Yufeng Yue
Title: OpenVox: Real-time Instance-level Open-vocabulary Probabilistic Voxel Representation
Abstract:
In recent years, vision-language models (VLMs) have advanced open-vocabulary mapping, enabling mobile robots to simultaneously achieve environmental reconstruction and high-level semantic understanding. While integrated object cognition helps mitigate semantic ambiguity in point-wise feature maps, efficiently obtaining rich semantic understanding and robust incremental reconstruction at the instance-level remains challenging. To address these challenges, we introduce OpenVox, a real-time incremental open-vocabulary probabilistic instance voxel representation. In the front-end, we design an efficient instance segmentation and comprehension pipeline that enhances language reasoning through encoding captions. In the back-end, we implement probabilistic instance voxels and formulate the cross-frame incremental fusion process into two subtasks: instance association and live map evolution, ensuring robustness to sensor and segmentation noise. Extensive evaluations across multiple datasets demonstrate that OpenVox achieves state-of-the-art performance in zero-shot instance segmentation, semantic segmentation, and open-vocabulary retrieval. Furthermore, real-world robotics experiments validate OpenVox's capability for stable, real-time operation.
Chinese: OpenVox提出了一种实时增量开放词汇概率实例体素表示方法,提升了实例级语义理解和鲁棒重构能力,在机器人应用中实现了最先进的性能。
English: OpenVox introduces a real-time incremental open-vocabulary probabilistic instance voxel representation that enhances instance-level semantic understanding and robust reconstruction, achieving state-of-the-art performance in robotics applications.

Authors:Xinwei Liu, Xiaojun Jia, Yuan Xun, Hua Zhang, Xiaochun Cao
Title: PersGuard: Preventing Malicious Personalization via Backdoor Attacks on Pre-trained Text-to-Image Diffusion Models
Abstract:
Diffusion models (DMs) have revolutionized data generation, particularly in text-to-image (T2I) synthesis. However, the widespread use of personalized generative models raises significant concerns regarding privacy violations and copyright infringement. To address these issues, researchers have proposed adversarial perturbation-based protection techniques. However, these methods have notable limitations, including insufficient robustness against data transformations and the inability to fully eliminate identifiable features of protected objects in the generated output. In this paper, we introduce PersGuard, a novel backdoor-based approach that prevents malicious personalization of specific images. Unlike traditional adversarial perturbation methods, PersGuard implant backdoor triggers into pre-trained T2I models, preventing the generation of customized outputs for designated protected images while allowing normal personalization for unprotected ones. Unfortunately, existing backdoor methods for T2I diffusion models fail to be applied to personalization scenarios due to the different backdoor objectives and the potential backdoor elimination during downstream fine-tuning processes. To address these, we propose three novel backdoor objectives specifically designed for personalization scenarios, coupled with backdoor retention loss engineered to resist downstream fine-tuning. These components are integrated into a unified optimization framework. Extensive experimental evaluations demonstrate PersGuard's effectiveness in preserving data privacy, even under challenging conditions including gray-box settings, multi-object protection, and facial identity scenarios. Our method significantly outperforms existing techniques, offering a more robust solution for privacy and copyright protection.
中文: PersGuard是一种新颖的基于后门的方法,通过引入专门的后门目标和保留损失,防止文本到图像扩散模型中对特定图像的恶意个性化,相比现有技术提供了更优越的隐私保护。
English: PersGuard is a novel backdoor-based method that prevents malicious personalization of specific images in text-to-image diffusion models by introducing specialized backdoor objectives and retention loss, offering superior privacy protection compared to existing techniques.

Authors:Yuji Zhang, Sha Li, Cheng Qian, Jiateng Liu, Pengfei Yu, Chi Han, Yi R. Fung, Kathleen McKeown, Chengxiang Zhai, Manling Li, Heng Ji
Title: The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination
Abstract:
Hallucination is a persistent challenge in large language models (LLMs), where even with rigorous quality control, models often generate distorted facts. This paradox, in which error generation continues despite high-quality training data, calls for a deeper understanding of the underlying LLM mechanisms. To address it, we propose a novel concept: knowledge overshadowing, where model's dominant knowledge can obscure less prominent knowledge during text generation, causing the model to fabricate inaccurate details. Building on this idea, we introduce a novel framework to quantify factual hallucinations by modeling knowledge overshadowing. Central to our approach is the log-linear law, which predicts that the rate of factual hallucination increases linearly with the logarithmic scale of (1) Knowledge Popularity, (2) Knowledge Length, and (3) Model Size. The law provides a means to preemptively quantify hallucinations, offering foresight into their occurrence even before model training or inference. Built on overshadowing effect, we propose a new decoding strategy CoDa, to mitigate hallucinations, which notably enhance model factuality on Overshadow (27.9%), MemoTrap (13.1%) and NQ-Swap (18.3%). Our findings not only deepen understandings of the underlying mechanisms behind hallucinations but also provide actionable insights for developing more predictable and controllable language models.
Chinese: 本研究提出知识遮蔽概念来解释大语言模型中的幻觉现象,构建了量化框架并引入CoDa解码策略,显著降低了多个基准测试中的事实错误率。
English: This study introduces the concept of knowledge overshadowing to explain hallucinations in large language models, proposing a framework to quantify them and a decoding strategy, CoDa, that significantly reduces factual errors across multiple benchmarks.

Authors:Xun Liang, Jiawei Yang, Yezhaohui Wang, Chen Tang, Zifan Zheng, Shichao Song, Zehao Lin, Yebin Yang, Simin Niu, Hanyu Wang, Bo Tang, Feiyu Xiong, Keming Mao, Zhiyu li
Title: SurveyX: Academic Survey Automation via Large Language Models
Abstract:
Large Language Models (LLMs) have demonstrated exceptional comprehension capabilities and a vast knowledge base, suggesting that LLMs can serve as efficient tools for automated survey generation. However, recent research related to automated survey generation remains constrained by some critical limitations like finite context window, lack of in-depth content discussion, and absence of systematic evaluation frameworks. Inspired by human writing processes, we propose SurveyX, an efficient and organized system for automated survey generation that decomposes the survey composing process into two phases: the Preparation and Generation phases. By innovatively introducing online reference retrieval, a pre-processing method called AttributeTree, and a re-polishing process, SurveyX significantly enhances the efficacy of survey composition. Experimental evaluation results show that SurveyX outperforms existing automated survey generation systems in content quality (0.259 improvement) and citation quality (1.76 enhancement), approaching human expert performance across multiple evaluation dimensions. Examples of surveys generated by SurveyX are available on www.surveyx.cn
大型语言模型在自动生成综述方面展现出潜力,但受限于有限的上下文窗口和缺乏系统性评估;提出的SurveyX系统通过两阶段流程和创新方法显著提升了内容与引用质量,接近人类专家水平。
Large Language Models (LLMs) show potential for automated survey generation, but face limitations such as finite context windows and lack of systematic evaluation, which the proposed SurveyX system overcomes through a two-phase process and innovative methods to significantly improve content and citation quality, nearing human expert performance.

Authors:Yansheng Mao, Yufei Xu, Jiaqi Li, Fanxu Meng, Haotong Yang, Zilong Zheng, Xiyuan Wang, Muhan Zhang
Title: LIFT: Improving Long Context Understanding of Large Language Models through Long Input Fine-Tuning
Abstract:
Long context understanding remains challenging for large language models due to their limited context windows. This paper presents Long Input Fine-Tuning (LIFT), a novel framework for long-context modeling that can improve the long-context performance of arbitrary (short-context) LLMs by dynamically adapting model parameters based on the long input. Importantly, LIFT, rather than endlessly extending the context window size to accommodate increasingly longer inputs in context, chooses to store and absorb the long input in parameter. By fine-tuning the long input into model parameters, LIFT allows short-context LLMs to answer questions even when the required information is not provided in the context during inference. Furthermore, to enhance LIFT performance while maintaining the original in-context learning (ICL) capabilities, we introduce Gated Memory, a specialized attention adapter that automatically balances long input memorization and ICL. We provide a comprehensive analysis of the strengths and limitations of LIFT on long context understanding, offering valuable directions for future research.
中文: 本文提出了长输入微调(LIFT)框架,通过将长输入信息融入模型参数并采用门控记忆机制,有效提升短上下文大语言模型的长文本理解能力,同时保持其上下文学习性能。
English: This paper introduces the Long Input Fine-Tuning (LIFT) framework, which enhances long-context understanding in short-context LLMs by fine-tuning inputs into model parameters and incorporating Gated Memory to balance memorization with in-context learning capabilities.

Authors:Haoyu Wang, Tong Teng, Tianyu Guo, An Xiao, Duyu Tang, Hanting Chen, Yunhe Wang
Title: Unshackling Context Length: An Efficient Selective Attention Approach through Query-Key Compression
Abstract:
Handling long-context sequences efficiently remains a significant challenge in large language models (LLMs). Existing methods for token selection in sequence extrapolation either employ a permanent eviction strategy or select tokens by chunk, which may lead to the loss of critical information. We propose Efficient Selective Attention (ESA), a novel approach that extends context length by efficiently selecting the most critical tokens at the token level to compute attention. ESA reduces the computational complexity of token selection by compressing query and key vectors into lower-dimensional representations. We evaluate ESA on long sequence benchmarks with maximum lengths up to 256k using open-source LLMs with context lengths of 8k and 32k. ESA outperforms other selective attention methods, especially in tasks requiring the retrieval of multiple pieces of information, achieving comparable performance to full-attention extrapolation methods across various tasks, with superior results in certain tasks.
中文: 提出的高效选择性注意力(ESA)方法通过令牌级关键令牌选择和压缩查询-键表示的降维处理,解决了大语言模型中的长上下文处理难题,在长序列基准测试中优于现有方法,并在多项任务中达到与全注意力相当甚至更优的性能。
English: The proposed Efficient Selective Attention (ESA) method addresses long-context challenges in LLMs by selecting critical tokens at the token level and reducing computational complexity through compressed query-key representations, outperforming existing methods in long-sequence benchmarks while matching full-attention performance in various tasks.

Authors:Donghao Luo, Yujie Liang, Xu Peng, Xiaobin Hu, Boyuan Jiang, Chengming Xu, Taisong Jin, Chengjie Wang, Yanwei Fu
Title: CrossVTON: Mimicking the Logic Reasoning on Cross-category Virtual Try-on guided by Tri-zone Priors
Abstract:
Despite remarkable progress in image-based virtual try-on systems, generating realistic and robust fitting images for cross-category virtual try-on remains a challenging task. The primary difficulty arises from the absence of human-like reasoning, which involves addressing size mismatches between garments and models while recognizing and leveraging the distinct functionalities of various regions within the model images. To address this issue, we draw inspiration from human cognitive processes and disentangle the complex reasoning required for cross-category try-on into a structured framework. This framework systematically decomposes the model image into three distinct regions: try-on, reconstruction, and imagination zones. Each zone plays a specific role in accommodating the garment and facilitating realistic synthesis. To endow the model with robust reasoning capabilities for cross-category scenarios, we propose an iterative data constructor. This constructor encompasses diverse scenarios, including intra-category try-on, any-to-dress transformations (replacing any garment category with a dress), and dress-to-any transformations (replacing a dress with another garment category). Utilizing the generated dataset, we introduce a tri-zone priors generator that intelligently predicts the try-on, reconstruction, and imagination zones by analyzing how the input garment is expected to align with the model image. Guided by these tri-zone priors, our proposed method, CrossVTON, achieves state-of-the-art performance, surpassing existing baselines in both qualitative and quantitative evaluations. Notably, it demonstrates superior capability in handling cross-category virtual try-on, meeting the complex demands of real-world applications.
中文:CrossVTON受人类认知启发,提出结构化框架将模特图像分解为三个功能区,并通过迭代数据构造器增强跨类别虚拟试穿的推理能力,实现了业界领先的性能表现。
English: CrossVTON introduces a structured framework inspired by human cognition to address cross-category virtual try-on challenges by decomposing model images into three functional zones and utilizing an iterative data constructor for robust reasoning, achieving state-of-the-art performance.

Authors:Jiayin Lan, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin
Title: NLP-AKG: Few-Shot Construction of NLP Academic Knowledge Graph Based on LLM
Abstract:
Large language models (LLMs) have been widely applied in question answering over scientific research papers. To enhance the professionalism and accuracy of responses, many studies employ external knowledge augmentation. However, existing structures of external knowledge in scientific literature often focus solely on either paper entities or domain concepts, neglecting the intrinsic connections between papers through shared domain concepts. This results in less comprehensive and specific answers when addressing questions that combine papers and concepts. To address this, we propose a novel knowledge graph framework that captures deep conceptual relations between academic papers, constructing a relational network via intra-paper semantic elements and inter-paper citation relations. Using a few-shot knowledge graph construction method based on LLM, we develop NLP-AKG, an academic knowledge graph for the NLP domain, by extracting 620,353 entities and 2,271,584 relations from 60,826 papers in ACL Anthology. Based on this, we propose a 'sub-graph community summary' method and validate its effectiveness on three NLP scientific literature question answering datasets.
中文摘要:本文提出了一种新颖的知识图谱框架,通过整合论文实体与领域概念来捕捉科学文献中的深层概念关联,解决了现有方法的局限性,并在自然语言处理领域的问答任务中验证了其有效性。
English Summary: This paper introduces a novel knowledge graph framework that integrates both paper entities and domain concepts to capture deep conceptual relations in scientific literature, addressing limitations of existing methods and demonstrating improved performance on NLP question answering tasks.

Authors:Jie Zou, Mohammad Aliannejadi, Evangelos Kanoulas, Shuxi Han, Heli Ma, Zheng Wang, Yang Yang, Heng Tao Shen
Title: PSCon: Product Search Through Conversations
Abstract:
Conversational Product Search ( CPS ) systems interact with users via natural language to offer personalized and context-aware product lists. However, most existing research on CPS is limited to simulated conversations, due to the lack of a real CPS dataset driven by human-like language. Moreover, existing conversational datasets for e-commerce are constructed for a particular market or a particular language and thus can not support cross-market and multi-lingual usage. In this paper, we propose a CPS data collection protocol and create a new CPS dataset, called PSCon, which assists product search through conversations with human-like language. The dataset is collected by a coached human-human data collection protocol and is available for dual markets and two languages. By formulating the task of CPS, the dataset allows for comprehensive and in-depth research on six subtasks: user intent detection, keyword extraction, system action prediction, question selection, item ranking, and response generation. Moreover, we present a concise analysis of the dataset and propose a benchmark model on the proposed CPS dataset. Our proposed dataset and model will be helpful for facilitating future research on CPS.
Chinese: 本文提出了PSCon这一新型对话式产品搜索数据集,通过人工对话收集,克服了现有模拟对话及单一市场数据集的局限,支持双市场双语环境下的六项子任务研究,并提供了基准模型以推动该领域发展。
English: This paper introduces PSCon, a new conversational product search dataset created through human-human interactions to address the limitations of existing simulated and single-market datasets, enabling comprehensive research across six subtasks and dual markets in two languages.

Authors:Liyang He, Chenglong Liu, Rui Li, Zhenya Huang, Shulan Ruan, Jun Zhou, Enhong Chen
Title: Refining Sentence Embedding Model through Ranking Sentences Generation with Large Language Models
Abstract:
Sentence embedding is essential for many NLP tasks, with contrastive learning methods achieving strong performance using annotated datasets like NLI. Yet, the reliance on manual labels limits scalability. Recent studies leverage large language models (LLMs) to generate sentence pairs, reducing annotation dependency. However, they overlook ranking information crucial for fine-grained semantic distinctions. To tackle this challenge, we propose a method for controlling the generation direction of LLMs in the latent space. Unlike unconstrained generation, the controlled approach ensures meaningful semantic divergence. Then, we refine exist sentence embedding model by integrating ranking information and semantic information. Experiments on multiple benchmarks demonstrate that our method achieves new SOTA performance with a modest cost in ranking sentence synthesis.
中文摘要:我们通过在潜在空间中控制大语言模型生成以融入排序信息的方法,以较低的合成成本在多个基准测试中实现了最先进的句子嵌入性能。
English Summary: Our method enhances sentence embeddings by controlling LLM generation in the latent space to incorporate ranking information, achieving state-of-the-art performance on benchmarks with minimal synthesis cost.

Authors:Jintang Li, Ruofan Wu, Yuchang Zhu, Huizhe Zhang, Liang Chen, Zibin Zheng
Title: Are Large Language Models In-Context Graph Learners?
Abstract:
Large language models (LLMs) have demonstrated remarkable in-context reasoning capabilities across a wide range of tasks, particularly with unstructured inputs such as language or images. However, LLMs struggle to handle structured data, such as graphs, due to their lack of understanding of non-Euclidean structures. As a result, without additional fine-tuning, their performance significantly lags behind that of graph neural networks (GNNs) in graph learning tasks. In this paper, we show that learning on graph data can be conceptualized as a retrieval-augmented generation (RAG) process, where specific instances (e.g., nodes or edges) act as queries, and the graph itself serves as the retrieved context. Building on this insight, we propose a series of RAG frameworks to enhance the in-context learning capabilities of LLMs for graph learning tasks. Comprehensive evaluations demonstrate that our proposed RAG frameworks significantly improve LLM performance on graph-based tasks, particularly in scenarios where a pretrained LLM must be used without modification or accessed via an API.
大语言模型擅长处理非结构化数据,但在处理图结构数据时表现欠佳,为此提出的检索增强生成框架无需微调模型即可显著提升其在图学习任务中的性能。
Large language models excel at processing unstructured data but fall short with structured graphs, prompting the development of retrieval-augmented generation frameworks that significantly boost their performance in graph learning tasks without requiring model fine-tuning.

Authors:Jinhe Bi, Yifan Wang, Danqi Yan, Xun Xiao, Artur Hecker, Volker Tresp, Yunpu Ma
Title: PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection
Abstract:
Visual instruction tuning refines pre-trained Multimodal Large Language Models (MLLMs) to enhance their real-world task performance. However, the rapid expansion of visual instruction datasets introduces significant data redundancy, leading to excessive computational costs. Existing data selection methods predominantly rely on proxy models or loss-based metrics, both of which impose substantial computational overheads due to the necessity of model inference and backpropagation. To address this challenge, we propose PRISM, a novel training-free approach for efficient multimodal data selection. Unlike existing methods, PRISM eliminates the reliance on proxy models, warm-up pretraining, and gradient-based optimization. Instead, it leverages Pearson correlation analysis to quantify the intrinsic visual encoding properties of MLLMs, computing a task-specific correlation score to identify high-value instances. This not only enbles data-efficient selection,but maintains the original performance. Empirical evaluations across multiple MLLMs demonstrate that PRISM reduces the overall time required for visual instruction tuning and data selection to just 30% of conventional methods, while surpassing fully fine-tuned models across eight multimodal and three language understanding benchmarks, achieving a 101.7% relative improvement in final performance.
中文摘要:PRISM是一种无需训练的方法,通过皮尔逊相关性分析高效筛选视觉指令数据,将计算成本降低70%的同时提升模型性能。
English Summary: PRISM is a training-free method that uses Pearson correlation to efficiently select high-value data for visual instruction tuning, reducing computational costs by 70% while improving model performance.

Authors:Feng Li, Yuan Bi, Dianye Huang, Zhongliang Jiang, Nassir Navab
Title: Robotic CBCT Meets Robotic Ultrasound
Abstract:
The multi-modality imaging system offers optimal fused images for safe and precise interventions in modern clinical practices, such as computed tomography - ultrasound (CT-US) guidance for needle insertion. However, the limited dexterity and mobility of current imaging devices hinder their integration into standardized workflows and the advancement toward fully autonomous intervention systems. In this paper, we present a novel clinical setup where robotic cone beam computed tomography (CBCT) and robotic US are pre-calibrated and dynamically co-registered, enabling new clinical applications. This setup allows registration-free rigid registration, facilitating multi-modal guided procedures in the absence of tissue deformation. First, a one-time pre-calibration is performed between the systems. To ensure a safe insertion path by highlighting critical vasculature on the 3D CBCT, SAM2 segments vessels from B-mode images, using the Doppler signal as an autonomously generated prompt. Based on the registration, the Doppler image or segmented vessel masks are then mapped onto the CBCT, creating an optimally fused image with comprehensive detail. To validate the system, we used a specially designed phantom, featuring lesions covered by ribs and multiple vessels with simulated moving flow. The mapping error between US and CBCT resulted in an average deviation of 1.72+-0.62 mm. A user study demonstrated the effectiveness of CBCT-US fusion for needle insertion guidance, showing significant improvements in time efficiency, accuracy, and success rate. Needle intervention performance improved by approximately 50% compared to the conventional US-guided workflow. We present the first robotic dual-modality imaging system designed to guide clinical applications. The results show significant performance improvements compared to traditional manual interventions.
中文: 本文提出了一种新型机器人双模态成像系统,通过预校准实现CBCT与超声的无配准融合,在针穿刺干预中展现出约50%的性能提升,显著提高了临床操作的准确性和效率。
English: This paper introduces a novel robotic dual-modality imaging system combining CBCT and ultrasound with pre-calibration for registration-free fusion, demonstrating approximately 50% improvement in needle intervention performance through enhanced accuracy and efficiency in clinical procedures.

Authors:Runxuan Liu, Bei Luo, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin
Title: Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question Answering
Abstract:
Large language models (LLMs) have shown remarkable capabilities in natural language processing. However, in knowledge graph question answering tasks (KGQA), there remains the issue of answering questions that require multi-hop reasoning. Existing methods rely on entity vector matching, but the purpose of the question is abstract and difficult to match with specific entities. As a result, it is difficult to establish reasoning paths to the purpose, which leads to information loss and redundancy. To address this issue, inspired by human reverse thinking, we propose Ontology-Guided Reverse Thinking (ORT), a novel framework that constructs reasoning paths from purposes back to conditions. ORT operates in three key phases: (1) using LLM to extract purpose labels and condition labels, (2) constructing label reasoning paths based on the KG ontology, and (3) using the label reasoning paths to guide knowledge retrieval. Experiments on the WebQSP and CWQ datasets show that ORT achieves state-of-the-art performance and significantly enhances the capability of LLMs for KGQA.
中文摘要:提出的本体引导逆向思维框架通过从目标反向构建至条件的推理路径,解决了知识图谱问答中的多跳推理难题,在基准数据集上实现了最优性能。
English Summary: The proposed Ontology-Guided Reverse Thinking (ORT) framework addresses multi-hop reasoning challenges in KGQA by constructing backward reasoning paths from purposes to conditions, achieving state-of-the-art performance on benchmark datasets.

Authors:Cheng Qian, Emre Can Acikgoz, Hongru Wang, Xiusi Chen, Avirup Sil, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji
Title: SMART: Self-Aware Agent for Tool Overuse Mitigation
Abstract:
Current Large Language Model (LLM) agents demonstrate strong reasoning and tool use capabilities, but often lack self-awareness, failing to balance these approaches effectively. This imbalance leads to Tool Overuse, where models unnecessarily rely on external tools for tasks solvable with parametric knowledge, increasing computational overhead. Inspired by human metacognition, we introduce SMART (Strategic Model-Aware Reasoning with Tools), a paradigm that enhances an agent's self-awareness to optimize task handling and reduce tool overuse. To support this paradigm, we introduce SMART-ER, a dataset spanning three domains, where reasoning alternates between parametric knowledge and tool-dependent steps, with each step enriched by rationales explaining when tools are necessary. Through supervised training, we develop SMARTAgent, a family of models that dynamically balance parametric knowledge and tool use. Evaluations show that SMARTAgent reduces tool use by 24% while improving performance by over 37%, enabling 7B-scale models to match its 70B counterpart and GPT-4o. Additionally, SMARTAgent generalizes to out-of-distribution test data like GSM8K and MINTQA, maintaining accuracy with just one-fifth the tool calls. These highlight the potential of strategic tool use to enhance reasoning, mitigate overuse, and bridge the gap between model size and performance, advancing intelligent and resource-efficient agent designs.
中文摘要:SMART范式通过增强大语言模型代理的自我意识,使其能够策略性地平衡参数知识与工具使用,将工具调用减少24%的同时提升性能超过37%,并使较小模型达到与大型模型相当的能力水平。
English Summary: The SMART paradigm enhances LLM agents' self-awareness to strategically balance parametric knowledge and tool use, reducing tool overuse by 24% while improving performance by over 37% and enabling smaller models to match larger counterparts' capabilities.

Authors:Zhenheng Tang, Zichen Tang, Junlin Huang, Xinglin Pan, Rudan Yan, Yuxin Wang, Amelie Chi Zhou, Shaohuai Shi, Xiaowen Chu, Bo Li
Title: DreamDDP: Accelerating Data Parallel Distributed LLM Training with Layer-wise Scheduled Partial Synchronization
Abstract:
The growth of large language models (LLMs) increases challenges of accelerating distributed training across multiple GPUs in different data centers. Moreover, concerns about data privacy and data exhaustion have heightened interest in geo-distributed data centers. Communication in geo-distributed data parallel training (DDP) with stochastic gradient descent (S-SGD) is the main bottleneck in low-bandwidth environments. Local SGD mitigates communication overhead by reducing synchronization frequency, and recent studies have successfully applied it to geo-distributedly pre-train LLMs. However, we identify that its model synchronization mechanism prevents overlapping communication and computation, which makes the system lose opportunities to overlap communication and computation. To overcome this limitation, we expand the design space of local SGD by layer-wisely decoupling model synchronization. In each iteration, only some layers are synchronized instead of the entire model after a specific number of iterations. Leveraging this methodology, we introduce DreamDDP, a training framework to accelerate low-bandwidth distributed training with three key innovations: (1) partial local SGD with theoretical assurances of convergence rates comparable to S-SGD; (2) overlapping parameter synchronization with computation without extra GPU memory occupation; (3) identifying and exploiting three properties to schedule the communication and computation to reduce the training time based on fine-grained profiling of layer-wise communication and computation time. Empirical evaluations conducted on 32 GPUs using prominent deep learning models, including ResNet-18, ResNet-50, GPT-2, and Llama-2, demonstrate that DreamDDP enhances the convergence properties of Local SGD (and Adam) and achieves speedups ranging from $1.49\times$ to $3.91\times$ over leading baseline methods.
中文:提出的DreamDDP框架通过引入分层同步机制实现通信与计算重叠,在保证收敛性的同时将地理分布式训练速度最高提升3.91倍。
English: The proposed DreamDDP framework accelerates geo-distributed training by introducing layer-wise synchronization that enables communication-computation overlap, achieving up to 3.91× speedup while maintaining convergence guarantees.

Authors:Yunzhuo Chen, Jordan Vice, Naveed Akhtar, Nur Al Hasan Haldar, Ajmal Mian
Title: Image Watermarking of Generative Diffusion Models
Abstract:
Embedding watermarks into the output of generative models is essential for establishing copyright and verifiable ownership over the generated content. Emerging diffusion model watermarking methods either embed watermarks in the frequency domain or offer limited versatility of the watermark patterns in the image space, which allows simplistic detection and removal of the watermarks from the generated content. To address this issue, we propose a watermarking technique that embeds watermark features into the diffusion model itself. Our technique enables training of a paired watermark extractor for a generative model that is learned through an end-to-end process. The extractor forces the generator, during training, to effectively embed versatile, imperceptible watermarks in the generated content while simultaneously ensuring their precise recovery. We demonstrate highly accurate watermark embedding/detection and show that it is also possible to distinguish between different watermarks embedded with our method to differentiate between generative models.
中文: 本文提出了一种新颖的水印技术,将多样且不可察觉的水印直接嵌入扩散模型中,通过端到端训练的提取器实现精确恢复,并能区分不同生成模型。
English: This paper introduces a novel watermarking technique that embeds versatile and imperceptible watermarks directly into the diffusion model, enabling precise recovery and differentiation between generative models through an end-to-end trained extractor.

Authors:Hongye Cao, Fan Feng, Tianpei Yang, Jing Huo, Yang Gao
Title: Causal Information Prioritization for Efficient Reinforcement Learning
Abstract:
Current Reinforcement Learning (RL) methods often suffer from sample-inefficiency, resulting from blind exploration strategies that neglect causal relationships among states, actions, and rewards. Although recent causal approaches aim to address this problem, they lack grounded modeling of reward-guided causal understanding of states and actions for goal-orientation, thus impairing learning efficiency. To tackle this issue, we propose a novel method named Causal Information Prioritization (CIP) that improves sample efficiency by leveraging factored MDPs to infer causal relationships between different dimensions of states and actions with respect to rewards, enabling the prioritization of causal information. Specifically, CIP identifies and leverages causal relationships between states and rewards to execute counterfactual data augmentation to prioritize high-impact state features under the causal understanding of the environments. Moreover, CIP integrates a causality-aware empowerment learning objective, which significantly enhances the agent's execution of reward-guided actions for more efficient exploration in complex environments. To fully assess the effectiveness of CIP, we conduct extensive experiments across 39 tasks in 5 diverse continuous control environments, encompassing both locomotion and manipulation skills learning with pixel-based and sparse reward settings. Experimental results demonstrate that CIP consistently outperforms existing RL methods across a wide range of scenarios.
中文: 提出的因果信息优先方法通过识别状态与奖励间的因果关系来优先处理关键信息并进行反事实数据增强,在39项任务中的实验验证了该方法在各种场景下均优于现有强化学习方法。
English: The proposed Causal Information Prioritization (CIP) method enhances reinforcement learning efficiency by identifying causal relationships between states and rewards to prioritize impactful information and enable counterfactual data augmentation, with experiments across 39 tasks confirming its consistent superiority over existing approaches.

Authors:Hongye Cao, Fan Feng, Meng Fang, Shaokang Dong, Tianpei Yang, Jing Huo, Yang Gao
Title: Towards Empowerment Gain through Causal Structure Learning in Model-Based RL
Abstract:
In Model-Based Reinforcement Learning (MBRL), incorporating causal structures into dynamics models provides agents with a structured understanding of the environments, enabling efficient decision. Empowerment as an intrinsic motivation enhances the ability of agents to actively control their environments by maximizing the mutual information between future states and actions. We posit that empowerment coupled with causal understanding can improve controllability, while enhanced empowerment gain can further facilitate causal reasoning in MBRL. To improve learning efficiency and controllability, we propose a novel framework, Empowerment through Causal Learning (ECL), where an agent with the awareness of causal dynamics models achieves empowerment-driven exploration and optimizes its causal structure for task learning. Specifically, ECL operates by first training a causal dynamics model of the environment based on collected data. We then maximize empowerment under the causal structure for exploration, simultaneously using data gathered through exploration to update causal dynamics model to be more controllable than dense dynamics model without causal structure. In downstream task learning, an intrinsic curiosity reward is included to balance the causality, mitigating overfitting. Importantly, ECL is method-agnostic and is capable of integrating various causal discovery methods. We evaluate ECL combined with 3 causal discovery methods across 6 environments including pixel-based tasks, demonstrating its superior performance compared to other causal MBRL methods, in terms of causal discovery, sample efficiency, and asymptotic performance.
中文: 提出的因果学习赋权(ECL)框架将因果动态模型与基于赋权的探索相结合,在模型强化学习中展现出在因果发现、样本效率和任务性能方面的卓越表现。
English: The proposed Empowerment through Causal Learning (ECL) framework integrates causal dynamics models with empowerment-driven exploration in model-based reinforcement learning, demonstrating superior performance in causal discovery, sample efficiency, and task performance across diverse environments.

Authors:Guhao Feng, Yihan Geng, Jian Guan, Wei Wu, Liwei Wang, Di He
Title: Theoretical Benefit and Limitation of Diffusion Language Model
Abstract:
Diffusion language models have emerged as a promising approach for text generation. One would naturally expect this method to be an efficient replacement for autoregressive models since multiple tokens can be sampled in parallel during each diffusion step. However, its efficiency-accuracy trade-off is not yet well understood. In this paper, we present a rigorous theoretical analysis of a widely used type of diffusion language model, the Masked Diffusion Model (MDM), and find that its effectiveness heavily depends on the target evaluation metric. Under mild conditions, we prove that when using perplexity as the metric, MDMs can achieve near-optimal perplexity in sampling steps regardless of sequence length, demonstrating that efficiency can be achieved without sacrificing performance. However, when using the sequence error rate--which is important for understanding the "correctness" of a sequence, such as a reasoning chain--we show that the required sampling steps must scale linearly with sequence length to obtain "correct" sequences, thereby eliminating MDM's efficiency advantage over autoregressive models. Our analysis establishes the first theoretical foundation for understanding the benefits and limitations of MDMs. All theoretical findings are supported by empirical studies.
中文摘要:扩散语言模型能以较少的采样步骤达到接近最优的困惑度,但在保持低序列错误率时需要随序列长度线性增加采样步骤,从而削弱了其相对于自回归模型的效率优势。
English Summary: Diffusion language models can achieve near-optimal perplexity with few sampling steps, but require linearly increasing steps with sequence length to maintain low sequence error rates, limiting their efficiency advantage over autoregressive models.

Authors:Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
Title: EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
Abstract:
Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 24 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that: MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9\% on average. EmbodiedBench provides a multifaceted standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code and dataset are available at https://embodiedbench.github.io.
Chinese: EmbodiedBench作为一个全面的基准被提出,用于评估视觉驱动的具身智能体,发现多模态大语言模型擅长高层次任务但在低层次操作上表现不佳,最优模型的平均得分仅为28.9%。
English: EmbodiedBench is introduced as a comprehensive benchmark to evaluate vision-driven embodied agents, revealing that MLLMs excel in high-level tasks but struggle with low-level manipulation, with the top model achieving only a 28.9% average score.

Authors:Yunzhuo Chen, Naveed Akhtar, Nur Al Hasan Haldar, Ajmal Mian
Title: Dynamic watermarks in images generated by diffusion models
Abstract:
High-fidelity text-to-image diffusion models have revolutionized visual content generation, but their widespread use raises significant ethical concerns, including intellectual property protection and the misuse of synthetic media. To address these challenges, we propose a novel multi-stage watermarking framework for diffusion models, designed to establish copyright and trace generated images back to their source. Our multi-stage watermarking technique involves embedding: (i) a fixed watermark that is localized in the diffusion model's learned noise distribution and, (ii) a human-imperceptible, dynamic watermark in generates images, leveraging a fine-tuned decoder. By leveraging the Structural Similarity Index Measure (SSIM) and cosine similarity, we adapt the watermark's shape and color to the generated content while maintaining robustness. We demonstrate that our method enables reliable source verification through watermark classification, even when the dynamic watermark is adjusted for content-specific variations. Source model verification is enabled through watermark classification. o support further research, we generate a dataset of watermarked images and introduce a methodology to evaluate the statistical impact of watermarking on generated content.Additionally, we rigorously test our framework against various attack scenarios, demonstrating its robustness and minimal impact on image quality. Our work advances the field of AI-generated content security by providing a scalable solution for model ownership verification and misuse prevention.
中文: 本文提出了一种针对扩散模型的多阶段水印框架,通过嵌入固定和动态水印实现可靠的来源验证与版权保护,在抵御多种攻击的同时保持图像质量。
English: This paper introduces a multi-stage watermarking framework for diffusion models that embeds both fixed and dynamic watermarks to enable robust source verification and copyright protection while maintaining image quality against various attacks.

Authors:Bin Wu, Yihang Wang, Yuanhao Zeng, Jiawei Liu, Jiashu Zhao, Cheng Yang, Yawen Li, Long Xia, Dawei Yin, Chuan Shi
Title: Graph Foundation Models for Recommendation: A Comprehensive Survey
Abstract:
Recommender systems (RS) serve as a fundamental tool for navigating the vast expanse of online information, with deep learning advancements playing an increasingly important role in improving ranking accuracy. Among these, graph neural networks (GNNs) excel at extracting higher-order structural information, while large language models (LLMs) are designed to process and comprehend natural language, making both approaches highly effective and widely adopted. Recent research has focused on graph foundation models (GFMs), which integrate the strengths of GNNs and LLMs to model complex RS problems more efficiently by leveraging the graph-based structure of user-item relationships alongside textual understanding. In this survey, we provide a comprehensive overview of GFM-based RS technologies by introducing a clear taxonomy of current approaches, diving into methodological details, and highlighting key challenges and future directions. By synthesizing recent advancements, we aim to offer valuable insights into the evolving landscape of GFM-based recommender systems.
中文: 本综述系统探讨了融合图神经网络与大语言模型的图基础模型在推荐系统中的应用,通过方法分类揭示了技术挑战与发展方向。
English: This survey comprehensively examines graph foundation models (GFMs) that integrate graph neural networks and large language models to enhance recommender systems, presenting a taxonomy of methods while addressing challenges and future directions.

Authors:Wei Cheng, Yucheng Lu, Boyang Xia, Jiangxia Cao, Kuan Xu, Mingxing Wen, Wei Jiang, Jiaming Zhang, Zhaojie Liu, Liyin Hong, Kun Gai, Guorui Zhou
Title: ChorusCVR: Chorus Supervision for Entire Space Post-Click Conversion Rate Modeling
Abstract:
Post-click conversion rate (CVR) estimation is a vital task in many recommender systems of revenue businesses, e.g., e-commerce and advertising. In a perspective of sample, a typical CVR positive sample usually goes through a funnel of exposure to click to conversion. For lack of post-event labels for un-clicked samples, CVR learning task commonly only utilizes clicked samples, rather than all exposed samples as for click-through rate (CTR) learning task. However, during online inference, CVR and CTR are estimated on the same assumed exposure space, which leads to a inconsistency of sample space between training and inference, i.e., sample selection bias (SSB). To alleviate SSB, previous wisdom proposes to design novel auxiliary tasks to enable the CVR learning on un-click training samples, such as CTCVR and counterfactual CVR, etc. Although alleviating SSB to some extent, none of them pay attention to the discrimination between ambiguous negative samples (un-clicked) and factual negative samples (clicked but un-converted) during modelling, which makes CVR model lacks robustness. To full this gap, we propose a novel ChorusCVR model to realize debiased CVR learning in entire-space.
中文: 摘要探讨了点击后转化率估计中的样本选择偏差问题,并提出了ChorusCVR模型,该模型通过区分模糊负样本和事实负样本,实现了在全样本空间的无偏CVR学习。
English: The abstract discusses the issue of sample selection bias in post-click conversion rate estimation and introduces ChorusCVR, a novel model designed to achieve unbiased CVR learning across the entire sample space by addressing the discrimination between ambiguous and factual negative samples.

Authors:Yunzhuo Chen, Naveed Akhtar, Nur Al Hasan Haldar, Ajmal Mian
Title: Deepfake Detection with Spatio-Temporal Consistency and Attention
Abstract:
Deepfake videos are causing growing concerns among communities due to their ever-increasing realism. Naturally, automated detection of forged Deepfake videos is attracting a proportional amount of interest of researchers. Current methods for detecting forged videos mainly rely on global frame features and under-utilize the spatio-temporal inconsistencies found in the manipulated videos. Moreover, they fail to attend to manipulation-specific subtle and well-localized pattern variations along both spatial and temporal dimensions. Addressing these gaps, we propose a neural Deepfake detector that focuses on the localized manipulative signatures of the forged videos at individual frame level as well as frame sequence level. Using a ResNet backbone, it strengthens the shallow frame-level feature learning with a spatial attention mechanism. The spatial stream of the model is further helped by fusing texture enhanced shallow features with the deeper features. Simultaneously, the model processes frame sequences with a distance attention mechanism that further allows fusion of temporal attention maps with the learned features at the deeper layers. The overall model is trained to detect forged content as a classifier. We evaluate our method on two popular large data sets and achieve significant performance over the state-of-the-art methods.Moreover, our technique also provides memory and computational advantages over the competitive techniques.
中文: 本文提出一种神经Deepfake检测器,通过聚焦伪造视频在单帧和帧序列中的局部篡改特征,结合空间与时序注意力机制,在提升检测性能的同时兼具计算效率优势。
English: This paper introduces a neural Deepfake detector that enhances detection by focusing on localized manipulative signatures in both individual frames and frame sequences, using spatial and temporal attention mechanisms to outperform existing methods with improved efficiency.

Authors:Jiahao You, Ziye Jia, Chao Dong, Qihui Wu, Zhu Han
Title: Generative AI-Enhanced Cooperative MEC of UAVs and Ground Stations for Unmanned Surface Vehicles
Abstract:
The increasing deployment of unmanned surface vehicles (USVs) require computational support and coverage in applications such as maritime search and rescue. Unmanned aerial vehicles (UAVs) can offer low-cost, flexible aerial services, and ground stations (GSs) can provide powerful supports, which can cooperate to help the USVs in complex scenarios. However, the collaboration between UAVs and GSs for USVs faces challenges of task uncertainties, USVs trajectory uncertainties, heterogeneities, and limited computational resources. To address these issues, we propose a cooperative UAV and GS based robust multi-access edge computing framework to assist USVs in completing computational tasks. Specifically, we formulate the optimization problem of joint task offloading and UAV trajectory to minimize the total execution time, which is in the form of mixed integer nonlinear programming and NP-hard to tackle. Therefore, we propose the algorithm of generative artificial intelligence-enhanced heterogeneous agent proximal policy optimization (GAI-HAPPO). The proposed algorithm integrates GAI models to enhance the actor network ability to model complex environments and extract high-level features, thereby allowing the algorithm to predict uncertainties and adapt to dynamic conditions. Additionally, GAI stabilizes the critic network, addressing the instability of multi-agent reinforcement learning approaches. Finally, extensive simulations demonstrate that the proposed algorithm outperforms the existing benchmark methods, thus highlighting the potentials in tackling intricate, cross-domain issues in the considered scenarios.
中文摘要:本文提出了一种基于无人机与地面站协作的鲁棒多接入边缘计算框架,通过生成式人工智能增强的异构智能体近端策略优化算法,有效解决了海上无人艇在复杂场景中的计算资源协同挑战。
English Summary: This paper introduces a robust multi-access edge computing framework using cooperative UAVs and ground stations to assist unmanned surface vehicles with computational tasks, employing a generative AI-enhanced algorithm that outperforms existing methods in dynamic maritime scenarios.

Authors:Hao Lin, Mustafa A. Kishk, Mohamed-Slim Alouini
Title: Performance Analysis of Infrastructure Sharing Techniques in Cellular Networks: A Percolation Theory Approach
Abstract:
In the context of 5G, infrastructure sharing has been identified as a potential solution to reduce the investment costs of cellular networks. In particular, it can help low-income regions build 5G networks more affordably and further bridge the digital divide. There are two main kinds of infrastructure sharing: passive sharing (i.e. site sharing) and active sharing (i.e. access sharing), which require mobile network operators (MNOs) to share their non-electronic elements or electronic elements, respectively. Because co-construction and sharing can achieve broader coverage with lower investment, through percolation theory, we investigate how different sharing strategies can deliver large-scale continuous services. First, we examine the percolation characteristics in signal-to-interference-plus-noise ratio (SINR) coverage graphs and the necessary conditions for percolation. Second, we propose an 'average coverage radius' to approximate the SINR graph with a low base station (BS) density based on the Gilbert disk model. Finally, we estimate the critical conditions of BS densities of MNOs for different sharing strategies and compare the percolation probabilities under different infrastructure sharing strategies.
中文摘要:5G基础设施共享通过被动和主动共享策略降低网络投资成本并缩小数字鸿沟,利用渗流理论分析不同共享模式下基站密度的临界条件,以实现大规模连续服务覆盖。
English Summary: Infrastructure sharing in 5G networks reduces investment costs and bridges the digital divide by enabling broader coverage through passive and active sharing strategies, with percolation theory used to analyze critical base station densities for continuous service delivery.

Authors:Hao Lin, Mustafa A. Kishk, Mohamed-Slim Alouini
Title: Connectivity of LEO Satellite Mega Constellations: An Application of Percolation Theory on a Sphere
Abstract:
With the advent of the 6G era, global connectivity has become a common goal in the evolution of communications, aiming to bring Internet services to more unconnected regions. Additionally, the rise of applications such as the Internet of Everything and remote education also requires global connectivity. Non-terrestrial networks (NTN), particularly low earth orbit (LEO) satellites, play a crucial role in this future vision. Although some literature already analyze the coverage performance using stochastic geometry, the ability of generating large-scale continuous service area is still expected to analyze. Therefore, in this paper, we mainly investigate the necessary conditions of LEO satellite deployment for large-scale continuous service coverage on the earth. Firstly, we apply percolation theory to a closed spherical surface and define the percolation on a sphere for the first time. We introduce the sub-critical and super-critical cases to prove the existence of the phase transition of percolation probability. Then, through stereographic projection, we introduce the tight bounds and closed-form expression of the critical number of LEO satellites on the same constellation. In addition, we also investigate how the altitude and maximum slant range of LEO satellites affect percolation probability, and derive the critical values of them. Based on our findings, we provide useful recommendations for companies planning to deploy LEO satellite networks to enhance connectivity.
中文摘要:本文应用渗流理论首次定义了球面渗流现象,推导出低轨卫星实现全球连续服务覆盖的关键部署条件,为卫星网络建设提供了重要参考。
English Summary: This paper explores the deployment requirements for LEO satellites to achieve large-scale continuous service coverage using percolation theory, deriving critical parameters and offering practical recommendations for network planning.

Authors:Hao Lin, Ainur Zhaikhan, Mustafa A. Kishk, Hesham ElSawy, Mohamed-Slim Alouini
Title: Energy-as-a-Service for RF-Powered IoE Networks: A Percolation Theory Approach
Abstract:
Due to the involved massive number of devices, radio frequency (RF) energy harvesting is indispensable to realize the foreseen Internet-of-Everything (IoE) within 6G networks. Analogous to the cellular networks concept, shared energy stations (ESs) are foreseen to supply energy-as-a-service (EaaS) in order to recharge devices that belong to different IoE operators who are offering diverse use cases. Considering the capital expenditure (CAPEX) for ES deployment along with their finite wireless energy transfer (WET) zones, spatial energy gaps are plausible. Furthermore, the ESs deployment cannot cover 100% of the energy-harvesting devices of all coexisting IoE use cases. In this context, we utilize percolation theory to characterize the feasibility of large-scale device-to-device (D2D) connectivity of IoE networks operating under EaaS platforms. Assuming that ESs and IoE devices follow independent Poisson point processes (PPPs), we construct a connectivity graph for the IoE devices that are within the WET zones of ESs. Continuum percolation on the construct graph is utilized to derive necessary and sufficient conditions for large-scale RF-powered D2D connectivity in terms of the required IoE device density and communication range along with the required ESs density and WET zone size. Fixing the IoE network parameters along with the size of WET zones, we obtain the approximate critical value of the ES density that ensures large-scale connectivity using the inner-city and Gilbert disk models. By imitating the bounds and combining the approximations, we construct an approximate expression for the critical ES density function, which is necessary to minimize the EaaS CAPEX under the IoE connectivity constraint.
中文: 射频能量收集对实现6G万物互联至关重要,本研究应用渗流理论确定了在保证大规模设备连接的同时最小化部署成本所需的共享能量站临界密度。
English: RF energy harvesting is essential for 6G Internet-of-Everything networks, and this study uses percolation theory to determine the critical density of shared energy stations needed to ensure large-scale device connectivity while minimizing deployment costs.

Authors:Lotfi Abdelkrim Mecharbat, Alberto Marchisio, Muhammad Shafique, Mohammad M. Ghassemi, Tuka Alhanai
Title: MoENAS: Mixture-of-Expert based Neural Architecture Search for jointly Accurate, Fair, and Robust Edge Deep Neural Networks
Abstract:
There has been a surge in optimizing edge Deep Neural Networks (DNNs) for accuracy and efficiency using traditional optimization techniques such as pruning, and more recently, employing automatic design methodologies. However, the focus of these design techniques has often overlooked critical metrics such as fairness, robustness, and generalization. As a result, when evaluating SOTA edge DNNs' performance in image classification using the FACET dataset, we found that they exhibit significant accuracy disparities (14.09%) across 10 different skin tones, alongside issues of non-robustness and poor generalizability. In response to these observations, we introduce Mixture-of-Experts-based Neural Architecture Search (MoENAS), an automatic design technique that navigates through a space of mixture of experts to discover accurate, fair, robust, and general edge DNNs. MoENAS improves the accuracy by 4.02% compared to SOTA edge DNNs and reduces the skin tone accuracy disparities from 14.09% to 5.60%, while enhancing robustness by 3.80% and minimizing overfitting to 0.21%, all while keeping model size close to state-of-the-art models average size (+0.4M). With these improvements, MoENAS establishes a new benchmark for edge DNN design, paving the way for the development of more inclusive and robust edge DNNs.
中文: 当前边缘深度神经网络优化方法过于关注精度与效率,却忽视了公平性与鲁棒性,导致不同肤色间性能差异显著;提出的MoENAS技术通过提升精度、缩小差异并增强鲁棒性,同时保持模型规模,有效解决了这些问题。
English: Current edge DNN optimization methods prioritize accuracy and efficiency but neglect fairness and robustness, leading to significant performance disparities across skin tones; the proposed MoENAS technique addresses these issues by improving accuracy, reducing disparities, and enhancing robustness while maintaining model size.

Authors:Yao Wei, Matteo Toso, Pietro Morerio, Michael Ying Yang, Alessio Del Bue
Title: Functional 3D Scene Synthesis through Human-Scene Optimization
Abstract:
This paper presents a novel generative approach that outputs 3D indoor environments solely from a textual description of the scene. Current methods often treat scene synthesis as a mere layout prediction task, leading to rooms with overlapping objects or overly structured scenes, with limited consideration of the practical usability of the generated environment. Instead, our approach is based on a simple, but effective principle: we condition scene synthesis to generate rooms that are usable by humans. This principle is implemented by synthesizing 3D humans that interact with the objects composing the scene. If this human-centric scene generation is viable, the room layout is functional and it leads to a more coherent 3D structure. To this end, we propose a novel method for functional 3D scene synthesis, which consists of reasoning, 3D assembling and optimization. We regard text guided 3D synthesis as a reasoning process by generating a scene graph via a graph diffusion network. Considering object functional co-occurrence, a new strategy is designed to better accommodate human-object interaction and avoidance, achieving human-aware 3D scene optimization. We conduct both qualitative and quantitative experiments to validate the effectiveness of our method in generating coherent 3D scene synthesis results.
本文提出了一种以人为中心的生成方法,通过融入人-物交互来确保可用性和连贯性,从而根据文本描述创建功能性的3D室内场景。
This paper introduces a human-centric generative method that creates functional 3D indoor scenes from text descriptions by incorporating human-object interactions to ensure usability and coherence.

Authors:Xiantao Hu, Bineng Zhong, Qihua Liang, Zhiyi Mo, Liangtao Shi, Ying Tai, Jian Yang
Title: Adaptive Perception for Unified Visual Multi-modal Object Tracking
Abstract:
Recently, many multi-modal trackers prioritize RGB as the dominant modality, treating other modalities as auxiliary, and fine-tuning separately various multi-modal tasks. This imbalance in modality dependence limits the ability of methods to dynamically utilize complementary information from each modality in complex scenarios, making it challenging to fully perceive the advantages of multi-modal. As a result, a unified parameter model often underperforms in various multi-modal tracking tasks. To address this issue, we propose APTrack, a novel unified tracker designed for multi-modal adaptive perception. Unlike previous methods, APTrack explores a unified representation through an equal modeling strategy. This strategy allows the model to dynamically adapt to various modalities and tasks without requiring additional fine-tuning between different tasks. Moreover, our tracker integrates an adaptive modality interaction (AMI) module that efficiently bridges cross-modality interactions by generating learnable tokens. Experiments conducted on five diverse multi-modal datasets (RGBT234, LasHeR, VisEvent, DepthTrack, and VOT-RGBD2022) demonstrate that APTrack not only surpasses existing state-of-the-art unified multi-modal trackers but also outperforms trackers designed for specific multi-modal tasks.
Chinese Summary: APTrack提出了一种统一的多模态跟踪器,采用平等建模策略和自适应模态交互模块,无需针对不同任务进行微调即可动态适应多种模态,在多个数据集上实现了领先性能。
English Summary: APTrack introduces a unified multi-modal tracker with an equal modeling strategy and adaptive modality interaction module, enabling dynamic adaptation across modalities and tasks without fine-tuning, and achieving state-of-the-art performance on diverse datasets.

Authors:Yuan Bi, Yang Su, Nassir Navab, Zhongliang Jiang
Title: Gaze-Guided Robotic Vascular Ultrasound Leveraging Human Intention Estimation
Abstract:
Medical ultrasound has been widely used to examine vascular structure in modern clinical practice. However, traditional ultrasound examination often faces challenges related to inter- and intra-operator variation. The robotic ultrasound system (RUSS) appears as a potential solution for such challenges because of its superiority in stability and reproducibility. Given the complex anatomy of human vasculature, multiple vessels often appear in ultrasound images, or a single vessel bifurcates into branches, complicating the examination process. To tackle this challenge, this work presents a gaze-guided RUSS for vascular applications. A gaze tracker captures the eye movements of the operator. The extracted gaze signal guides the RUSS to follow the correct vessel when it bifurcates. Additionally, a gaze-guided segmentation network is proposed to enhance segmentation robustness by exploiting gaze information. However, gaze signals are often noisy, requiring interpretation to accurately discern the operator's true intentions. To this end, this study proposes a stabilization module to process raw gaze data. The inferred attention heatmap is utilized as a region proposal to aid segmentation and serve as a trigger signal when the operator needs to adjust the scanning target, such as when a bifurcation appears. To ensure appropriate contact between the probe and surface during scanning, an automatic ultrasound confidence-based orientation correction method is developed. In experiments, we demonstrated the efficiency of the proposed gaze-guided segmentation pipeline by comparing it with other methods. Besides, the performance of the proposed gaze-guided RUSS was also validated as a whole on a realistic arm phantom with an uneven surface.
中文摘要:本研究开发了一种视线引导的机器人超声系统,通过眼动追踪技术精确追踪血管结构并增强分割效果,有效解决了操作者差异和复杂血管解剖结构带来的挑战。
English Summary: This study introduces a gaze-guided robotic ultrasound system that uses eye-tracking to accurately follow vascular structures and enhance segmentation, addressing challenges of operator variability and complex vessel anatomy.

Authors:Yuchen Liu, Chen Chen, Lingjuan Lyu, Yaochu Jin, Gang Chen
Title: Exploit Gradient Skewness to Circumvent Byzantine Defenses for Federated Learning
Abstract:
Federated Learning (FL) is notorious for its vulnerability to Byzantine attacks. Most current Byzantine defenses share a common inductive bias: among all the gradients, the densely distributed ones are more likely to be honest. However, such a bias is a poison to Byzantine robustness due to a newly discovered phenomenon in this paper - gradient skew. We discover that a group of densely distributed honest gradients skew away from the optimal gradient (the average of honest gradients) due to heterogeneous data. This gradient skew phenomenon allows Byzantine gradients to hide within the densely distributed skewed gradients. As a result, Byzantine defenses are confused into believing that Byzantine gradients are honest. Motivated by this observation, we propose a novel skew-aware attack called STRIKE: first, we search for the skewed gradients; then, we construct Byzantine gradients within the skewed gradients. Experiments on three benchmark datasets validate the effectiveness of our attack
Chinese: 联邦学习极易受到拜占庭攻击,现有防御机制因梯度偏斜现象而失效,恶意梯度可混入诚实梯度中,为此提出的新型偏斜感知攻击STRIKE能有效利用此漏洞。
English: Federated Learning is highly vulnerable to Byzantine attacks, and current defenses are compromised by the gradient skew phenomenon, which allows malicious gradients to blend in with honest ones, prompting the development of a new skew-aware attack called STRIKE that effectively exploits this weakness.

Authors:Yu-Neng Chuang, Leisheng Yu, Guanchu Wang, Lizhe Zhang, Zirui Liu, Xuanting Cai, Yang Sui, Vladimir Braverman, Xia Hu
Title: Confident or Seek Stronger: Exploring Uncertainty-Based On-device LLM Routing From Benchmarking to Generalization
Abstract:
Large language models (LLMs) are increasingly deployed and democratized on edge devices. To improve the efficiency of on-device deployment, small language models (SLMs) are often adopted due to their efficient decoding latency and reduced energy consumption. However, these SLMs often generate inaccurate responses when handling complex queries. One promising solution is uncertainty-based SLM routing, offloading high-stakes queries to stronger LLMs when resulting in low-confidence responses on SLM. This follows the principle of "If you lack confidence, seek stronger support" to enhance reliability. Relying on more powerful LLMs is yet effective but increases invocation costs. Therefore, striking a routing balance between efficiency and efficacy remains a critical challenge. Additionally, efficiently generalizing the routing strategy to new datasets remains under-explored. In this paper, we conduct a comprehensive investigation into benchmarking and generalization of uncertainty-driven routing strategies from SLMs to LLMs over 1500+ settings. Our findings highlight: First, uncertainty-correctness alignment in different uncertainty quantification (UQ) methods significantly impacts routing performance. Second, uncertainty distributions depend more on both the specific SLM and the chosen UQ method, rather than downstream data. Building on the insight, we propose a calibration data construction instruction pipeline and open-source a constructed hold-out set to enhance routing generalization on new downstream scenarios. The experimental results indicate calibration data effectively bootstraps routing performance without any new data.
大型语言模型正越来越多地部署在边缘设备上,但为提升效率采用的小型语言模型在处理复杂查询时往往生成不准确回答,这促使需要基于不确定性的路由策略来平衡效率与准确性,同时实现对新数据集的泛化能力。
Large language models are increasingly deployed on edge devices, but small language models used for efficiency often produce inaccurate responses to complex queries, prompting the need for uncertainty-based routing strategies that balance efficiency and accuracy while generalizing to new datasets.

Authors:Kunfeng Lai, Zhenheng Tang, Xinglin Pan, Peijie Dong, Xiang Liu, Haolan Chen, Li Shen, Bo Li, Xiaowen Chu
Title: Mediator: Memory-efficient LLM Merging with Less Parameter Conflicts and Uncertainty Based Routing
Abstract:
Model merging aggregates Large Language Models (LLMs) finetuned on different tasks into a stronger one. However, parameter conflicts between models leads to performance degradation in averaging. While model routing addresses this issue by selecting individual models during inference, it imposes excessive storage and compute costs, and fails to leverage the common knowledge from different models. In this work, we observe that different layers exhibit varying levels of parameter conflicts. Building on this insight, we average layers with minimal parameter conflicts and use a novel task-level expert routing for layers with significant conflicts. To further reduce storage costs, inspired by task arithmetic sparsity, we decouple multiple fine-tuned experts into a dense expert and several sparse experts. Considering the out-of-distribution samples, we select and merge appropriate experts based on the task uncertainty of the input data. We conduct extensive experiments on both LLaMA and Qwen with varying parameter scales, and evaluate on real-world reasoning tasks. Results demonstrate that our method consistently achieves significant performance improvements while requiring less system cost compared to existing methods.
中文摘要:本文提出一种高效的模型融合方法,通过选择性合并低冲突层并结合任务级专家路由处理高冲突层,在显著降低存储成本的同时,有效提升了实际推理任务的性能表现。
English Summary: This paper introduces an efficient model merging method that selectively averages layers with minimal parameter conflicts and employs task-level expert routing for conflicting layers, significantly reducing storage costs while improving performance on real-world reasoning tasks.

Authors:Zhenqing Ling, Daoyuan Chen, Liuyi Yao, Qianli Shen, Yaliang Li, Ying Shen
Title: Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data
Abstract:
Fine-tuning large language models (LLMs) using diverse datasets is crucial for enhancing their overall performance across various domains. In practical scenarios, existing methods based on modeling the mixture proportions of data composition often struggle with data whose domain labels are missing, imprecise or non-normalized, while methods based on data selection usually encounter difficulties in balancing multi-domain performance. To address these challenges, in this work, we investigate the role of data diversity in enhancing the overall abilities of LLMs by empirically constructing contrastive data pools and theoretically deriving explanations. Building upon the insights gained, we propose a new method that gives the LLM a dual identity: an output model to cognitively probe and select data based on diversity reward, as well as an input model to be tuned with the selected data. Extensive experiments show that the proposed method notably boosts performance across domain-undetermined data and a series of foundational downstream tasks when applied to various advanced LLMs. We release our code and hope this study can shed light on the understanding of data diversity and advance feedback-driven data-model co-design for LLMs.
Chinese Summary: 本研究提出一种新方法,赋予大语言模型双重身份以认知选择多样化数据进行自我调优,显著提升了跨领域和下游任务的性能。
English Summary: This study introduces a novel method that assigns large language models a dual role to cognitively select diverse data for self-tuning, significantly improving performance across domains and downstream tasks.

Authors:Xiyuan Wang, Yewei Liu, Lexi Pang, Siwei Chen, Muhan Zhang
Title: Do Graph Diffusion Models Accurately Capture and Generate Substructure Distributions?
Abstract:
Diffusion models have gained popularity in graph generation tasks; however, the extent of their expressivity concerning the graph distributions they can learn is not fully understood. Unlike models in other domains, popular backbones for graph diffusion models, such as Graph Transformers, do not possess universal expressivity to accurately model the distribution scores of complex graph data. Our work addresses this limitation by focusing on the frequency of specific substructures as a key characteristic of target graph distributions. When evaluating existing models using this metric, we find that they fail to maintain the distribution of substructure counts observed in the training set when generating new graphs. To address this issue, we establish a theoretical connection between the expressivity of Graph Neural Networks (GNNs) and the overall performance of graph diffusion models, demonstrating that more expressive GNN backbones can better capture complex distribution patterns. By integrating advanced GNNs into the backbone architecture, we achieve significant improvements in substructure generation.
中文摘要:图扩散模型因骨干网络表达能力有限而难以准确学习复杂图分布,但通过引入更具表达力的图神经网络,可显著提升其子结构生成能力。
English Summary: Graph diffusion models often fail to accurately capture complex graph distributions due to limited expressivity in their backbones, but integrating more expressive Graph Neural Networks significantly improves their ability to model substructure patterns.

Authors:Xiyuan Wang, Muhan Zhang
Title: Using Random Noise Equivariantly to Boost Graph Neural Networks Universally
Abstract:
Recent advances in Graph Neural Networks (GNNs) have explored the potential of random noise as an input feature to enhance expressivity across diverse tasks. However, naively incorporating noise can degrade performance, while architectures tailored to exploit noise for specific tasks excel yet lack broad applicability. This paper tackles these issues by laying down a theoretical framework that elucidates the increased sample complexity when introducing random noise into GNNs without careful design. We further propose Equivariant Noise GNN (ENGNN), a novel architecture that harnesses the symmetrical properties of noise to mitigate sample complexity and bolster generalization. Our experiments demonstrate that using noise equivariantly significantly enhances performance on node-level, link-level, subgraph, and graph-level tasks and achieves comparable performance to models designed for specific tasks, thereby offering a general method to boost expressivity across various graph tasks.
中文摘要:本文提出等变噪声图神经网络(ENGNN),该架构利用噪声对称性降低样本复杂度并增强泛化能力,在多种图任务中均表现出优异性能。
English Summary: This paper introduces Equivariant Noise GNN (ENGNN), a novel architecture that leverages noise symmetry to reduce sample complexity and improve generalization, achieving strong performance across diverse graph tasks.

Authors:Ziye Jia, Yilu Cao, Lijun He, Guangxia Li, Fuhui Zhou, Qihui Wu, Zhu Han
Title: NFV-Enabled Service Recovery in Space-Air-Ground Integrated Networks: A Matching Game Based Approach
Abstract:
To achieve ubiquitous connectivity of the sixth generation communication, the space-air-ground integrated network (SAGIN) is a popular topic. However, the dynamic nodes in SAGIN such as satellites and unmanned aerial vehicles, may be fragile and out of operation, which can potentially cause service failure. Therefore, the research on service recovery in SAGIN under situations of resource failure is critical. In order to facilitate the flexible resource utilization of SAGIN, the network function virtualization technology (NFV) is proposed to be employed. Firstly, the task management is transformed into the deployment of service function chains (SFCs). Then, we design an NFV-based SFC recovery model in SAGIN in the face of resource failure, so that tasks can quickly select alternative resources to complete deployments. Moreover, the problem of SFC recovery is formulated to minimize the total time consumption for all completed SFCs. Since it is an NP-hard integer linear programming problem, we propose the efficient recovery algorithm based on the matching game. Finally, via various simulations, the effectiveness of the proposed algorithm and its advantages are verified, where the total time consumption is optimized by about 25%, compared with other benchmark methods.
中文: 本研究提出基于网络功能虚拟化的服务功能链恢复模型,通过匹配博弈算法高效修复空天地一体化网络中的服务中断,相比基准方法将总耗时优化约25%。
English: The study proposes an NFV-based service function chain recovery model using a matching game algorithm to efficiently restore services in the dynamic space-air-ground integrated network, reducing total time consumption by approximately 25% compared to existing methods.

Authors:Xiang Liu, Zhenheng Tang, Hong Chen, Peijie Dong, Zeyu Li, Xiuze Zhou, Bo Li, Xuming Hu, Xiaowen Chu
Title: Can LLMs Maintain Fundamental Abilities under KV Cache Compression?
Abstract:
This paper investigates an underexplored challenge in large language models (LLMs): the impact of KV cache compression methods on LLMs' fundamental capabilities. Although existing methods achieve impressive compression ratios on long-context benchmarks, their effects on core model capabilities remain understudied. We present a comprehensive benchmark KVFundaBench to systematically evaluate the effects of KV cache compression across diverse fundamental LLM capabilities, spanning world knowledge, commonsense reasoning, arithmetic reasoning, code generation, safety, and long-context understanding and generation.Our analysis reveals serval key findings: (1) \textit{Task-Dependent Degradation}; (2) \textit{Model-Type Robustness} (3) \textit{Prompt Length Vulnerability}; (4) \textit{Chunk-Level Superiority}; (5) \textit{Prompt-Gain Sensitivity}; (6) \textit{Long-Context Generation Sensitivity}. Based on our analysis of attention patterns and cross-task compression performance, we propose ShotKV, a novel compression approach that distinctly handles prefill and decoding phases while maintaining shot-level semantic coherence. Empirical results show that ShotKV achieves $9\%$-$18\%$ performance improvements on long-context generation tasks under aggressive compression ratios.
本文提出一个基准来评估KV缓存压缩对大型语言模型核心能力的影响,并介绍ShotKV方法,该方法在高压缩率下显著提升长文本生成性能。
This paper introduces a benchmark to assess how KV cache compression affects core LLM abilities and proposes ShotKV, a method that improves long-context generation performance under high compression.

Authors:Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Yue Liu, Bo Li, Xuming Hu, Xiaowen Chu
Title: ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference
Abstract:
Large Language Models (LLMs) require significant GPU memory when processing long texts, with the key value (KV) cache consuming up to 70\% of total memory during inference. Although existing compression methods reduce memory by evaluating the importance of individual tokens, they overlook critical semantic relationships between tokens, resulting in fragmented context and degraded performance. We introduce ChunkKV, which fundamentally reimagines KV cache compression by treating semantic chunks - rather than isolated tokens - as basic compression units. This approach preserves complete linguistic structures and contextual integrity, ensuring that essential meaning is retained even under aggressive compression. Our innovation includes a novel layer-wise index reuse technique that exploits the higher cross-layer similarity of preserved indices in ChunkKV, reducing computational overhead and improving throughput by 26.5\%. Comprehensive evaluations on challenging benchmarks: LongBench, Needle-In-A-HayStack, GSM8K, and JailbreakV demonstrate that ChunkKV outperforms state-of-the-art methods by up to 8.7\% in precision while maintaining the same compression ratio. These results confirm that semantic-aware compression significantly enhances both efficiency and performance for long-context LLM inference, providing a simple yet effective solution to the memory bottleneck problem.
中文摘要:ChunkKV提出了一种语义感知的KV缓存压缩方法,将语义块而非单个词元作为基本压缩单元,在保持上下文完整性的同时,将长文本推理性能提升高达8.7%,吞吐量提高26.5%。
English Summary: ChunkKV introduces a semantic-aware KV cache compression method that treats chunks of tokens as basic units, preserving contextual integrity and improving performance by up to 8.7% while boosting throughput by 26.5% in long-context LLM inference.

Authors:Chin-Chia Michael Yeh, Xiran Fan, Zhimeng Jiang, Yujie Fan, Huiyuan Chen, Uday Singh Saini, Vivian Lai, Xin Dai, Junpeng Wang, Zhongfang Zhuang, Liang Wang, Yan Zheng
Title: UltraSTF: Ultra-Compact Model for Large-Scale Spatio-Temporal Forecasting
Abstract:
Spatio-temporal data, prevalent in real-world applications such as traffic monitoring, financial transactions, and ride-share demands, represents a specialized case of multivariate time series characterized by high dimensionality. This high dimensionality necessitates computationally efficient models and benefits from applying univariate forecasting approaches through channel-independent strategies. SparseTSF, a recently proposed competitive univariate forecasting model, leverages periodicity to achieve compactness by focusing on cross-period dynamics, extending the Pareto frontier in terms of model size and predictive performance. However, it underperforms on spatio-temporal data due to limited capture of intra-period temporal dependencies. To address this limitation, we propose UltraSTF, which integrates a cross-period forecasting component with an ultra-compact shape bank component. Our model efficiently captures recurring patterns in time series using the attention mechanism of the shape bank component, significantly enhancing its capability to learn intra-period dynamics. UltraSTF achieves state-of-the-art performance on the LargeST benchmark while utilizing fewer than 0.2% of the parameters required by the second-best methods, thereby further extending the Pareto frontier of existing approaches.
Chinese: UltraSTF通过整合跨周期预测组件与超紧凑形态库组件,有效解决了SparseTSF在周期内时序依赖捕捉不足的问题,仅用不到0.2%的参数就在LargeST基准上实现了最优性能。
English: UltraSTF overcomes SparseTSF's limitations in capturing intra-period dependencies by combining cross-period forecasting with an ultra-compact shape bank, achieving state-of-the-art performance on LargeST with under 0.2% of the parameters of competing methods.

Authors:Haicheng Liao, Chengyue Wang, Kaiqun Zhu, Yilong Ren, Bolin Gao, Shengbo Eben Li, Chengzhong Xu, Zhenning Li
Title: Minds on the Move: Decoding Trajectory Prediction in Autonomous Driving with Cognitive Insights
Abstract:
In mixed autonomous driving environments, accurately predicting the future trajectories of surrounding vehicles is crucial for the safe operation of autonomous vehicles (AVs). In driving scenarios, a vehicle's trajectory is determined by the decision-making process of human drivers. However, existing models primarily focus on the inherent statistical patterns in the data, often neglecting the critical aspect of understanding the decision-making processes of human drivers. This oversight results in models that fail to capture the true intentions of human drivers, leading to suboptimal performance in long-term trajectory prediction. To address this limitation, we introduce a Cognitive-Informed Transformer (CITF) that incorporates a cognitive concept, Perceived Safety, to interpret drivers' decision-making mechanisms. Perceived Safety encapsulates the varying risk tolerances across drivers with different driving behaviors. Specifically, we develop a Perceived Safety-aware Module that includes a Quantitative Safety Assessment for measuring the subject risk levels within scenarios, and Driver Behavior Profiling for characterizing driver behaviors. Furthermore, we present a novel module, Leanformer, designed to capture social interactions among vehicles. CITF demonstrates significant performance improvements on three well-established datasets. In terms of long-term prediction, it surpasses existing benchmarks by 12.0% on the NGSIM, 28.2% on the HighD, and 20.8% on the MoCAD dataset. Additionally, its robustness in scenarios with limited or missing data is evident, surpassing most state-of-the-art (SOTA) baselines, and paving the way for real-world applications.
Chinese: 本文提出了一种认知信息转换器(CITF),通过引入感知安全概念来模拟人类驾驶员的决策机制,在多个数据集上显著提升了长期轨迹预测的准确性和鲁棒性。
English: This paper introduces a Cognitive-Informed Transformer (CITF) that integrates Perceived Safety to model human drivers' decision-making, significantly improving long-term trajectory prediction accuracy and robustness across multiple datasets.

Authors:Dong Liu, Juan S. Giraldo, Peter Palensky, Pedro P. Vergara
Title: Model-Free Privacy Preserving Power Flow Analysis in Distribution Networks
Abstract:
Model-free power flow calculation, driven by the rise of smart meter (SM) data and the lack of network topology, often relies on artificial intelligence neural networks (ANNs). However, training ANNs require vast amounts of SM data, posing privacy risks for households in distribution networks. To ensure customers' privacy during the SM data gathering and online sharing, we introduce a privacy preserving PF calculation framework, composed of two local strategies: a local randomisation strategy (LRS) and a local zero-knowledge proof (ZKP)-based data collection strategy. First, the LRS is used to achieve irreversible transformation and robust privacy protection for active and reactive power data, thereby ensuring that personal data remains confidential. Subsequently, the ZKP-based data collecting strategy is adopted to securely gather the training dataset for the ANN, enabling SMs to interact with the distribution system operator without revealing the actual voltage magnitude. Moreover, to mitigate the accuracy loss induced by the seasonal variations in load profiles, an incremental learning strategy is incorporated into the online application. The results across three datasets with varying measurement errors demonstrate that the proposed framework efficiently collects one month of SM data within one hour. Furthermore, it robustly maintains mean errors of 0.005 p.u. and 0.014 p.u. under multiple measurement errors and seasonal variations in load profiles, respectively.
中文: 本文提出了一种隐私保护的电力潮流计算框架,采用本地随机化和零知识证明策略,在保护用户隐私的前提下安全采集智能电表数据训练神经网络,并通过增量学习保持计算精度,实现了高效数据采集和稳定的误差控制。
English: This paper introduces a privacy-preserving power flow calculation framework that uses local randomization and zero-knowledge proof strategies to securely collect smart meter data for training neural networks while maintaining accuracy through incremental learning, achieving efficient data collection and robust error control.

Authors:Tian Yu Liu, Alessandro Achille, Matthew Trager, Aditya Golatkar, Luca Zancato, Stefano Soatto
Title: PICASO: Permutation-Invariant Context Composition with State Space Models
Abstract:
Providing Large Language Models with relevant contextual knowledge at inference time has been shown to greatly improve the quality of their generations. This is often achieved by prepending informative passages of text, or 'contexts', retrieved from external knowledge bases to their input. However, processing additional contexts online incurs significant computation costs that scale with their length. State Space Models (SSMs) offer a promising solution by allowing a database of contexts to be mapped onto fixed-dimensional states from which to start the generation. A key challenge arises when attempting to leverage information present across multiple contexts, since there is no straightforward way to condition generation on multiple independent states in existing SSMs. To address this, we leverage a simple mathematical relation derived from SSM dynamics to compose multiple states into one that efficiently approximates the effect of concatenating raw context tokens. Since the temporal ordering of contexts can often be uninformative, we enforce permutation-invariance by efficiently averaging states obtained via our composition algorithm across all possible context orderings. We evaluate our resulting method on WikiText and MSMARCO in both zero-shot and fine-tuned settings, and show that we can match the strongest performing baseline while enjoying on average 5.4x speedup.
中文: 状态空间模型通过将多个上下文状态组合成单一的置换不变表示,实现了高效生成,在WikiText和MSMARCO数据集上达到与最优基线相当的性能,同时平均提速5.4倍。
English: State Space Models enable efficient generation by composing multiple contextual states into a single permutation-invariant representation, achieving performance comparable to top baselines with a 5.4x average speedup on datasets like WikiText and MSMARCO.

Authors:Longchao Da, Xiaoou Liu, Jiaxin Dai, Lu Cheng, Yaqing Wang, Hua Wei
Title: Understanding the Uncertainty of LLM Explanations: A Perspective Based on Reasoning Topology
Abstract:
Understanding the uncertainty in large language model (LLM) explanations is important for evaluating their faithfulness and reasoning consistency, and thus provides insights into the reliability of LLM's output regarding a question. In this work, we propose a novel framework that quantifies uncertainty in LLM explanations through a reasoning topology perspective. By designing a structural elicitation strategy, we guide the LLMs to frame the explanations of an answer into a graph topology. This process decomposes the explanations into the knowledge related sub-questions and topology-based reasoning structures, which allows us to quantify uncertainty not only at the semantic level but also from the reasoning path. It further brings convenience to assess knowledge redundancy and provide interpretable insights into the reasoning process. Our method offers a systematic way to interpret the LLM reasoning, analyze limitations, and provide guidance for enhancing robustness and faithfulness. This work pioneers the use of graph-structured uncertainty measurement in LLM explanations and demonstrates the potential of topology-based quantification.
中文: 本研究提出了一种新颖框架,通过将大语言模型的解释构建为推理图谱来量化其不确定性,实现了语义和推理路径的多层次可靠性评估,并为推理过程提供了可解释的洞察。
English: This study introduces a novel framework that quantifies uncertainty in large language model explanations by structuring them into reasoning graphs, enabling multi-level assessment of semantic and path-based reliability while providing interpretable insights into the reasoning process.

Authors:Tiejin Chen, Xiaoou Liu, Longchao Da, Jia Chen, Vagelis Papalexakis, Hua Wei
Title: Uncertainty Quantification of Large Language Models through Multi-Dimensional Responses
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks due to large training datasets and powerful transformer architecture. However, the reliability of responses from LLMs remains a question. Uncertainty quantification (UQ) of LLMs is crucial for ensuring their reliability, especially in areas such as healthcare, finance, and decision-making. Existing UQ methods primarily focus on semantic similarity, overlooking the deeper knowledge dimensions embedded in responses. We introduce a multi-dimensional UQ framework that integrates semantic and knowledge-aware similarity analysis. By generating multiple responses and leveraging auxiliary LLMs to extract implicit knowledge, we construct separate similarity matrices and apply tensor decomposition to derive a comprehensive uncertainty representation. This approach disentangles overlapping information from both semantic and knowledge dimensions, capturing both semantic variations and factual consistency, leading to more accurate UQ. Our empirical evaluations demonstrate that our method outperforms existing techniques in identifying uncertain responses, offering a more robust framework for enhancing LLM reliability in high-stakes applications.
Chinese: 本文提出了一种多维度不确定性量化框架,通过整合语义与知识感知的相似性分析,分解重叠信息并同时捕捉语义变化与事实一致性,从而更准确地评估大语言模型响应的可靠性。
English: This paper introduces a multi-dimensional uncertainty quantification framework that combines semantic and knowledge-aware similarity analysis to more accurately assess the reliability of large language model responses by disentangling overlapping information and capturing both semantic variations and factual consistency.

Authors:Yehong Huang, Chen Zhao, Rochak Dhakal, Min Zhao, Guang-Uei Hung, Zhixin Jiang, Weihua Zhou
Title: FedDA-TSformer: Federated Domain Adaptation with Vision TimeSformer for Left Ventricle Segmentation on Gated Myocardial Perfusion SPECT Image
Abstract:
Background and Purpose: Functional assessment of the left ventricle using gated myocardial perfusion (MPS) single-photon emission computed tomography relies on the precise extraction of the left ventricular contours while simultaneously ensuring the security of patient data. Methods: In this paper, we introduce the integration of Federated Domain Adaptation with TimeSformer, named 'FedDA-TSformer' for left ventricle segmentation using MPS. FedDA-TSformer captures spatial and temporal features in gated MPS images, leveraging spatial attention, temporal attention, and federated learning for improved domain adaptation while ensuring patient data security. In detail, we employed Divide-Space-Time-Attention mechanism to extract spatio-temporal correlations from the multi-centered MPS datasets, ensuring that predictions are spatio-temporally consistent. To achieve domain adaptation, we align the model output on MPS from three different centers using local maximum mean discrepancy (LMMD) loss. This approach effectively addresses the dual requirements of federated learning and domain adaptation, enhancing the model's performance during training with multi-site datasets while ensuring the protection of data from different hospitals. Results: Our FedDA-TSformer was trained and evaluated using MPS datasets collected from three hospitals, comprising a total of 150 subjects. Each subject's cardiac cycle was divided into eight gates. The model achieved Dice Similarity Coefficients (DSC) of 0.842 and 0.907 for left ventricular (LV) endocardium and epicardium segmentation, respectively. Conclusion: Our proposed FedDA-TSformer model addresses the challenge of multi-center generalization, ensures patient data privacy protection, and demonstrates effectiveness in left ventricular (LV) segmentation.
中文: FedDA-TSformer模型结合联邦学习与时空注意力机制,在保护患者数据隐私的同时,实现了多中心心肌灌注影像中左心室的精准分割。
English: The FedDA-TSformer model integrates federated learning with spatio-temporal attention to achieve accurate left ventricle segmentation from multi-center MPS data while ensuring patient privacy.

Authors:Shitong Xu, Yiyuan Yang, Niki Trigoni, Andrew Markham
Title: Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments
Abstract:
Target speaker extraction focuses on isolating a specific speaker's voice from an audio mixture containing multiple speakers. To provide information about the target speaker's identity, prior works have utilized clean audio samples as conditioning inputs. However, such clean audio examples are not always readily available. For instance, obtaining a clean recording of a stranger's voice at a cocktail party without leaving the noisy environment is generally infeasible. Limited prior research has explored extracting the target speaker's characteristics from noisy enrollments, which may contain overlapping speech from interfering speakers. In this work, we explore a novel enrollment strategy that encodes target speaker information from the noisy enrollment by comparing segments where the target speaker is talking (Positive Enrollments) with segments where the target speaker is silent (Negative Enrollments). Experiments show the effectiveness of our model architecture, which achieves over 2.1 dB higher SI-SNRi compared to prior works in extracting the monaural speech from the mixture of two speakers. Additionally, the proposed two-stage training strategy accelerates convergence, reducing the number of optimization steps required to reach 3 dB SNR by 60\%. Overall, our method achieves state-of-the-art performance in the monaural target speaker extraction conditioned on noisy enrollments.
中文摘要:本研究提出了一种新颖的目标说话人提取方法,利用包含正负片段的噪声注册音频从混合语音中分离目标说话人,通过两阶段训练策略实现了最优性能并显著加速了模型收敛。
English Summary: This study introduces a novel target speaker extraction method that uses noisy enrollment audio with positive and negative segments to isolate a speaker's voice from a mixture, achieving state-of-the-art performance and faster convergence through a two-stage training strategy.

Authors:Bizhu Wang, Zhiqiang Bian, Yue Chen, Xiaodong Xu, Chen Sun, Wenqi Zhang, Ping Zhang
Title: Efficient Semantic-aware Encryption for Secure Communications in Intelligent Connected Vehicles
Abstract:
Semantic communication (SemCom) significantly improves inter-vehicle interactions in intelligent connected vehicles (ICVs) within limited wireless spectrum. However, the open nature of wireless communications introduces eavesdropping risks. To mitigate this, we propose the Efficient Semantic-aware Encryption (ESAE) mechanism, integrating cryptography into SemCom to secure semantic transmission without complex key management. ESAE leverages semantic reciprocity between source and reconstructed information from past communications to independently generate session keys at both ends, reducing key transmission costs and associated security risks. Additionally, ESAE introduces a semantic-aware key pre-processing method (SA-KP) using the YOLO-v10 model to extract consistent semantics from bit-level diverse yet semantically identical content, ensuring key consistency. Experimental results validate ESAE's effectiveness and feasibility under various wireless conditions, with key performance factors discussed.
中文:提出的高效语义感知加密(ESAE)机制通过语义互易性实现独立会话密钥生成,并采用基于YOLO-v10的预处理方法,在保障智能网联车语义通信安全的同时避免了复杂密钥管理,且在各种无线条件下均保持良好性能。
English: The proposed Efficient Semantic-aware Encryption (ESAE) mechanism secures semantic communication for intelligent connected vehicles by enabling independent session key generation through semantic reciprocity and a YOLO-v10-based preprocessing method, eliminating complex key management while maintaining performance across wireless conditions.

Authors:Mihir Parmar, Xin Liu, Palash Goyal, Yanfei Chen, Long Le, Swaroop Mishra, Hossein Mobahi, Jindong Gu, Zifeng Wang, Hootan Nakhost, Chitta Baral, Chen-Yu Lee, Tomas Pfister, Hamid Palangi
Title: PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving
Abstract:
Recent agent frameworks and inference-time algorithms often struggle with complex planning problems due to limitations in verifying generated plans or reasoning and varying complexity of instances within a single task. Many existing methods for these tasks either perform task-level verification without considering constraints or apply inference-time algorithms without adapting to instance-level complexity. To address these limitations, we propose PlanGEN, a model-agnostic and easily scalable agent framework with three key components: constraint, verification, and selection agents. Specifically, our approach proposes constraint-guided iterative verification to enhance performance of inference-time algorithms--Best of N, Tree-of-Thought, and REBASE. In PlanGEN framework, the selection agent optimizes algorithm choice based on instance complexity, ensuring better adaptability to complex planning problems. Experimental results demonstrate significant improvements over the strongest baseline across multiple benchmarks, achieving state-of-the-art results on NATURAL PLAN ($\sim$8%$\uparrow$), OlympiadBench ($\sim$4%$\uparrow$), DocFinQA ($\sim$7%$\uparrow$), and GPQA ($\sim$1%$\uparrow$). Our key finding highlights that constraint-guided iterative verification improves inference-time algorithms, and adaptive selection further boosts performance on complex planning and reasoning problems.
Chinese: PlanGEN是一个可扩展的智能体框架,通过约束引导的迭代验证和自适应算法选择优化推理时算法,在多个基准测试中取得了最先进的性能提升。
English: PlanGEN is a scalable agent framework that enhances inference-time algorithms through constraint-guided iterative verification and adaptive algorithm selection, achieving state-of-the-art results across multiple benchmarks.

Authors:Andrea Busto-Castiñeira, Silvia García-Méndez, Francisco de Arriba-Pérez, Francisco J. González-Castaño
Title: Optimal word order for non-causal text generation with Large Language Models: the Spanish case
Abstract:
Natural Language Generation (NLG) popularity has increased owing to the progress in Large Language Models (LLMs), with zero-shot inference capabilities. However, most neural systems utilize decoder-only causal (unidirectional) transformer models, which are effective for English but may reduce the richness of languages with less strict word order, subject omission, or different relative clause attachment preferences. This is the first work that analytically addresses optimal text generation order for non-causal language models. We present a novel Viterbi algorithm-based methodology for maximum likelihood word order estimation. We analyze the non-causal most-likelihood order probability for NLG in Spanish and, then, the probability of generating the same phrases with Spanish causal NLG. This comparative analysis reveals that causal NLG prefers English-like SVO structures. We also analyze the relationship between optimal generation order and causal left-to-right generation order using Spearman's rank correlation. Our results demonstrate that the ideal order predicted by the maximum likelihood estimator is not closely related to the causal order and may be influenced by the syntactic structure of the target sentence.
中文: 本研究提出了一种基于维特比算法的非因果语言模型最优文本生成顺序方法,发现因果模型偏好类似英语的主谓宾结构,且理想词序受句法影响而非与从左到右生成一致。
English: The study introduces a Viterbi-based method to determine optimal text generation order for non-causal language models, revealing that causal models favor English-like SVO structures and that ideal word order is influenced by syntax rather than aligning with left-to-right generation.

Authors:Xiaoou Liu, Zhen Lin, Longchao Da, Chacha Chen, Shubhendu Trivedi, Hua Wei
Title: MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels
Abstract:
Large Language Models (LLMs) require robust confidence estimation, particularly in critical domains like healthcare and law where unreliable outputs can lead to significant consequences. Despite much recent work in confidence estimation, current evaluation frameworks rely on correctness functions -- various heuristics that are often noisy, expensive, and possibly introduce systematic biases. These methodological weaknesses tend to distort evaluation metrics and thus the comparative ranking of confidence measures. We introduce MCQA-Eval, an evaluation framework for assessing confidence measures in Natural Language Generation (NLG) that eliminates dependence on an explicit correctness function by leveraging gold-standard correctness labels from multiple-choice datasets. MCQA-Eval enables systematic comparison of both internal state-based white-box (e.g. logit-based) and consistency-based black-box confidence measures, providing a unified evaluation methodology across different approaches. Through extensive experiments on multiple LLMs and widely used QA datasets, we report that MCQA-Eval provides efficient and more reliable assessments of confidence estimation methods than existing approaches.
中文: MCQA-Eval是一种创新的评估框架,通过利用多项选择题数据集的黄金标准标签,消除了对嘈杂正确性函数的依赖,能够可靠且系统地评估不同方法下大语言模型的置信度估计。
English: MCQA-Eval is a novel evaluation framework that eliminates the need for noisy correctness functions by using gold-standard labels from multiple-choice datasets, enabling reliable and systematic assessment of confidence estimation methods for LLMs across various approaches.

Authors:Shuai Niu, Jing Ma, Hongzhan Lin, Liang Bai, Zhihua Wang, Wei Bi, Yida Xu, Guo Li, Xian Yang
Title: ProMedTS: A Self-Supervised, Prompt-Guided Multimodal Approach for Integrating Medical Text and Time Series
Abstract:
Large language models (LLMs) have shown remarkable performance in vision-language tasks, but their application in the medical field remains underexplored, particularly for integrating structured time series data with unstructured clinical notes. In clinical practice, dynamic time series data, such as lab test results, capture critical temporal patterns, while clinical notes provide rich semantic context. Merging these modalities is challenging due to the inherent differences between continuous signals and discrete text. To bridge this gap, we introduce ProMedTS, a novel self-supervised multimodal framework that employs prompt-guided learning to unify these heterogeneous data types. Our approach leverages lightweight anomaly detection to generate anomaly captions that serve as prompts, guiding the encoding of raw time series data into informative prompt embeddings. These prompt embeddings are aligned with textual representations in a shared latent space, preserving fine-grained temporal nuances alongside semantic insights. Furthermore, our framework incorporates tailored self-supervised objectives to enhance both intra- and inter-modal alignment. We evaluate ProMedTS on disease diagnosis tasks using real-world datasets, and the results demonstrate that our method consistently outperforms state-of-the-art approaches.
中文摘要:ProMedTS是一种通过提示引导学习整合结构化时间序列数据和临床文本的自监督多模态框架,在疾病诊断任务中展现出优于现有方法的性能。
English Summary: ProMedTS is a self-supervised framework that integrates structured time series data with clinical notes using prompt-guided learning, demonstrating superior performance in disease diagnosis compared to existing methods.

Authors:Longchao Da, Justin Turnau, Thirulogasankar Pranav Kutralingam, Alvaro Velasquez, Paulo Shakarian, Hua Wei
Title: A Survey of Sim-to-Real Methods in RL: Progress, Prospects and Challenges with Foundation Models
Abstract:
Deep Reinforcement Learning (RL) has been explored and verified to be effective in solving decision-making tasks in various domains, such as robotics, transportation, recommender systems, etc. It learns from the interaction with environments and updates the policy using the collected experience. However, due to the limited real-world data and unbearable consequences of taking detrimental actions, the learning of RL policy is mainly restricted within the simulators. This practice guarantees safety in learning but introduces an inevitable sim-to-real gap in terms of deployment, thus causing degraded performance and risks in execution. There are attempts to solve the sim-to-real problems from different domains with various techniques, especially in the era with emerging techniques such as large foundations or language models that have cast light on the sim-to-real. This survey paper, to the best of our knowledge, is the first taxonomy that formally frames the sim-to-real techniques from key elements of the Markov Decision Process (State, Action, Transition, and Reward). Based on the framework, we cover comprehensive literature from the classic to the most advanced methods including the sim-to-real techniques empowered by foundation models, and we also discuss the specialties that are worth attention in different domains of sim-to-real problems. Then we summarize the formal evaluation process of sim-to-real performance with accessible code or benchmarks. The challenges and opportunities are also presented to encourage future exploration of this direction. We are actively maintaining a repository to include the most up-to-date sim-to-real research work to help domain researchers.
中文摘要:本综述首次提出深度强化学习中仿真到现实技术的分类法,依据马尔可夫决策过程要素系统梳理了从经典方法到基础模型赋能的先进方案,探讨了跨领域应用要点,并持续维护最新研究资源库。
English Summary: This survey presents the first taxonomy for sim-to-real techniques in deep reinforcement learning, categorizing methods by Markov Decision Process elements and covering classic to foundation model-enhanced approaches while discussing domain-specific considerations and maintaining an updated resource repository.

Authors:Xiangyu Li, Yawen Zeng, Xiaofen Xing, Jin Xu, Xiangmin Xu
Title: HedgeAgents: A Balanced-aware Multi-agent Financial Trading System
Abstract:
As automated trading gains traction in the financial market, algorithmic investment strategies are increasingly prominent. While Large Language Models (LLMs) and Agent-based models exhibit promising potential in real-time market analysis and trading decisions, they still experience a significant -20% loss when confronted with rapid declines or frequent fluctuations, impeding their practical application. Hence, there is an imperative to explore a more robust and resilient framework. This paper introduces an innovative multi-agent system, HedgeAgents, aimed at bolstering system robustness via ``hedging'' strategies. In this well-balanced system, an array of hedging agents has been tailored, where HedgeAgents consist of a central fund manager and multiple hedging experts specializing in various financial asset classes. These agents leverage LLMs' cognitive capabilities to make decisions and coordinate through three types of conferences. Benefiting from the powerful understanding of LLMs, our HedgeAgents attained a 70% annualized return and a 400% total return over a period of 3 years. Moreover, we have observed with delight that HedgeAgents can even formulate investment experience comparable to those of human experts (https://hedgeagents.github.io/).
中文: 本文提出的HedgeAgents多智能体系统采用大语言模型和对冲策略,在三年间实现70%年化收益,既能有效抵御市场波动,又能生成媲美人类专家的投资经验。
English: This paper introduces HedgeAgents, a multi-agent system using LLMs and hedging strategies that achieved a 70% annual return over three years, demonstrating robustness against market volatility while generating expert-level investment insights.

Authors:Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, Graham Neubig
Title: Interactive Agents to Overcome Ambiguity in Software Engineering
Abstract:
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes, safety risks due to tool misuse, and wasted computational resources. In this work, we study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance across three key steps: (a) leveraging interactivity to improve performance in ambiguous scenarios, (b) detecting ambiguity, and (c) asking targeted questions. Our findings reveal that models struggle to distinguish between well-specified and underspecified instructions. However, when models interact for underspecified inputs, they effectively obtain vital information from the user, leading to significant improvements in performance and underscoring the value of effective interaction. Our study highlights critical gaps in how current state-of-the-art models handle ambiguity in complex software engineering tasks and structures the evaluation into distinct steps to enable targeted improvements.
Chinese: AI代理在处理模糊用户指令时易产生误解,但通过交互式沟通能显著提升其性能,使其能够识别模糊之处并寻求澄清,然而现有模型在区分明确与不明确任务方面仍有不足。
English: AI agents often misinterpret ambiguous user instructions, but interactive engagement significantly enhances their performance by enabling them to detect ambiguities and seek clarifications, though current models still struggle with distinguishing well-specified from underspecified tasks.

Authors:Shen Han, Zhiyao Zhou, Jiawei Chen, Zhezheng Hao, Sheng Zhou, Gang Wang, Yan Feng, Chun Chen, Can Wang
Title: Uncertainty-Aware Graph Structure Learning
Abstract:
Graph Neural Networks (GNNs) have become a prominent approach for learning from graph-structured data. However, their effectiveness can be significantly compromised when the graph structure is suboptimal. To address this issue, Graph Structure Learning (GSL) has emerged as a promising technique that refines node connections adaptively. Nevertheless, we identify two key limitations in existing GSL methods: 1) Most methods primarily focus on node similarity to construct relationships, while overlooking the quality of node information. Blindly connecting low-quality nodes and aggregating their ambiguous information can degrade the performance of other nodes. 2) The constructed graph structures are often constrained to be symmetric, which may limit the model's flexibility and effectiveness. To overcome these limitations, we propose an Uncertainty-aware Graph Structure Learning (UnGSL) strategy. UnGSL estimates the uncertainty of node information and utilizes it to adjust the strength of directional connections, where the influence of nodes with high uncertainty is adaptively reduced. Importantly, UnGSL serves as a plug-in module that can be seamlessly integrated into existing GSL methods with minimal additional computational cost. In our experiments, we implement UnGSL into six representative GSL methods, demonstrating consistent performance improvements.
中文: 图神经网络在处理次优图结构时存在局限,现有图结构学习方法因忽视节点信息质量和强制对称连接而效果受限,为此提出不确定性感知图结构学习(UnGSL),通过自适应降低高不确定性节点影响作为即插即用模块,实现稳定性能提升。
English: Graph Neural Networks face limitations from suboptimal graph structures, which existing Graph Structure Learning methods inadequately address by overlooking node information quality and enforcing symmetric connections, prompting the proposed Uncertainty-aware Graph Structure Learning (UnGSL) that adaptively reduces influence from uncertain nodes as a plug-in module for consistent performance gains.

Authors:Bowei He, Lihao Yin, Hui-Ling Zhen, Xiaokun Zhang, Mingxuan Yuan, Chen Ma
Title: PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery
Abstract:
Model pruning is an effective approach for compressing large language models (LLMs). However, this process often leads to significant degradation of model capabilities. While post-training techniques such as instruction tuning are commonly employed to recover model performance, existing methods often overlook the uneven deterioration of model capabilities and incur high computational costs. Moreover, some irrelevant instructions may also introduce negative effects to model capacity recovery. To address these challenges, we propose the \textbf{P}ost-training d\textbf{A}ta \textbf{S}election method for \textbf{E}fficient pruned large language model \textbf{R}ecovery (\textbf{PASER}). PASER aims to identify instructions to recover the most compromised model capacities with a certain data budget. Our approach first applies manifold learning and spectral clustering to group recovery instructions in the semantic space, revealing capability-specific instruction sets. Then, the data budget is adaptively allocated across clusters by the degree of corresponding model capability degradation. In each cluster, we prioritize data samples that lead to the most decline of model performance. To mitigate potential negative tuning effects, we also detect and filter out conflicting or irrelevant recovery data. Extensive experiments demonstrate that PASER significantly outperforms conventional baselines, effectively recovering the general capabilities of pruned LLMs while utilizing merely 4\%-20\% of the original post-training data. We provide the anonymous code repository in \href{https://anonymous.4open.science/r/PASER-E606}{Link}.
中文摘要:PASER是一种高效的训练后数据选择方法,通过基于语义聚类和自适应预算分配策略,有针对性地选择恢复指令来修复剪枝后大语言模型的能力,仅需4%-20%的原始数据即可显著提升模型性能。
English Summary: PASER is an efficient post-training data selection method that recovers pruned large language models' capabilities by strategically selecting and allocating instructions based on their impact and relevance, achieving superior performance with only 4%-20% of the original data.

Authors:Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar
Title: Scaling Test-Time Compute Without Verification or RL is Suboptimal
Abstract:
Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: first, distilling successful search or thinking traces; and second, using verification (e.g., 0/1 outcome rewards, reward models, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős, 1945]. This implies a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF methods widening as test-time budget grows. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
中文: 本研究证明在扩展测试时计算中,基于验证器的方法明显优于无验证器方法,且当预训练大语言模型呈现异构解决方案分布时,随着计算资源增加,两者性能差距会进一步扩大。
English: This study demonstrates that verifier-based methods significantly outperform verifier-free approaches in scaling test-time compute, with the performance gap widening as computational resources increase, particularly when pre-trained LLMs exhibit heterogeneous solution distributions.

Authors:Pramuditha Perera, Matthew Trager, Luca Zancato, Alessandro Achille, Stefano Soatto
Title: Descriminative-Generative Custom Tokens for Vision-Language Models
Abstract:
This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries. The targeted concept is specified in terms of a small set of images and a parent concept described using text. We operate on CLIP text features and propose to use a combination of a textual inversion loss and a classification loss to ensure that text features of the learned token are aligned with image features of the concept in the CLIP embedding space. We restrict the learned token to a low-dimensional subspace spanned by tokens for attributes that are appropriate for the given super-class. These modifications improve the quality of compositions of the learned token with natural language for generating new scenes. Further, we show that learned custom tokens can be used to form queries for text-to-image retrieval task, and also have the important benefit that composite queries can be visualized to ensure that the desired concept is faithfully encoded. Based on this, we introduce the method of Generation Aided Image Retrieval, where the query is modified at inference time to better suit the search intent. On the DeepFashion2 dataset, our method improves Mean Reciprocal Retrieval (MRR) over relevant baselines by 7%.
中文摘要:本文提出一种在视觉语言模型中学习自定义标记的方法,通过结合图像和文本来有效表示新概念,在生成任务和文本到图像检索中的性能提升了7%。
English Summary: This paper introduces a method for learning custom tokens in Vision-Language Models that effectively represent new concepts using both images and text, improving performance in generative tasks and text-to-image retrieval by 7% on benchmark datasets.

Authors:Haochen Xue, Feilong Tang, Ming Hu, Yexin Liu, Qidong Huang, Yulong Li, Chengzhi Liu, Zhongxing Xu, Chong Zhang, Chun-Mei Feng, Yutong Xie, Imran Razzak, Zongyuan Ge, Jionglong Su, Junjun He, Yu Qiao
Title: MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation
Abstract:
Recent multimodal large language models (MLLMs) have demonstrated significant potential in open-ended conversation, generating more accurate and personalized responses. However, their abilities to memorize, recall, and reason in sustained interactions within real-world scenarios remain underexplored. This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. With data collected from real-world scenarios, MMRC comprises 5,120 conversations and 28,720 corresponding manually labeled questions, posing a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC indicate an accuracy drop during open-ended interactions. We identify four common failure patterns: long-term memory degradation, inadequacies in updating factual knowledge, accumulated assumption of error propagation, and reluctance to say no. To mitigate these issues, we propose a simple yet effective NOTE-TAKING strategy, which can record key information from the conversation and remind the model during its responses, enhancing conversational capabilities. Experiments across six MLLMs demonstrate significant performance improvements.
中文摘要:本文提出MMRC基准来评估多模态大语言模型的开放式对话能力,揭示了现有模型的性能缺陷,并提出一种简单有效的笔记记录策略,显著提升了模型表现。
English Summary: This paper introduces the MMRC benchmark to evaluate multimodal large language models' open-ended conversation abilities, revealing performance issues and proposing a NOTE-TAKING strategy that significantly improves model performance.

Authors:Zaitian Wang, Jinghan Zhang, Xinhao Zhang, Kunpeng Liu, Pengfei Wang, Yuanchun Zhou
Title: Diversity-oriented Data Augmentation with Large Language Models
Abstract:
Data augmentation is an essential technique in natural language processing (NLP) for enriching training datasets by generating diverse samples. This process is crucial for improving the robustness and generalization capabilities of NLP models. However, a significant challenge remains: \textit{Insufficient Attention to Sample Distribution Diversity}. Most existing methods focus on increasing the sample numbers while neglecting the sample distribution diversity, which can lead to model overfitting. In response, we explore data augmentation's impact on dataset diversity and propose a \textbf{\underline{D}}iversity-\textbf{\underline{o}}riented data \textbf{\underline{Aug}}mentation framework (\textbf{DoAug}). % \(\mathscr{DoAug}\) Specifically, we utilize a diversity-oriented fine-tuning approach to train an LLM as a diverse paraphraser, which is capable of augmenting textual datasets by generating diversified paraphrases. Then, we apply the LLM paraphraser to a selected coreset of highly informative samples and integrate the paraphrases with the original data to create a more diverse augmented dataset. Finally, we conduct extensive experiments on 12 real-world textual datasets. The results show that our fine-tuned LLM augmenter improves diversity while preserving label consistency, thereby enhancing the robustness and performance of downstream tasks. Specifically, it achieves an average performance gain of \(10.52\%\), surpassing the runner-up baseline with more than three percentage points.
中文摘要:数据增强在自然语言处理中至关重要,但现有方法常忽视样本分布多样性,导致模型过拟合;提出的DoAug框架通过微调大型语言模型生成多样化复述,有效提升数据集多样性,使下游任务性能平均提高超过10%。
English Summary: Data augmentation is vital in NLP for enhancing model robustness, yet current methods often overlook sample distribution diversity, leading to overfitting; the proposed DoAug framework addresses this by using a fine-tuned LLM to generate diverse paraphrases, significantly improving dataset diversity and boosting downstream task performance by over 10%.

Authors:Ruichao Yang, Jing Ma, Wei Gao, Hongzhan Lin
Title: LLM-Enhanced Multiple Instance Learning for Joint Rumor and Stance Detection with Social Context Information
Abstract:
The proliferation of misinformation, such as rumors on social media, has drawn significant attention, prompting various expressions of stance among users. Although rumor detection and stance detection are distinct tasks, they can complement each other. Rumors can be identified by cross-referencing stances in related posts, and stances are influenced by the nature of the rumor. However, existing stance detection methods often require post-level stance annotations, which are costly to obtain. We propose a novel LLM-enhanced MIL approach to jointly predict post stance and claim class labels, supervised solely by claim labels, using an undirected microblog propagation model. Our weakly supervised approach relies only on bag-level labels of claim veracity, aligning with multi-instance learning (MIL) principles. To achieve this, we transform the multi-class problem into multiple MIL-based binary classification problems. We then employ a discriminative attention layer to aggregate the outputs from these classifiers into finer-grained classes. Experiments conducted on three rumor datasets and two stance datasets demonstrate the effectiveness of our approach, highlighting strong connections between rumor veracity and expressed stances in responding posts. Our method shows promising performance in joint rumor and stance detection compared to the state-of-the-art methods.
中文摘要:本研究提出了一种新颖的LLM增强多示例学习方法,仅使用声明级监督即可联合检测谣言真实性和用户立场,通过利用谣言性质与表达立场之间的内在联系,在多个数据集上展现出优越性能。
English Summary: This study introduces a novel LLM-enhanced multi-instance learning approach that jointly detects rumor veracity and user stances using only claim-level supervision, demonstrating strong performance across multiple datasets by leveraging the intrinsic connection between rumor nature and expressed stances.

Authors:Xuanze Chen, Jiajun Zhou, Jinsong Chen, Shanqing Yu, Qi Xuan
Title: Mixture of Decoupled Message Passing Experts with Entropy Constraint for General Node Classification
Abstract:
The varying degrees of homophily and heterophily in real-world graphs persistently constrain the universality of graph neural networks (GNNs) for node classification. Adopting a data-centric perspective, this work reveals an inherent preference of different graphs towards distinct message encoding schemes: homophilous graphs favor local propagation, while heterophilous graphs exhibit preference for flexible combinations of propagation and transformation. To address this, we propose GNNMoE, a universal node classification framework based on the Mixture-of-Experts (MoE) mechanism. The framework first constructs diverse message-passing experts through recombination of fine-grained encoding operators, then designs soft and hard gating layers to allocate the most suitable expert networks for each node's representation learning, thereby enhancing both model expressiveness and adaptability to diverse graphs. Furthermore, considering that soft gating might introduce encoding noise in homophilous scenarios, we introduce an entropy constraint to guide sharpening of soft gates, achieving organic integration of weighted combination and Top-K selection. Extensive experiments demonstrate that GNNMoE significantly outperforms mainstream GNNs, heterophilous GNNs, and graph transformers in both node classification performance and universality across diverse graph datasets.
中文摘要:本研究提出GNNMoE框架,通过混合专家机制自适应组合消息传递专家,有效提升模型在不同同质性和异质性图数据上的节点分类性能与普适性。
English Summary: This study introduces GNNMoE, a universal node classification framework that leverages the Mixture-of-Experts mechanism to adaptively combine message-passing experts for enhanced performance across diverse graph types with varying homophily and heterophily levels.

Authors:Bowei He, Lihao Yin, Hui-Ling Zhen, Jianping Zhang, Lanqing Hong, Mingxuan Yuan, Chen Ma
Title: Certifying Language Model Robustness with Fuzzed Randomized Smoothing: An Efficient Defense Against Backdoor Attacks
Abstract:
The widespread deployment of pre-trained language models (PLMs) has exposed them to textual backdoor attacks, particularly those planted during the pre-training stage. These attacks pose significant risks to high-reliability applications, as they can stealthily affect multiple downstream tasks. While certifying robustness against such threats is crucial, existing defenses struggle with the high-dimensional, interdependent nature of textual data and the lack of access to original poisoned pre-training data. To address these challenges, we introduce \textbf{F}uzzed \textbf{R}andomized \textbf{S}moothing (\textbf{FRS}), a novel approach for efficiently certifying language model robustness against backdoor attacks. FRS integrates software robustness certification techniques with biphased model parameter smoothing, employing Monte Carlo tree search for proactive fuzzing to identify vulnerable textual segments within the Damerau-Levenshtein space. This allows for targeted and efficient text randomization, while eliminating the need for access to poisoned training data during model smoothing. Our theoretical analysis demonstrates that FRS achieves a broader certified robustness radius compared to existing methods. Extensive experiments across various datasets, model configurations, and attack strategies validate FRS's superiority in terms of defense efficiency, accuracy, and robustness.
中文摘要:FRS是一种创新方法,通过结合软件认证技术与参数平滑及主动模糊测试,无需中毒训练数据即可有效认证语言模型对后门攻击的鲁棒性,并在防御效率、准确性和鲁棒性方面表现卓越。
English Summary: FRS is a novel method that efficiently certifies language model robustness against backdoor attacks by combining software certification techniques with parameter smoothing and proactive fuzzing, eliminating the need for poisoned training data while achieving superior defense performance.

Authors:Jinhao Duan, Xinyu Zhao, Zhuoxuan Zhang, Eunhye Ko, Lily Boddy, Chenan Wang, Tianhao Li, Alexander Rasgon, Junyuan Hong, Min Kyung Lee, Chenxi Yuan, Qi Long, Ying Ding, Tianlong Chen, Kaidi Xu
Title: GuideLLM: Exploring LLM-Guided Conversation with Applications in Autobiography Interviewing
Abstract:
Although Large Language Models (LLMs) succeed in human-guided conversations such as instruction following and question answering, the potential of LLM-guided conversations-where LLMs direct the discourse and steer the conversation's objectives-remains under-explored. In this study, we first characterize LLM-guided conversation into three fundamental components: (i) Goal Navigation; (ii) Context Management; (iii) Empathetic Engagement, and propose GuideLLM as an installation. We then implement an interviewing environment for the evaluation of LLM-guided conversation. Specifically, various topics are involved in this environment for comprehensive interviewing evaluation, resulting in around 1.4k turns of utterances, 184k tokens, and over 200 events mentioned during the interviewing for each chatbot evaluation. We compare GuideLLM with 6 state-of-the-art LLMs such as GPT-4o and Llama-3-70b-Instruct, from the perspective of interviewing quality, and autobiography generation quality. For automatic evaluation, we derive user proxies from multiple autobiographies and employ LLM-as-a-judge to score LLM behaviors. We further conduct a human-involved experiment by employing 45 human participants to chat with GuideLLM and baselines. We then collect human feedback, preferences, and ratings regarding the qualities of conversation and autobiography. Experimental results indicate that GuideLLM significantly outperforms baseline LLMs in automatic evaluation and achieves consistent leading performances in human ratings.
中文摘要:本研究提出GuideLLM探索大语言模型引导对话的新范式,在访谈质量和自传生成的自动评估与人类评分中均显著优于现有先进模型。
English Summary: This research introduces GuideLLM to explore LLM-guided conversations, demonstrating its superior performance over leading models in both automated assessments and human evaluations of interview and autobiography quality.

Authors:Yunchu Han, Zhaojun Nan, Sheng Zhou, Zhisheng Niu
Title: DVFS-Aware DNN Inference on GPUs: Latency Modeling and Performance Analysis
Abstract:
The rapid development of deep neural networks (DNNs) is inherently accompanied by the problem of high computational costs. To tackle this challenge, dynamic voltage frequency scaling (DVFS) is emerging as a promising technology for balancing the latency and energy consumption of DNN inference by adjusting the computing frequency of processors. However, most existing models of DNN inference time are based on the CPU-DVFS technique, and directly applying the CPU-DVFS model to DNN inference on GPUs will lead to significant errors in optimizing latency and energy consumption. In this paper, we propose a DVFS-aware latency model to precisely characterize DNN inference time on GPUs. We first formulate the DNN inference time based on extensive experiment results for different devices and analyze the impact of fitting parameters. Then by dividing DNNs into multiple blocks and obtaining the actual inference time, the proposed model is further verified. Finally, we compare our proposed model with the CPU-DVFS model in two specific cases. Evaluation results demonstrate that local inference optimization with our proposed model achieves a reduction of no less than 66% and 69% in inference time and energy consumption respectively. In addition, cooperative inference with our proposed model can improve the partition policy and reduce the energy consumption compared to the CPU-DVFS model.
Chinese: 本文提出了一种针对GPU的DVFS感知延迟模型,相比传统CPU模型,该模型在局部推理优化中能分别降低至少66%的推理时间和69%的能耗,并通过协同推理进一步优化分区策略。
English: This paper introduces a GPU-specific DVFS-aware latency model that significantly improves DNN inference efficiency by reducing both inference time and energy consumption by at least 66% and 69% respectively, outperforming traditional CPU-based models.

Authors:Wanqi Yang, Yanda Li, Meng Fang, Ling Chen
Title: MTPChat: A Multimodal Time-Aware Persona Dataset for Conversational Agents
Abstract:
Understanding temporal dynamics is critical for conversational agents, enabling effective content analysis and informed decision-making. However, time-aware datasets, particularly for persona-grounded conversations, are still limited, which narrows their scope and diminishes their complexity. To address this gap, we introduce MTPChat, a multimodal, time-aware persona dialogue dataset that integrates linguistic, visual, and temporal elements within dialogue and persona memory. Leveraging MTPChat, we propose two time-sensitive tasks: Temporal Next Response Prediction (TNRP) and Temporal Grounding Memory Prediction (TGMP), both designed to assess a model's ability to understand implicit temporal cues and dynamic interactions. Additionally, we present an innovative framework featuring an adaptive temporal module to effectively integrate multimodal streams and capture temporal dependencies. Experimental results validate the challenges posed by MTPChat and demonstrate the effectiveness of our framework in multimodal time-sensitive scenarios.
中文摘要:本研究提出MTPChat多模态时序感知对话数据集,通过设计两项时序敏感任务和自适应框架,解决了对话系统中时序数据匮乏的问题,有效提升了动态交互的建模能力。
English Summary: The study introduces MTPChat, a multimodal time-aware dialogue dataset, and proposes two time-sensitive tasks along with an adaptive framework to address the scarcity of temporal data in conversational AI, demonstrating improved handling of dynamic interactions.

Authors:Ziqi Ding, Gelei Deng, Yi Liu, Junchen Ding, Jieshan Chen, Yulei Sui, Yuekang Li
Title: IllusionCAPTCHA: A CAPTCHA based on Visual Illusion
Abstract:
CAPTCHAs have long been essential tools for protecting applications from automated bots. Initially designed as simple questions to distinguish humans from bots, they have become increasingly complex to keep pace with the proliferation of CAPTCHA-cracking techniques employed by malicious actors. However, with the advent of advanced large language models (LLMs), the effectiveness of existing CAPTCHAs is now being undermined. To address this issue, we have conducted an empirical study to evaluate the performance of multimodal LLMs in solving CAPTCHAs and to assess how many attempts human users typically need to pass them. Our findings reveal that while LLMs can solve most CAPTCHAs, they struggle with those requiring complex reasoning type of CAPTCHA that also presents significant challenges for human users. Interestingly, our user study shows that the majority of human participants require a second attempt to pass these reasoning CAPTCHAs, a finding not reported in previous research. Based on empirical findings, we present IllusionCAPTCHA, a novel security mechanism employing the "Human-Easy but AI-Hard" paradigm. This new CAPTCHA employs visual illusions to create tasks that are intuitive for humans but highly confusing for AI models. Furthermore, we developed a structured, step-by-step method that generates misleading options, which particularly guide LLMs towards making incorrect choices and reduce their chances of successfully solving CAPTCHAs. Our evaluation shows that IllusionCAPTCHA can effectively deceive LLMs 100% of the time. Moreover, our structured design significantly increases the likelihood of AI errors when attempting to solve these challenges. Results from our user study indicate that 86.95% of participants successfully passed the CAPTCHA on their first attempt, outperforming other CAPTCHA systems.
中文: 随着大型语言模型的发展,传统验证码面临挑战,因此IllusionCAPTCHA应运而生,它利用视觉错觉有效误导AI,同时保持对人类用户的高度友好性。
English: CAPTCHAs are increasingly vulnerable to advanced large language models, leading to the development of IllusionCAPTCHA, which uses visual illusions to effectively deceive AI while remaining user-friendly for humans.

Authors:Mohammadreza Baharani, Ghazal Alinezhad Noghre, Armin Danesh Pazho, Gabriel Maldonado, Hamed Tabkhi
Title: MoFM: A Large-Scale Human Motion Foundation Model
Abstract:
Foundation Models (FM) have increasingly drawn the attention of researchers due to their scalability and generalization across diverse tasks. Inspired by the success of FMs and the principles that have driven advancements in Large Language Models (LLMs), we introduce MoFM as a novel Motion Foundation Model. MoFM is designed for the semantic understanding of complex human motions in both time and space. To facilitate large-scale training, MotionBook, a comprehensive human motion dictionary of discretized motions is designed and employed. MotionBook utilizes Thermal Cubes to capture spatio-temporal motion heatmaps, applying principles from discrete variational models to encode human movements into discrete units for a more efficient and scalable representation. MoFM, trained on a large corpus of motion data, provides a foundational backbone adaptable to diverse downstream tasks, supporting paradigms such as one-shot, unsupervised, and supervised tasks. This versatility makes MoFM well-suited for a wide range of motion-based applications.
中文摘要:MoFM是一种新颖的运动基础模型,旨在从时间和空间维度理解复杂人体运动的语义,通过名为MotionBook的综合运动词典实现高效表征,并借助大规模训练支持多种下游任务范式。
English Summary: MoFM is a novel Motion Foundation Model designed for semantic understanding of complex human motions in both time and space, utilizing a comprehensive motion dictionary called MotionBook for efficient representation and supporting diverse downstream tasks through large-scale training.

Authors:Yiping Zhang, Yuntao Shou, Wei Ai, Tao Meng, Keqin Li
Title: LRA-GNN: Latent Relation-Aware Graph Neural Network with Initial and Dynamic Residual for Facial Age Estimation
Abstract:
Face information is mainly concentrated among facial key points, and frontier research has begun to use graph neural networks to segment faces into patches as nodes to model complex face representations. However, these methods construct node-to-node relations based on similarity thresholds, so there is a problem that some latent relations are missing. These latent relations are crucial for deep semantic representation of face aging. In this novel, we propose a new Latent Relation-Aware Graph Neural Network with Initial and Dynamic Residual (LRA-GNN) to achieve robust and comprehensive facial representation. Specifically, we first construct an initial graph utilizing facial key points as prior knowledge, and then a random walk strategy is employed to the initial graph for obtaining the global structure, both of which together guide the subsequent effective exploration and comprehensive representation. Then LRA-GNN leverages the multi-attention mechanism to capture the latent relations and generates a set of fully connected graphs containing rich facial information and complete structure based on the aforementioned guidance. To avoid over-smoothing issues for deep feature extraction on the fully connected graphs, the deep residual graph convolutional networks are carefully designed, which fuse adaptive initial residuals and dynamic developmental residuals to ensure the consistency and diversity of information. Finally, to improve the estimation accuracy and generalization ability, progressive reinforcement learning is proposed to optimize the ensemble classification regressor. Our proposed framework surpasses the state-of-the-art baselines on several age estimation benchmarks, demonstrating its strength and effectiveness.
中文摘要:本文提出一种潜在关系感知图神经网络(LRA-GNN),通过多注意力机制和残差连接捕捉面部潜在关系,在多个年龄估计基准测试中超越现有方法,解决了传统图神经网络在面部语义表征中存在的潜在关系缺失问题。
English Summary: This paper introduces a Latent Relation-Aware Graph Neural Network (LRA-GNN) that captures latent facial relations through multi-attention mechanisms and residual connections, achieving state-of-the-art performance in age estimation by overcoming limitations of previous graph-based methods.

Authors:Phillippe Sauter, Thomas Benz, Paul Scheffler, Hannah Pochert, Luisa Wüthrich, Martin Povišer, Beat Muheim, Frank K. Gürkaynak, Luca Benini
Title: Croc: An End-to-End Open-Source Extensible RISC-V MCU Platform to Democratize Silicon
Abstract:
Ensuring a continuous and growing influx of skilled chip designers and a smooth path from education to innovation are key goals for several national and international "Chips Acts". Silicon democratization can greatly benefit from end-to-end (from silicon technology to software) free and open-source (OS) platforms. We present Croc, an extensible RISC-V microcontroller platform explicitly targeted at hands-on teaching and innovation. Croc features a streamlined OS synthesis and an end-to-end OS implementation flow, ensuring full, unconstrained access to the design, the design automation tools, and the implementation technology. Croc uses the industry-proven, open-source CVE2 core, implementing the RV32I(EMC) instruction set architecture (ISA), enabling students to define and implement their own ISA extensions. MLEM, a tapeout of Croc in IHP's open 130 nm node completed in eight weeks by a team of just two students, demonstrates the platform's viability for hands-on teaching in schools, universities, or even on a self-education path. In spring 2025, ETH Zurich will utilize Croc for its curricular VLSI class, involving up to 80 students, producing up to 40 OS application-specific integrated circuit layouts, and completing up to five student-led system-on-chip tapeouts. The lecture notes and exercises are already available under a Creative Commons license.
中文: Croc是一个专为实践教学和创新设计的开源RISC-V微控制器平台,提供从设计到实现的完整开源流程,使学生能自由开发定制芯片扩展,已在学术环境中验证可行性,并计划于2025年在苏黎世联邦理工学院大规模课程应用。
English: Croc is an open-source RISC-V microcontroller platform designed for hands-on education and innovation, featuring a complete end-to-end workflow that enables students to freely design and implement custom chip extensions, with proven success in academic settings and planned large-scale deployment at ETH Zurich in 2025.

Authors:Oleh Rybkin, Michal Nauman, Preston Fu, Charlie Snell, Pieter Abbeel, Sergey Levine, Aviral Kumar
Title: Value-Based Deep RL Scales Predictably
Abstract:
Scaling data and compute is critical to the success of modern ML. However, scaling demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from small-scale runs, without running the large-scale experiment. In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. First, we show that data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can predict this data requirement when given more compute, and this compute requirement when given more data. Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling is enabled by first estimating predictable relationships between hyperparameters, which is used to manage effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.
中文: 机器学习中的扩展需要从小规模实验中获得可预测的性能,本文通过估计数据与计算分配的帕累托前沿,证明了基于价值的离策略强化学习方法具有可预测性。
English: Scaling in machine learning requires predictable performance from small-scale experiments, and this paper demonstrates that value-based off-policy reinforcement learning methods are predictable by estimating Pareto frontiers for data and compute allocation.

Authors:Zhenyu Zhou, Defang Chen, Can Wang, Chun Chen, Siwei Lyu
Title: DICE: Distilling Classifier-Free Guidance into Text Embeddings
Abstract:
Text-to-image diffusion models are capable of generating high-quality images, but these images often fail to align closely with the given text prompts. Classifier-free guidance (CFG) is a popular and effective technique for improving text-image alignment in the generative process. However, using CFG introduces significant computational overhead and deviates from the established theoretical foundations of diffusion models. In this paper, we present DIstilling CFG by enhancing text Embeddings (DICE), a novel approach that removes the reliance on CFG in the generative process while maintaining the benefits it provides. DICE distills a CFG-based text-to-image diffusion model into a CFG-free version by refining text embeddings to replicate CFG-based directions. In this way, we avoid the computational and theoretical drawbacks of CFG, enabling high-quality, well-aligned image generation at a fast sampling speed. Extensive experiments on multiple Stable Diffusion v1.5 variants, SDXL and PixArt-$α$ demonstrate the effectiveness of our method. Furthermore, DICE supports negative prompts for image editing to improve image quality further. Code will be available soon.
中文: 本文提出DICE方法,通过优化文本嵌入将分类器自由引导(CFG)蒸馏至无需CFG的扩散模型中,在保持文本-图像对齐优势的同时,避免了CFG的计算负担和理论缺陷,实现了高效高质量的图像生成。
English: This paper introduces DICE, a method that distills classifier-free guidance (CFG) into a CFG-free diffusion model by refining text embeddings, enabling efficient, high-quality image generation with improved text alignment while eliminating CFG's computational and theoretical drawbacks.

Authors:Xinglong Sun, Maying Shen, Hongxu Yin, Lei Mao, Pavlo Molchanov, Jose M. Alvarez
Title: Advancing Weight and Channel Sparsification with Enhanced Saliency
Abstract:
Pruning aims to accelerate and compress models by removing redundant parameters, identified by specifically designed importance scores which are usually imperfect. This removal is irreversible, often leading to subpar performance in pruned models. Dynamic sparse training, while attempting to adjust sparse structures during training for continual reassessment and refinement, has several limitations including criterion inconsistency between pruning and growth, unsuitability for structured sparsity, and short-sighted growth strategies. Our paper introduces an efficient, innovative paradigm to enhance a given importance criterion for either unstructured or structured sparsity. Our method separates the model into an active structure for exploitation and an exploration space for potential updates. During exploitation, we optimize the active structure, whereas in exploration, we reevaluate and reintegrate parameters from the exploration space through a pruning and growing step consistently guided by the same given importance criterion. To prepare for exploration, we briefly "reactivate" all parameters in the exploration space and train them for a few iterations while keeping the active part frozen, offering a preview of the potential performance gains from reintegrating these parameters. We show on various datasets and configurations that existing importance criterion even simple as magnitude can be enhanced with ours to achieve state-of-the-art performance and training cost reductions. Notably, on ImageNet with ResNet50, ours achieves an +1.3 increase in Top-1 accuracy over prior art at 90% ERK sparsity. Compared with the SOTA latency pruning method HALP, we reduced its training cost by over 70% while attaining a faster and more accurate pruned model.
中文: 本文提出一种高效范式,通过将模型分为活跃和探索部分来增强剪枝的重要性标准,实现持续优化并以更低训练成本达到顶尖性能。
English: This paper introduces an efficient paradigm that enhances importance criteria for pruning by separating models into active and exploration parts, enabling consistent refinement and achieving state-of-the-art performance with reduced training costs.

Authors:Stavros Orfanoudakis, Peter Palensky, Pedro P. Vergara
Title: Optimizing Electric Vehicles Charging using Large Language Models and Graph Neural Networks
Abstract:
Maintaining grid stability amid widespread electric vehicle (EV) adoption is vital for sustainable transportation. Traditional optimization methods and Reinforcement Learning (RL) approaches often struggle with the high dimensionality and dynamic nature of real-time EV charging, leading to sub-optimal solutions. To address these challenges, this study demonstrates that combining Large Language Models (LLMs), for sequence modeling, with Graph Neural Networks (GNNs), for relational information extraction, not only outperforms conventional EV smart charging methods, but also paves the way for entirely new research directions and innovative solutions.
中文: 本研究证明,将大语言模型与图神经网络相结合,不仅优于传统的电动汽车智能充电方法,还为应对电动汽车普及带来的电网稳定性挑战开辟了全新研究方向。
English: This study shows that integrating Large Language Models with Graph Neural Networks surpasses traditional EV smart charging methods and opens new research avenues for grid stability amid widespread electric vehicle adoption.

Authors:Thomas Benz, Paul Scheffler, Nils Wistoff, Philippe Sauter, Beat Muheim, Luca Benini
Title: ArtistIC: An Open-Source Toolchain for Top-Metal IC Art and Ultra-High-Fidelity GDSII Renders
Abstract:
Open-source projects require outreach material to grow their community, secure funds, and strengthen their influence. Numbers, specifications, and facts alone are intangible to uninvolved people; using a clear brand and appealing visual material is thus ample to reach a broad audience. This is especially true for application-specific integrated circuits (ASICs) during the early stages of the development cycle without running prototype systems. This work presents ArtistIC, an open-source framework to brand ASICs with top-metal art and to render GDSII layouts with ultra-high fidelity reaching render densities below 25 nm/px and gigapixels-scale resolutions.
中文: 开源项目需要清晰的品牌和吸引人的视觉材料来扩大社区和获取资金,尤其在ASIC开发早期缺乏原型时,ArtistIC框架提供了实现高保真顶层金属艺术和GDSII版图渲染的解决方案。
English: Open-source projects need compelling branding and visuals to engage communities and secure funding, especially for ASICs lacking prototypes, so ArtistIC provides an open-source framework for high-fidelity top-metal art and GDSII layout rendering.

Authors:Younan Zhu, Linwei Tao, Minjing Dong, Chang Xu
Title: Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration
Abstract:
Large Vision-Language Models (LVLMs) exhibit impressive multimodal reasoning capabilities but remain highly susceptible to object hallucination, where models generate responses that are not factually aligned with the visual content. Recent works attribute this issue to an inherent bias of LVLMs where vision token attention map has a fixed correlation with spatial position, and propose to mitigate this issue by reordering visual tokens. However, we find that different LVLMs exhibit different correlations between attention and spatial position, which makes the existing solution difficult to generalize to other LVLMs. To address this issue, we first introduce a training-free solution, Uniform Attention Calibration (UAC), that estimates the bias from single meaningless input image and applies a calibration matrix to rectify attention imbalances. To further alleviate the bias, we relax the assumption of single meaningless input in UAC and introduce a fine-tuning solution, Dynamic Attention Calibration (DAC), that enforces the consistent outputs wherever the object locates in the image via a plug-and-plays module. Comprehensive experiments across multiple benchmarks demonstrate that UAC and DAC significantly reduce object hallucination while improving general multimodal alignment. Our methods achieve state-of-the-art performance across diverse LVLM architectures on various metrics.
中文: 大型视觉语言模型因注意力机制存在偏差而产生物体幻觉问题,而提出的统一注意力校准和动态注意力校准方法通过无需重新训练或微调的方式纠正注意力失衡,显著减少了幻觉现象,并在多个基准测试中取得了最优性能。
English: Large Vision-Language Models suffer from object hallucination due to biased attention mechanisms, but the proposed Uniform Attention Calibration and Dynamic Attention Calibration methods effectively mitigate this issue by rectifying attention imbalances without requiring model retraining or through fine-tuning, achieving state-of-the-art performance across multiple benchmarks.

Authors:Stavros Orfanoudakis, Nanda Kishor Panda, Peter Palensky, Pedro P. Vergara
Title: GNN-DT: Graph Neural Network Enhanced Decision Transformer for Efficient Optimization in Dynamic Environments
Abstract:
Reinforcement Learning (RL) methods used for solving real-world optimization problems often involve dynamic state-action spaces, larger scale, and sparse rewards, leading to significant challenges in convergence, scalability, and efficient exploration of the solution space. This study introduces GNN-DT, a novel Decision Transformer (DT) architecture that integrates Graph Neural Network (GNN) embedders with a novel residual connection between input and output tokens crucial for handling dynamic environments. By learning from previously collected trajectories, GNN-DT tackles the sparse rewards limitations of online RL algorithms and delivers high-quality solutions in real-time. We evaluate GNN-DT on the complex electric vehicle (EV) charging optimization problem and prove that its performance is superior and requires significantly fewer training trajectories, thus improving sample efficiency compared to existing DT and offline RL baselines. Furthermore, GNN-DT exhibits robust generalization to unseen environments and larger action spaces, addressing a critical gap in prior offline and online RL approaches.
中文: GNN-DT是一种新型决策变换器,结合图神经网络和残差连接,有效应对动态环境、稀疏奖励和大规模优化问题,在电动汽车充电场景中展现出卓越性能、样本效率和强大泛化能力。
English: GNN-DT is a novel Decision Transformer that integrates Graph Neural Networks and a residual connection to effectively handle dynamic environments, sparse rewards, and large-scale optimization, demonstrating superior performance and sample efficiency in EV charging scenarios with robust generalization.

Authors:Stavros Orfanoudakis, Nanda Kishor Panda, Peter Palensky, Pedro P. Vergara
Title: GNN-DT: Graph Neural Network Enhanced Decision Transformer for Efficient Optimization in Dynamic Environments
Abstract:
Reinforcement Learning (RL) methods used for solving real-world optimization problems often involve dynamic state-action spaces, larger scale, and sparse rewards, leading to significant challenges in convergence, scalability, and efficient exploration of the solution space. This study introduces GNN-DT, a novel Decision Transformer (DT) architecture that integrates Graph Neural Network (GNN) embedders with a novel residual connection between input and output tokens crucial for handling dynamic environments. By learning from previously collected trajectories, GNN-DT tackles the sparse rewards limitations of online RL algorithms and delivers high-quality solutions in real-time. We evaluate GNN-DT on the complex electric vehicle (EV) charging optimization problem and prove that its performance is superior and requires significantly fewer training trajectories, thus improving sample efficiency compared to existing DT and offline RL baselines. Furthermore, GNN-DT exhibits robust generalization to unseen environments and larger action spaces, addressing a critical gap in prior offline and online RL approaches.
中文: GNN-DT是一种新型决策变换器,结合图神经网络和残差连接,有效应对动态环境、稀疏奖励和大规模优化问题,在电动汽车充电场景中展现出卓越性能、样本效率和强大泛化能力。
English: GNN-DT is a novel Decision Transformer that integrates Graph Neural Networks and a residual connection to effectively handle dynamic environments, sparse rewards, and large-scale optimization, demonstrating superior performance and sample efficiency in EV charging scenarios with robust generalization.

Authors:Chengkai Xu, Jiaqi Liu, Shiyu Fang, Yiming Cui, Dong Chen, Peng Hang, Jian Sun
Title: TeLL-Drive: Enhancing Autonomous Driving with Teacher LLM-Guided Deep Reinforcement Learning
Abstract:
Although Deep Reinforcement Learning (DRL) and Large Language Models (LLMs) each show promise in addressing decision-making challenges in autonomous driving, DRL often suffers from high sample complexity, while LLMs have difficulty ensuring real-time decision making. To address these limitations, we propose TeLL-Drive, a hybrid framework that integrates a Teacher LLM to guide an attention-based Student DRL policy. By incorporating risk metrics, historical scenario retrieval, and domain heuristics into context-rich prompts, the LLM produces high-level driving strategies through chain-of-thought reasoning. A self-attention mechanism then fuses these strategies with the DRL agent's exploration, accelerating policy convergence and boosting robustness across diverse driving conditions. The experimental results, evaluated across multiple traffic scenarios, show that TeLL-Drive outperforms existing baseline methods, including other LLM-based approaches, in terms of success rates, average returns, and real-time feasibility. Ablation studies underscore the importance of each model component, especially the synergy between the attention mechanism and LLM-driven guidance. Finally, we build a virtual-real fusion experimental platform to verify the real-time performance, robustness, and reliability of the algorithm running on real vehicles through vehicle-in-loop experiments.
中文摘要:TeLL-Drive是一种混合框架,通过教师大语言模型的战略指导与学生深度强化学习策略相结合,利用注意力机制和情境提示提升自动驾驶在不同场景下的综合性能。
English summary: TeLL-Drive is a hybrid framework that combines a Teacher LLM's strategic guidance with a Student DRL policy, using attention mechanisms and contextual prompts to enhance autonomous driving performance across various scenarios.

Authors:Xiucheng Wang, Xuan Zhao, Nan Cheng
Title: Differentiable Projection-based Learn to Optimize in Wireless Network-Part I: Convex Constrained (Non-)Convex Programming
Abstract:
This paper addresses a class of (non-)convex optimization problems subject to general convex constraints, which pose significant challenges for traditional methods due to their inherent non-convexity and diversity. Conventional convex optimization-based solvers often struggle to efficiently handle these problems in their most general form. While neural network (NN)-based approaches offer a promising alternative, ensuring the feasibility of NN-generated solutions and effectively training the NN remain key hurdles, largely because finite-capacity networks can produce infeasible outputs. To overcome these issues, we propose a projection-based method that projects any infeasible NN output onto the feasible domain, thus guaranteeing strict adherence to the constraints without compromising the NN's optimization capability. Furthermore, we derive the objective function values for both the raw NN outputs and their projected counterparts, along with the gradients of these values with respect to the NN parameters. This derivation enables label-free (unsupervised) training, reducing reliance on labeled data and improving scalability. Experimental results demonstrate that the proposed projection-based method consistently ensures feasibility.
中文: 本文提出了一种基于投影的神经网络方法,通过将输出投影到可行域来保证复杂优化问题的解可行性,并通过推导梯度实现无监督训练。
English: This paper introduces a projection-based neural network approach that guarantees feasible solutions for complex optimization problems by projecting outputs onto the feasible domain and enabling unsupervised training through derived gradients.

Authors:Abdelrahman Abdallah, Bhawna Piryani, Jonas Wallat, Avishek Anand, Adam Jatowt
Title: TempRetriever: Fusion-based Temporal Dense Passage Retrieval for Time-Sensitive Questions
Abstract:
Temporal awareness is crucial in many information retrieval tasks, particularly in scenarios where the relevance of documents depends on their alignment with the query's temporal context. Traditional approaches such as BM25 and Dense Passage Retrieval (DPR) focus on lexical or semantic similarity but tend to neglect the temporal alignment between queries and documents, which is essential for time-sensitive tasks like temporal question answering (TQA). We propose TempRetriever, a novel extension of DPR that explicitly incorporates temporal information by embedding both the query date and document timestamp into the retrieval process. This allows retrieving passages that are not only contextually relevant but also aligned with the temporal intent of queries. We evaluate TempRetriever on two large-scale datasets ArchivalQA and ChroniclingAmericaQA demonstrating its superiority over baseline retrieval models across multiple metrics. TempRetriever achieves a 6.63\% improvement in Top-1 retrieval accuracy and a 3.79\% improvement in NDCG@10 compared to the standard DPR on ArchivalQA. Similarly, for ChroniclingAmericaQA, TempRetriever exhibits a 9.56\% improvement in Top-1 retrieval accuracy and a 4.68\% improvement in NDCG@10. We also propose a novel, time-based negative sampling strategy which further enhances retrieval performance by addressing temporal misalignment during training. Our results underline the importance of temporal aspects in dense retrieval systems and establish a new benchmark for time-aware passage retrieval.
中文: TempRetriever通过将时间信息融入密集段落检索,在时间敏感数据集上显著提升了检索准确率和NDCG评分。
English: TempRetriever enhances Dense Passage Retrieval by incorporating temporal information, achieving significant improvements in retrieval accuracy and NDCG scores on time-sensitive datasets.

Authors:Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, Mohammed Ali, Adam Jatowt
Title: From Retrieval to Generation: Comparing Different Approaches
Abstract:
Knowledge-intensive tasks, particularly open-domain question answering (ODQA), document reranking, and retrieval-augmented language modeling, require a balance between retrieval accuracy and generative flexibility. Traditional retrieval models such as BM25 and Dense Passage Retrieval (DPR), efficiently retrieve from large corpora but often lack semantic depth. Generative models like GPT-4-o provide richer contextual understanding but face challenges in maintaining factual consistency. In this work, we conduct a systematic evaluation of retrieval-based, generation-based, and hybrid models, with a primary focus on their performance in ODQA and related retrieval-augmented tasks. Our results show that dense retrievers, particularly DPR, achieve strong performance in ODQA with a top-1 accuracy of 50.17\% on NQ, while hybrid models improve nDCG@10 scores on BEIR from 43.42 (BM25) to 52.59, demonstrating their strength in document reranking. Additionally, we analyze language modeling tasks using WikiText-103, showing that retrieval-based approaches like BM25 achieve lower perplexity compared to generative and hybrid methods, highlighting their utility in retrieval-augmented generation. By providing detailed comparisons and practical insights into the conditions where each approach excels, we aim to facilitate future optimizations in retrieval, reranking, and generative models for ODQA and related knowledge-intensive applications.
Chinese: 本研究系统评估了基于检索、生成及混合模型在开放域问答等知识密集型任务中的表现,结果表明稠密检索器在准确性上表现优异,而混合模型提升了文档重排序效果,并为优化这些方法提供了实用指导。
English: This study systematically evaluates retrieval-based, generation-based, and hybrid models for knowledge-intensive tasks like open-domain question answering, showing that dense retrievers excel in accuracy while hybrid models improve document reranking, with practical insights provided for optimizing these approaches.

Authors:Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie
Title: Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy
Abstract:
Building an agent that can mimic human behavior patterns to accomplish various open-world tasks is a long-term goal. To enable agents to effectively learn behavioral patterns across diverse tasks, a key challenge lies in modeling the intricate relationships among observations, actions, and language. To this end, we propose Optimus-2, a novel Minecraft agent that incorporates a Multimodal Large Language Model (MLLM) for high-level planning, alongside a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. GOAP contains (1) an Action-guided Behavior Encoder that models causal relationships between observations and actions at each timestep, then dynamically interacts with the historical observation-action sequence, consolidating it into fixed-length behavior tokens, and (2) an MLLM that aligns behavior tokens with open-ended language instructions to predict actions auto-regressively. Moreover, we introduce a high-quality Minecraft Goal-Observation-Action (MGOA)} dataset, which contains 25,000 videos across 8 atomic tasks, providing about 30M goal-observation-action pairs. The automated construction method, along with the MGOA dataset, can contribute to the community's efforts to train Minecraft agents. Extensive experimental results demonstrate that Optimus-2 exhibits superior performance across atomic tasks, long-horizon tasks, and open-ended instruction tasks in Minecraft. Please see the project page at https://cybertronagent.github.io/Optimus-2.github.io/.
中文摘要:本文提出Optimus-2智能体,通过结合多模态大语言模型的高层规划与目标-观察-动作条件策略的底层控制,并利用新型MGOA数据集,在《我的世界》各类任务中展现出卓越性能。
English Summary: The paper introduces Optimus-2, a Minecraft agent combining a Multimodal Large Language Model for planning with a Goal-Observation-Action Conditioned Policy for control, demonstrating superior performance across various tasks through a novel dataset and architecture.

Authors:Qingpei Guo, Kaiyou Song, Zipeng Feng, Ziping Ma, Qinglong Zhang, Sirui Gao, Xuzheng Yu, Yunxiao Sun, Tai-Wei Chang, Jingdong Chen, Ming Yang, Jun Zhou
Title: M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
Abstract:
We present M2-omni, a cutting-edge, open-source omni-MLLM that achieves competitive performance to GPT-4o. M2-omni employs a unified multimodal sequence modeling framework, which empowers Large Language Models(LLMs) to acquire comprehensive cross-modal understanding and generation capabilities. Specifically, M2-omni can process arbitrary combinations of audio, video, image, and text modalities as input, generating multimodal sequences interleaving with audio, image, or text outputs, thereby enabling an advanced and interactive real-time experience. The training of such an omni-MLLM is challenged by significant disparities in data quantity and convergence rates across modalities. To address these challenges, we propose a step balance strategy during pre-training to handle the quantity disparities in modality-specific data. Additionally, a dynamically adaptive balance strategy is introduced during the instruction tuning stage to synchronize the modality-wise training progress, ensuring optimal convergence. Notably, we prioritize preserving strong performance on pure text tasks to maintain the robustness of M2-omni's language understanding capability throughout the training process. To our best knowledge, M2-omni is currently a very competitive open-source model to GPT-4o, characterized by its comprehensive modality and task support, as well as its exceptional performance. We expect M2-omni will advance the development of omni-MLLMs, thus facilitating future research in this domain.
中文: M2-omni是一款能与GPT-4o媲美的开源多模态模型,它能处理并生成音频、视频、图像和文本,通过平衡训练策略保持语言能力,推动全模态大语言模型的发展。
English: M2-omni is a competitive open-source multimodal model that processes and generates audio, video, image, and text, utilizing balanced training strategies to maintain strong language performance while advancing omni-MLLM development.

Authors:Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan
Title: AMPO: Active Multi-Preference Optimization for Self-play Preference Selection
Abstract:
Multi-preference optimization enriches language-model alignment beyond pairwise preferences by contrasting entire sets of helpful and undesired responses, thereby enabling richer training signals for large language models. During self-play alignment, these models often produce numerous candidate answers per query, rendering it computationally infeasible to include all responses in the training objective. In this work, we propose $\textit{Active Multi-Preference Optimization}$ (AMPO), a novel approach that combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection. Specifically, we score and embed large candidate pools of responses and then select a small, yet informative, subset that covers reward extremes and distinct semantic clusters for preference optimization. Our contrastive training scheme is capable of identifying not only the best and worst answers but also subtle, underexplored modes that are crucial for robust alignment. Theoretically, we provide guarantees for expected reward maximization using our active selection method, and empirically, AMPO achieves state-of-the-art results on $\textit{AlpacaEval}$ using Llama 8B and Mistral 7B. We release our datasets $\href{https://huggingface.co/Multi-preference-Optimization}{here}$.
中文摘要:主动多偏好优化(AMPO)提出了一种新颖的对齐方法,通过主动选择和多偏好对比训练从大规模候选回答中高效选取信息丰富的子集进行优化,在保证理论性能的同时实现了最先进的实验结果。
English Summary: Active Multi-Preference Optimization (AMPO) introduces a novel alignment method that efficiently selects informative response subsets from large candidate pools using active selection and multi-preference contrastive training, achieving state-of-the-art performance with theoretical guarantees.

Authors:Weiji Xie, Chenjia Bai, Jiyuan Shi, Junkai Yang, Yunfei Ge, Weinan Zhang, Xuelong Li
Title: Humanoid Whole-Body Locomotion on Narrow Terrain via Dynamic Balance and Reinforcement Learning
Abstract:
Humans possess delicate dynamic balance mechanisms that enable them to maintain stability across diverse terrains and under extreme conditions. However, despite significant advances recently, existing locomotion algorithms for humanoid robots are still struggle to traverse extreme environments, especially in cases that lack external perception (e.g., vision or LiDAR). This is because current methods often rely on gait-based or perception-condition rewards, lacking effective mechanisms to handle unobservable obstacles and sudden balance loss. To address this challenge, we propose a novel whole-body locomotion algorithm based on dynamic balance and Reinforcement Learning (RL) that enables humanoid robots to traverse extreme terrains, particularly narrow pathways and unexpected obstacles, using only proprioception. Specifically, we introduce a dynamic balance mechanism by leveraging an extended measure of Zero-Moment Point (ZMP)-driven rewards and task-driven rewards in a whole-body actor-critic framework, aiming to achieve coordinated actions of the upper and lower limbs for robust locomotion. Experiments conducted on a full-sized Unitree H1-2 robot verify the ability of our method to maintain balance on extremely narrow terrains and under external disturbances, demonstrating its effectiveness in enhancing the robot's adaptability to complex environments. The videos are given at https://whole-body-loco.github.io.
中文摘要:本文提出一种基于动态平衡和强化学习的全身运动算法,通过零力矩点驱动奖励和任务奖励的协调机制,使人形机器人仅凭本体感知即可在极端地形上保持平衡并应对外部干扰。
English Summary: This paper introduces a novel whole-body locomotion algorithm using dynamic balance and reinforcement learning, enabling humanoid robots to navigate extreme terrains with only proprioception by coordinating upper and lower limbs through specialized balance rewards.

Authors:Muhammad Haris Khan, Artyom Myshlyaev, Artem Lykov, Miguel Altamirano Cabrera, Dzmitry Tsetserukou
Title: Evolution 6.0: Evolving Robotic Capabilities Through Generative Design
Abstract:
We propose a new concept, Evolution 6.0, which represents the evolution of robotics driven by Generative AI. When a robot lacks the necessary tools to accomplish a task requested by a human, it autonomously designs the required instruments and learns how to use them to achieve the goal. Evolution 6.0 is an autonomous robotic system powered by Vision-Language Models (VLMs), Vision-Language Action (VLA) models, and Text-to-3D generative models for tool design and task execution. The system comprises two key modules: the Tool Generation Module, which fabricates task-specific tools from visual and textual data, and the Action Generation Module, which converts natural language instructions into robotic actions. It integrates QwenVLM for environmental understanding, OpenVLA for task execution, and Llama-Mesh for 3D tool generation. Evaluation results demonstrate a 90% success rate for tool generation with a 10-second inference time, and action generation achieving 83.5% in physical and visual generalization, 70% in motion generalization, and 37% in semantic generalization. Future improvements will focus on bimanual manipulation, expanded task capabilities, and enhanced environmental interpretation to improve real-world adaptability.
中文摘要:Evolution 6.0是一种自主机器人系统,通过生成式AI在缺少工具时自主设计制造工具,并借助视觉语言模型实现高达90%的工具生成成功率和83.5%的动作泛化能力。
English Summary: Evolution 6.0 is an autonomous robotic system that uses generative AI to design and fabricate tools when needed, achieving high success rates in tool generation and task execution through integrated vision-language models.

Authors:Maike Züfle, Sara Papi, Beatrice Savoldi, Marco Gaido, Luisa Bentivogli, Jan Niehues
Title: NUTSHELL: A Dataset for Abstract Generation from Scientific Talks
Abstract:
Scientific communication is receiving increasing attention in natural language processing, especially to help researches access, summarize, and generate content. One emerging application in this area is Speech-to-Abstract Generation (SAG), which aims to automatically generate abstracts from recorded scientific presentations. SAG enables researchers to efficiently engage with conference talks, but progress has been limited by a lack of large-scale datasets. To address this gap, we introduce NUTSHELL, a novel multimodal dataset of *ACL conference talks paired with their corresponding abstracts. We establish strong baselines for SAG and evaluate the quality of generated abstracts using both automatic metrics and human judgments. Our results highlight the challenges of SAG and demonstrate the benefits of training on NUTSHELL. By releasing NUTSHELL under an open license (CC-BY 4.0), we aim to advance research in SAG and foster the development of improved models and evaluation methods.
中文: 本研究推出了NUTSHELL多模态数据集,通过提供配对的学术报告与摘要,为语音到摘要生成领域建立基准并验证其有效性,旨在推动该技术发展。
English: The study introduces NUTSHELL, a multimodal dataset designed to advance Speech-to-Abstract Generation (SAG) by providing paired conference talks and abstracts, establishing baselines and demonstrating its utility through evaluations.

Authors:Bhawna Piryani, Jamshid Mozafari, Abdelrahman Abdallah, Antoine Doucet, Adam Jatowt
Title: Evaluating Robustness of LLMs in Question Answering on Multilingual Noisy OCR Data
Abstract:
Optical Character Recognition (OCR) plays a crucial role in digitizing historical and multilingual documents, yet OCR errors - imperfect extraction of text, including character insertion, deletion, and substitution can significantly impact downstream tasks like question-answering (QA). In this work, we conduct a comprehensive analysis of how OCR-induced noise affects the performance of Multilingual QA Systems. To support this analysis, we introduce a multilingual QA dataset MultiOCR-QA, comprising 50K question-answer pairs across three languages, English, French, and German. The dataset is curated from OCR-ed historical documents, which include different levels and types of OCR noise. We then evaluate how different state-of-the-art Large Language Models (LLMs) perform under different error conditions, focusing on three major OCR error types. Our findings show that QA systems are highly prone to OCR-induced errors and perform poorly on noisy OCR text. By comparing model performance on clean versus noisy texts, we provide insights into the limitations of current approaches and emphasize the need for more noise-resilient QA systems in historical digitization contexts.
中文: 本研究通过构建多语言问答数据集,分析了OCR错误对问答系统的影响,发现现有模型对噪声文本表现不佳,强调了历史文献数字化中开发抗干扰系统的必要性。
English: This study analyzes how OCR errors impact multilingual question-answering systems, revealing their vulnerability to noise through a curated dataset and highlighting the need for more resilient models in historical document processing.

Authors:Hongzhe Cheng, Tianyou Zheng, Tianyi Zhang, Matthew Johnson-Roberson, Weiming Zhi
Title: DOSE3 : Diffusion-based Out-of-distribution detection on SE(3) trajectories
Abstract:
Out-of-Distribution(OOD) detection, a fundamental machine learning task aimed at identifying abnormal samples, traditionally requires model retraining for different inlier distributions. While recent research demonstrates the applicability of diffusion models to OOD detection, existing approaches are limited to Euclidean or latent image spaces. Our work extends OOD detection to trajectories in the Special Euclidean Group in 3D ($\mathbb{SE}(3)$), addressing a critical need in computer vision, robotics, and engineering applications that process object pose sequences in $\mathbb{SE}(3)$. We present $\textbf{D}$iffusion-based $\textbf{O}$ut-of-distribution detection on $\mathbb{SE}(3)$ ($\mathbf{DOSE3}$), a novel OOD framework that extends diffusion to a unified sample space of $\mathbb{SE}(3)$ pose sequences. Through extensive validation on multiple benchmark datasets, we demonstrate $\mathbf{DOSE3}$'s superior performance compared to state-of-the-art OOD detection frameworks.
中文: 本研究提出DOSE3这一创新框架,将基于扩散模型的分布外检测扩展至三维特殊欧几里得群中的姿态轨迹,并通过多基准测试验证了其优于现有方法的卓越性能。
English: This study introduces DOSE3, a novel diffusion-based framework that extends out-of-distribution detection to 3D pose trajectories in the Special Euclidean Group, demonstrating superior performance over existing methods through comprehensive benchmark evaluations.

Authors:Jamshid Mozafari, Abdelrahman Abdallah, Bhawna Piryani, Adam Jatowt
Title: Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores
Abstract:
Large Language Models (LLMs) are revolutionizing information retrieval, with chatbots becoming an important source for answering user queries. As by their design, LLMs prioritize generating correct answers, the value of highly plausible yet incorrect answers (candidate answers) tends to be overlooked. However, such answers can still prove useful, for example, they can play a crucial role in tasks like Multiple-Choice Question Answering (MCQA) and QA Robustness Assessment (QARA). Existing QA datasets primarily focus on correct answers without explicit consideration of the plausibility of other candidate answers, limiting opportunity for more nuanced evaluations of models. To address this gap, we introduce PlausibleQA, a large-scale dataset comprising 10,000 questions and 100,000 candidate answers, each annotated with plausibility scores and justifications for their selection. Additionally, the dataset includes 900,000 justifications for pairwise comparisons between candidate answers, further refining plausibility assessments. We evaluate PlausibleQA through human assessments and empirical experiments, demonstrating its utility in MCQA and QARA analysis. Our findings show that plausibility-aware approaches are effective for MCQA distractor generation and QARA. We release PlausibleQA as a resource for advancing QA research and enhancing LLM performance in distinguishing plausible distractors from correct answers.
Chinese: 大型语言模型常忽视看似合理但错误的答案,因此我们开发了PlausibleQA数据集,通过标注答案的合理度评分和理由,来增强多项选择题回答和问答鲁棒性评估等任务的性能。
English: Large Language Models often overlook plausible but incorrect answers, so we created PlausibleQA, a dataset with annotated plausibility scores and justifications to improve tasks like Multiple-Choice Question Answering and QA Robustness Assessment.

Authors:Francesco Bacchiocchi, Jiarui Gan, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
Title: Contract Design Under Approximate Best Responses
Abstract:
Principal-agent problems model scenarios where a principal incentivizes an agent to take costly, unobservable actions through the provision of payments. Such problems are ubiquitous in several real-world applications, ranging from blockchain to the delegation of machine learning tasks. In this paper, we initiate the study of hidden-action principal-agent problems under approximate best responses, in which the agent may select any action that is not too much suboptimal given the principal's payment scheme (a.k.a. contract). Our main result is a polynomial-time algorithm to compute an optimal contract under approximate best responses. This positive result is perhaps surprising, since, in Stackelberg games, computing an optimal commitment under approximate best responses is computationally intractable. We also investigate the learnability of contracts under approximate best responses, by providing a no-regret learning algorithm for a natural application scenario where the principal has no prior knowledge about the environment.
中文: 本文提出了一种在近似最优响应下计算委托-代理问题中最优合约的多项式时间算法,解决了代理人选择接近最优行动的场景,并通过无遗憾学习方法探讨了在委托方缺乏环境知识时的合约可学习性。
English: This paper introduces a polynomial-time algorithm for computing optimal contracts in principal-agent problems under approximate best responses, addressing scenarios where agents choose near-optimal actions, and also explores contract learnability through a no-regret learning approach when the principal lacks environmental knowledge.

Authors:Junhyeok Kim, Min Soo Kim, Jiwan Chung, Jungbin Cho, Jisoo Kim, Sungwoong Kim, Gyeongbo Sim, Youngjae Yu
Title: EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild
Abstract:
Predicting when to initiate speech in real-world environments remains a fundamental challenge for conversational agents. We introduce EgoSpeak, a novel framework for real-time speech initiation prediction in egocentric streaming video. By modeling the conversation from the speaker's first-person viewpoint, EgoSpeak is tailored for human-like interactions in which a conversational agent must continuously observe its environment and dynamically decide when to talk. Our approach bridges the gap between simplified experimental setups and complex natural conversations by integrating four key capabilities: (1) first-person perspective, (2) RGB processing, (3) online processing, and (4) untrimmed video processing. We also present YT-Conversation, a diverse collection of in-the-wild conversational videos from YouTube, as a resource for large-scale pretraining. Experiments on EasyCom and Ego4D demonstrate that EgoSpeak outperforms random and silence-based baselines in real time. Our results also highlight the importance of multimodal input and context length in effectively deciding when to speak.
中文: EgoSpeak是一种新颖的实时语音起始预测框架,通过整合第一人称视角、RGB处理、在线及未剪辑视频处理能力,弥合了实验设置与自然对话之间的差距,并在多模态实验中优于基线方法。
English: EgoSpeak is a novel framework that predicts speech initiation in real-time from egocentric video, bridging experimental setups and natural conversations by integrating first-person perspective, RGB processing, online and untrimmed video capabilities, and outperforms baselines in multimodal experiments.

Authors:Shashwat Jaiswal, Kunal Jain, Yogesh Simmhan, Anjaly Parayil, Ankur Mallick, Rujia Wang, Renee St. Amant, Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan
Title: SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling
Abstract:
Global cloud service providers handle inference workloads for Large Language Models (LLMs) that span latency-sensitive (e.g., chatbots) and insensitive (e.g., report writing) tasks, resulting in diverse and often conflicting Service Level Agreement (SLA) requirements. Managing such mixed workloads is challenging due to the complexity of the inference serving stack, which encompasses multiple models, GPU hardware, and global data centers. Existing solutions often silo such fast and slow tasks onto separate GPU resource pools with different SLAs, but this leads to significant under-utilization of expensive accelerators due to load mismatch. In this article, we characterize the LLM serving workloads at Microsoft Office 365, one of the largest users of LLMs within Microsoft Azure cloud with over 10 million requests per day, and highlight key observations across workloads in different data center regions and across time. This is one of the first such public studies of Internet-scale LLM workloads. We use these insights to propose SageServe, a comprehensive LLM serving framework that dynamically adapts to workload demands using multi-timescale control knobs. It combines short-term request routing to data centers with long-term scaling of GPU VMs and model placement with higher lead times, and co-optimizes the routing and resource allocation problem using a traffic forecast model and an Integer Linear Programming (ILP) solution. We evaluate SageServe through real runs and realistic simulations on 10 million production requests across three regions and four open-source models. We achieve up to 25% savings in GPU-hours compared to the current baseline deployment and reduce GPU-hour wastage due to inefficient auto-scaling by 80%, resulting in a potential monthly cost savings of up to $2.5 million, while maintaining tail latency and meeting SLAs.
中文: 全球云服务商在管理具有不同服务等级协议的混合大语言模型工作负载时面临GPU利用率低的挑战,而SageServe通过多时间尺度的动态优化方案,在保证性能的同时实现GPU小时消耗降低25%,每月潜在节省成本达250万美元。
English: Global cloud providers face challenges in efficiently managing mixed LLM workloads with conflicting SLAs, leading to GPU underutilization, which SageServe addresses through dynamic multi-timescale optimization to save up to 25% in GPU-hours and reduce costs by $2.5 million monthly while meeting performance requirements.

Authors:Pengfei He, Yue Xing, Han Xu, Zhen Xiang, Jiliang Tang
Title: Multi-Faceted Studies on Data Poisoning can Advance LLM Development
Abstract:
The lifecycle of large language models (LLMs) is far more complex than that of traditional machine learning models, involving multiple training stages, diverse data sources, and varied inference methods. While prior research on data poisoning attacks has primarily focused on the safety vulnerabilities of LLMs, these attacks face significant challenges in practice. Secure data collection, rigorous data cleaning, and the multistage nature of LLM training make it difficult to inject poisoned data or reliably influence LLM behavior as intended. Given these challenges, this position paper proposes rethinking the role of data poisoning and argue that multi-faceted studies on data poisoning can advance LLM development. From a threat perspective, practical strategies for data poisoning attacks can help evaluate and address real safety risks to LLMs. From a trustworthiness perspective, data poisoning can be leveraged to build more robust LLMs by uncovering and mitigating hidden biases, harmful outputs, and hallucinations. Moreover, from a mechanism perspective, data poisoning can provide valuable insights into LLMs, particularly the interplay between data and model behavior, driving a deeper understanding of their underlying mechanisms.
中文摘要:大型语言模型生命周期的复杂性使数据投毒攻击在实践中面临挑战,但相关研究可从威胁、可信度和机制角度促进模型安全评估、鲁棒性提升及内在机理探索。
English Summary: The complexity of large language models' lifecycle makes practical data poisoning attacks challenging, yet studying them can enhance LLM safety, robustness, and mechanistic understanding.

Authors:Katie Z Luo, Minh-Quan Dao, Zhenzhen Liu, Mark Campbell, Wei-Lun Chao, Kilian Q. Weinberger, Ezio Malis, Vincent Fremont, Bharath Hariharan, Mao Shan, Stewart Worrall, Julie Stephany Berrio Perez
Title: Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration
Abstract:
Vehicle-to-everything (V2X) collaborative perception has emerged as a promising solution to address the limitations of single-vehicle perception systems. However, existing V2X datasets are limited in scope, diversity, and quality. To address these gaps, we present Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes collected from three connected autonomous vehicles (CAVs) equipped with two different configurations of LiDAR sensors, plus a roadside unit with dual LiDARs. Our dataset provides point clouds and bounding box annotations across 10 classes, ensuring reliable data for perception training. We provide detailed statistical analysis on the quality of our dataset and extensively benchmark existing V2X methods on it. The Mixed Signals dataset is ready-to-use, with precise alignment and consistent annotations across time and viewpoints. Dataset website is available at https://mixedsignalsdataset.cs.cornell.edu/.
Chinese: Mixed Signals数据集通过整合多辆互联自动驾驶汽车和路边单元采集的全面高质量数据,提供了详细的标注和基准测试,弥补了现有V2X数据集在范围、多样性和质量上的不足。
English: The Mixed Signals dataset addresses the limitations of existing V2X datasets by providing comprehensive, high-quality data from multiple connected autonomous vehicles and a roadside unit, complete with detailed annotations and benchmarking for V2X collaborative perception research.

Authors:Katie Z Luo, Minh-Quan Dao, Zhenzhen Liu, Mark Campbell, Wei-Lun Chao, Kilian Q. Weinberger, Ezio Malis, Vincent Fremont, Bharath Hariharan, Mao Shan, Stewart Worrall, Julie Stephany Berrio Perez
Title: Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration
Abstract:
Vehicle-to-everything (V2X) collaborative perception has emerged as a promising solution to address the limitations of single-vehicle perception systems. However, existing V2X datasets are limited in scope, diversity, and quality. To address these gaps, we present Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes collected from three connected autonomous vehicles (CAVs) equipped with two different configurations of LiDAR sensors, plus a roadside unit with dual LiDARs. Our dataset provides point clouds and bounding box annotations across 10 classes, ensuring reliable data for perception training. We provide detailed statistical analysis on the quality of our dataset and extensively benchmark existing V2X methods on it. The Mixed Signals dataset is ready-to-use, with precise alignment and consistent annotations across time and viewpoints. Dataset website is available at https://mixedsignalsdataset.cs.cornell.edu/.
Chinese: Mixed Signals数据集通过整合多辆互联自动驾驶汽车和路边单元采集的全面高质量数据,提供了详细的标注和基准测试,弥补了现有V2X数据集在范围、多样性和质量上的不足。
English: The Mixed Signals dataset addresses the limitations of existing V2X datasets by providing comprehensive, high-quality data from multiple connected autonomous vehicles and a roadside unit, complete with detailed annotations and benchmarking for V2X collaborative perception research.

Authors:Hongjin Qian, Zheng Liu, Chao Gao, Yankai Wang, Defu Lian, Zhicheng Dou
Title: HawkBench: Investigating Resilience of RAG Methods on Stratified Information-Seeking Tasks
Abstract:
In real-world information-seeking scenarios, users have dynamic and diverse needs, requiring RAG systems to demonstrate adaptable resilience. To comprehensively evaluate the resilience of current RAG methods, we introduce HawkBench, a human-labeled, multi-domain benchmark designed to rigorously assess RAG performance across categorized task types. By stratifying tasks based on information-seeking behaviors, HawkBench provides a systematic evaluation of how well RAG systems adapt to diverse user needs. Unlike existing benchmarks, which focus primarily on specific task types (mostly factoid queries) and rely on varying knowledge bases, HawkBench offers: (1) systematic task stratification to cover a broad range of query types, including both factoid and rationale queries, (2) integration of multi-domain corpora across all task types to mitigate corpus bias, and (3) rigorous annotation for high-quality evaluation. HawkBench includes 1,600 high-quality test samples, evenly distributed across domains and task types. Using this benchmark, we evaluate representative RAG methods, analyzing their performance in terms of answer quality and response latency. Our findings highlight the need for dynamic task strategies that integrate decision-making, query interpretation, and global knowledge understanding to improve RAG generalizability. We believe HawkBench serves as a pivotal benchmark for advancing the resilience of RAG methods and their ability to achieve general-purpose information seeking.
中文: HawkBench是一个人工标注的多领域基准,通过任务分层和多领域语料库系统评估RAG系统在多样化查询中的适应能力,旨在克服现有基准的局限,提升信息检索的普适性与鲁棒性。
English: HawkBench is a human-labeled, multi-domain benchmark designed to systematically evaluate the resilience of RAG systems across diverse query types and domains, addressing limitations of existing benchmarks by incorporating task stratification and multi-domain corpora to enhance adaptability and generalizability.

Authors:Bo Wang, Weiyi He, Shenglai Zeng, Zhen Xiang, Yue Xing, Jiliang Tang, Pengfei He
Title: Unveiling Privacy Risks in LLM Agent Memory
Abstract:
Large Language Model (LLM) agents have become increasingly prevalent across various real-world applications. They enhance decision-making by storing private user-agent interactions in the memory module for demonstrations, introducing new privacy risks for LLM agents. In this work, we systematically investigate the vulnerability of LLM agents to our proposed Memory EXTRaction Attack (MEXTRA) under a black-box setting. To extract private information from memory, we propose an effective attacking prompt design and an automated prompt generation method based on different levels of knowledge about the LLM agent. Experiments on two representative agents demonstrate the effectiveness of MEXTRA. Moreover, we explore key factors influencing memory leakage from both the agent designer's and the attacker's perspectives. Our findings highlight the urgent need for effective memory safeguards in LLM agent design and deployment.
中文摘要:大型语言模型智能体面临记忆提取攻击(如MEXTRA)带来的隐私风险,该攻击能有效窃取用户隐私数据,凸显了加强记忆保护措施的紧迫性。
English Summary: Large Language Model agents face privacy risks from memory extraction attacks like MEXTRA, which effectively steal private user data, highlighting the urgent need for better memory protection measures.

Authors:Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Shreedhar Jangam, Jayanth Srinivasa, Gaowen Liu, Dawn Song, Xin Eric Wang
Title: The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1
Abstract:
The rapid development of large reasoning models, such as OpenAI-o3 and DeepSeek-R1, has led to significant improvements in complex reasoning over non-reasoning large language models~(LLMs). However, their enhanced capabilities, combined with the open-source access of models like DeepSeek-R1, raise serious safety concerns, particularly regarding their potential for misuse. In this work, we present a comprehensive safety assessment of these reasoning models, leveraging established safety benchmarks to evaluate their compliance with safety regulations. Furthermore, we investigate their susceptibility to adversarial attacks, such as jailbreaking and prompt injection, to assess their robustness in real-world applications. Through our multi-faceted analysis, we uncover four key findings: (1) There is a significant safety gap between the open-source R1 models and the o3-mini model, on both safety benchmark and attack, suggesting more safety effort on R1 is needed. (2) The distilled reasoning model shows poorer safety performance compared to its safety-aligned base models. (3) The stronger the model's reasoning ability, the greater the potential harm it may cause when answering unsafe questions. (4) The thinking process in R1 models pose greater safety concerns than their final answers. Our study provides insights into the security implications of reasoning models and highlights the need for further advancements in R1 models' safety to close the gap.
中文: 本研究对OpenAI-o3和DeepSeek-R1等推理模型进行安全评估,发现开源模型存在显著安全缺陷,其增强的推理能力反而会在处理危险问题时产生更大危害,凸显了加强安全措施的必要性。
English: This study conducts a safety evaluation of reasoning models like OpenAI-o3 and DeepSeek-R1, revealing critical vulnerabilities including a significant safety gap between open-source and proprietary models, and demonstrating that enhanced reasoning capabilities correlate with increased potential harm when misused.

Authors:Amaury Gouverneur, Tobias J. Oechtering, Mikael Skoglund
Title: Refined PAC-Bayes Bounds for Offline Bandits
Abstract:
In this paper, we present refined probabilistic bounds on empirical reward estimates for off-policy learning in bandit problems. We build on the PAC-Bayesian bounds from Seldin et al. (2010) and improve on their results using a new parameter optimization approach introduced by Rodríguez et al. (2024). This technique is based on a discretization of the space of possible events to optimize the "in probability" parameter. We provide two parameter-free PAC-Bayes bounds, one based on Hoeffding-Azuma's inequality and the other based on Bernstein's inequality. We prove that our bounds are almost optimal as they recover the same rate as would be obtained by setting the "in probability" parameter after the realization of the data.
中文: 本文针对非策略赌博机学习提出了改进的无参数PAC-贝叶斯边界,通过基于事件离散化的新型参数优化方法,在数据实现后获得近乎最优的概率边界性能。
English: This paper introduces enhanced parameter-free PAC-Bayesian bounds for off-policy bandit learning, achieving near-optimal performance through a novel event discretization technique that optimizes probability parameters post-data realization.

Authors:Haoxuan Li, Jifan Yu, Xin Cong, Yang Dang, Daniel Zhang-li, Yisi Zhan, Huiqin Liu, Zhiyuan Liu
Title: Exploring LLM-based Student Simulation for Metacognitive Cultivation
Abstract:
Metacognitive education plays a crucial role in cultivating students' self-regulation and reflective thinking, providing essential support for those with learning difficulties through academic advising. Simulating students with insufficient learning capabilities using large language models offers a promising approach to refining pedagogical methods without ethical concerns. However, existing simulations often fail to authentically represent students' learning struggles and face challenges in evaluation due to the lack of reliable metrics and ethical constraints in data collection. To address these issues, we propose a pipeline for automatically generating and filtering high-quality simulated student agents. Our approach leverages a two-round automated scoring system validated by human experts and employs a score propagation module to obtain more consistent scores across the student graph. Experimental results demonstrate that our pipeline efficiently identifies high-quality student agents, and we discuss the traits that influence the simulation's effectiveness. By simulating students with varying degrees of learning difficulties, our work paves the way for broader applications in personalized learning and educational assessment.
中文: 本研究提出一种自动化流程,利用大语言模型生成高质量模拟学生智能体,通过专家验证的评分和分数传播机制,有效模拟不同学习困难程度,为个性化学习和教育评估开辟了新途径。
English: This study introduces an automated pipeline for generating high-quality simulated student agents using large language models, validated through expert-reviewed scoring and score propagation, to enhance personalized learning and educational assessment by authentically representing diverse learning difficulties.

Authors:Zixiao Huang, Lifeng Guo, Wenhao Li, Junjie Sheng, Chuyun Shen, Haosheng Chen, Bo Jin, Changhong Lu, Xiangfeng Wang
Title: GraphThought: Graph Combinatorial Optimization with Thought Generation
Abstract:
Graph combinatorial optimization (GCO) problems are central to domains like logistics and bioinformatics. While traditional solvers dominate, large language models (LLMs) offer new possibilities for structured reasoning, yet struggle with complex GCO tasks requiring rigorous combinatorial analysis and multi-step deduction, often producing hallucinated steps. We first formalize the Optimal Thoughts Design (OTD) problem, which provides a structured guidance for producing high-quality intermediate reasoning steps. Building on this formulation, we introduce GraphThought, a novel framework that generates effective reasoning sequences through either heuristic-guided forward search or solver-aligned backward reasoning. By fine-tuning LLMs on these structured thought sequences, we develop Llama-GT, an 8B-parameter model that achieves state-of-the-art performance on the GraphArena benchmark, outperforming significantly larger models like DeepSeek-V3. Our results demonstrate that when scaffolded with structured reasoning priors, principled thought generation can significantly enhance LLM performance on GCO tasks without requiring increased model scale.
Chinese Summary: 本文提出了GraphThought框架,通过结构化推理解决复杂图优化问题,并证明基于此微调的Llama-GT模型无需扩大规模即可实现最优性能。
English Summary: The paper introduces GraphThought, a framework that structures reasoning for complex graph optimization tasks, and demonstrates that fine-tuning the resulting Llama-GT model achieves state-of-the-art performance without requiring larger models.

Authors:Wenwu Li, Xiangfeng Wang, Wenhao Li, Bo Jin
Title: A Survey of Automatic Prompt Engineering: An Optimization Perspective
Abstract:
The rise of foundation models has shifted focus from resource-intensive fine-tuning to prompt engineering, a paradigm that steers model behavior through input design rather than weight updates. While manual prompt engineering faces limitations in scalability, adaptability, and cross-modal alignment, automated methods, spanning foundation model (FM) based optimization, evolutionary methods, gradient-based optimization, and reinforcement learning, offer promising solutions. Existing surveys, however, remain fragmented across modalities and methodologies. This paper presents the first comprehensive survey on automated prompt engineering through a unified optimization-theoretic lens. We formalize prompt optimization as a maximization problem over discrete, continuous, and hybrid prompt spaces, systematically organizing methods by their optimization variables (instructions, soft prompts, exemplars), task-specific objectives, and computational frameworks. By bridging theoretical formulation with practical implementations across text, vision, and multimodal domains, this survey establishes a foundational framework for both researchers and practitioners, while highlighting underexplored frontiers in constrained optimization and agent-oriented prompt design.
Chinese: 本综述首次提出自动化提示工程的统一优化框架,按变量类型和目标对跨模态方法进行分类,并指出了未来研究方向。
English: This survey provides the first unified optimization framework for automated prompt engineering, categorizing methods by variable types and objectives across modalities while identifying future research directions.

Authors:Di Wu, Xian Wei, Guang Chen, Hao Shen, Xiangfeng Wang, Wenhao Li, Bo Jin
Title: Generative Multi-Agent Collaboration in Embodied AI: A Systematic Review
Abstract:
Embodied multi-agent systems (EMAS) have attracted growing attention for their potential to address complex, real-world challenges in areas such as logistics and robotics. Recent advances in foundation models pave the way for generative agents capable of richer communication and adaptive problem-solving. This survey provides a systematic examination of how EMAS can benefit from these generative capabilities. We propose a taxonomy that categorizes EMAS by system architectures and embodiment modalities, emphasizing how collaboration spans both physical and virtual contexts. Central building blocks, perception, planning, communication, and feedback, are then analyzed to illustrate how generative techniques bolster system robustness and flexibility. Through concrete examples, we demonstrate the transformative effects of integrating foundation models into embodied, multi-agent frameworks. Finally, we discuss challenges and future directions, underlining the significant promise of EMAS to reshape the landscape of AI-driven collaboration.
中文摘要:本综述探讨了具身多智能体系统如何利用生成式基础模型增强协作能力,通过提出分类框架并分析核心组件,揭示了其在人工智能驱动应用中的变革性潜力。
English Summary: This survey explores how embodied multi-agent systems can leverage generative foundation models to enhance collaboration, proposing a taxonomy and analyzing core components to demonstrate their transformative potential in AI-driven applications.

Authors:Ziyou Jiang, Mingyang Li, Guowei Yang, Junjie Wang, Yuekai Huang, Zhiyuan Chang, Qing Wang
Title: Mimicking the Familiar: Dynamic Command Generation for Information Theft Attacks in LLM Tool-Learning System
Abstract:
Information theft attacks pose a significant risk to Large Language Model (LLM) tool-learning systems. Adversaries can inject malicious commands through compromised tools, manipulating LLMs to send sensitive information to these tools, which leads to potential privacy breaches. However, existing attack approaches are black-box oriented and rely on static commands that cannot adapt flexibly to the changes in user queries and the invocation chain of tools. It makes malicious commands more likely to be detected by LLM and leads to attack failure. In this paper, we propose AutoCMD, a dynamic attack comment generation approach for information theft attacks in LLM tool-learning systems. Inspired by the concept of mimicking the familiar, AutoCMD is capable of inferring the information utilized by upstream tools in the toolchain through learning on open-source systems and reinforcement with target system examples, thereby generating more targeted commands for information theft. The evaluation results show that AutoCMD outperforms the baselines with +13.2% $ASR_{Theft}$, and can be generalized to new tool-learning systems to expose their information leakage risks. We also design four defense methods to effectively protect tool-learning systems from the attack.
中文摘要:AutoCMD是一种针对大型语言模型工具学习系统的动态攻击方法,通过模仿熟悉概念并学习开源系统,能够生成适应性更强的恶意指令以窃取信息,其攻击成功率比现有方法提高13.2%,同时研究还提出了四种有效防御措施。
English Summary: AutoCMD is a dynamic attack method that enhances information theft in LLM tool-learning systems by generating adaptive malicious commands through learning from open-source systems, achieving a 13.2% higher attack success rate than existing methods.

Authors:Liangqi Yuan, Dong-Jun Han, Shiqiang Wang, Christopher G. Brinton
Title: Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings
Abstract:
Compared to traditional machine learning models, recent large language models (LLMs) can exhibit multi-task-solving capabilities through multiple dialogues and multi-modal data sources. These unique characteristics of LLMs, together with their large model size, make their deployment more challenging. Specifically, (i) deploying LLMs on local devices faces computational, memory, and energy resource issues, while (ii) deploying them in the cloud cannot guarantee real-time service and incurs communication/usage costs. In this paper, we design TMO, a local-cloud LLM inference system with Three-M Offloading: Multi-modal, Multi-task, and Multi-dialogue. TMO incorporates (i) a lightweight local LLM that can process simple tasks at high speed and (ii) a large-scale cloud LLM that can handle multi-modal data sources. We develop a resource-constrained reinforcement learning (RCRL) strategy for TMO that optimizes the inference location (i.e., local vs. cloud) and multi-modal data sources to use for each task/dialogue, aiming to maximize the long-term reward (response quality, latency, and usage cost) while adhering to resource constraints. We also contribute M4A1, a new dataset we curated that contains reward and cost metrics across multiple modality, task, dialogue, and LLM configurations, enabling evaluation of offloading decisions. We demonstrate the effectiveness of TMO compared to several exploration-decision and LLM-as-Agent baselines, showing significant improvements in latency, cost, and response quality.
中文: TMO是一种创新的本地-云端大语言模型推理系统,采用三M卸载策略——多模态、多任务和多对话,通过资源受限的强化学习优化任务分配,显著提升了响应延迟、成本和回答质量。
English: TMO is a novel local-cloud LLM inference system that employs a three-M offloading strategy—multi-modal, multi-task, and multi-dialogue—using a resource-constrained reinforcement learning approach to optimize task allocation between local and cloud resources, significantly enhancing latency, cost, and response quality.

Authors:Yepeng Liu, Xuandong Zhao, Dawn Song, Yuheng Bu
Title: Dataset Protection via Watermarked Canaries in Retrieval-Augmented LLMs
Abstract:
Retrieval-Augmented Generation (RAG) has become an effective method for enhancing large language models (LLMs) with up-to-date knowledge. However, it poses a significant risk of IP infringement, as IP datasets may be incorporated into the knowledge database by malicious Retrieval-Augmented LLMs (RA-LLMs) without authorization. To protect the rights of the dataset owner, an effective dataset membership inference algorithm for RA-LLMs is needed. In this work, we introduce a novel approach to safeguard the ownership of text datasets and effectively detect unauthorized use by the RA-LLMs. Our approach preserves the original data completely unchanged while protecting it by inserting specifically designed canary documents into the IP dataset. These canary documents are created with synthetic content and embedded watermarks to ensure uniqueness, stealthiness, and statistical provability. During the detection process, unauthorized usage is identified by querying the canary documents and analyzing the responses of RA-LLMs for statistical evidence of the embedded watermark. Our experimental results demonstrate high query efficiency, detectability, and stealthiness, along with minimal perturbation to the original dataset, all without compromising the performance of the RAG system.
中文: 本文提出一种通过在知识产权数据集中嵌入带水印的验证文档来检测检索增强生成系统未经授权使用的新方法,既能高效识别侵权行为,又能保持数据完整性和系统性能。
English: This paper introduces a novel method using watermark-embedded canary documents to detect unauthorized use of IP datasets in Retrieval-Augmented Generation systems, achieving high detection efficiency while preserving data integrity and system performance.

Authors:Huanqing Wang, Kaixiang Zhang, Amin Vahidi-Moghaddam, Haowei An, Nan Li, Daning Huang, Zhaojian Li
Title: Data-Enabled Predictive Control for Flexible Spacecraft
Abstract:
Spacecraft are vital to space exploration and are often equipped with lightweight, flexible appendages to meet strict weight constraints. These appendages pose significant challenges for modeling and control due to their inherent nonlinearity. Data-driven control methods have gained traction to address such challenges. This paper introduces, to the best of the authors' knowledge, the first application of the data-enabled predictive control (DeePC) framework to boundary control for flexible spacecraft. Leveraging the fundamental lemma, DeePC constructs a non-parametric model by utilizing recorded past trajectories, eliminating the need for explicit model development. The developed method also incorporates dimension reduction techniques to enhance computational efficiency. Through comprehensive numerical simulations, this study compares the proposed method with Lyapunov-based control, demonstrating superior performance and offering a thorough evaluation of data-driven control for flexible spacecraft.
Chinese: 本文首次将数据驱动的预测控制应用于柔性航天器边界控制,利用轨迹数据规避显式建模,并通过仿真验证了其优于李雅普诺夫控制方法的性能。
English: This paper presents the first application of data-enabled predictive control for flexible spacecraft boundary control, using trajectory data to bypass explicit modeling and demonstrating superior performance over Lyapunov-based control through simulations.

Authors:Chenxiang Ma, Xinyi Chen, Yanchen Li, Qu Yang, Yujie Wu, Guoqi Li, Gang Pan, Huajin Tang, Kay Chen Tan, Jibin Wu
Title: Spiking Neural Networks for Temporal Processing: Status Quo and Future Prospects
Abstract:
Temporal processing is fundamental for both biological and artificial intelligence systems, as it enables the comprehension of dynamic environments and facilitates timely responses. Spiking Neural Networks (SNNs) excel in handling such data with high efficiency, owing to their rich neuronal dynamics and sparse activity patterns. Given the recent surge in the development of SNNs, there is an urgent need for a comprehensive evaluation of their temporal processing capabilities. In this paper, we first conduct an in-depth assessment of commonly used neuromorphic benchmarks, revealing critical limitations in their ability to evaluate the temporal processing capabilities of SNNs. To bridge this gap, we further introduce a benchmark suite consisting of three temporal processing tasks characterized by rich temporal dynamics across multiple timescales. Utilizing this benchmark suite, we perform a thorough evaluation of recently introduced SNN approaches to elucidate the current status of SNNs in temporal processing. Our findings indicate significant advancements in recently developed spiking neuron models and neural architectures regarding their temporal processing capabilities, while also highlighting a performance gap in handling long-range dependencies when compared to state-of-the-art non-spiking models. Finally, we discuss the key challenges and outline potential avenues for future research.
中文: 本文评估了脉冲神经网络的时间处理能力,引入了新的基准测试套件以解决现有局限,发现尽管近期SNN模型取得显著进展,但在处理长程依赖方面仍落后于非脉冲模型。
English: This paper evaluates Spiking Neural Networks' temporal processing capabilities, introduces a new benchmark suite to address existing limitations, and finds that while recent SNN models show significant advances, they still lag in handling long-range dependencies compared to non-spiking models.

Authors:Takumi Goto, Yusuke Sakai, Taro Watanabe
Title: Rethinking Evaluation Metrics for Grammatical Error Correction: Why Use a Different Evaluation Process than Human?
Abstract:
One of the goals of automatic evaluation metrics in grammatical error correction (GEC) is to rank GEC systems such that it matches human preferences. However, current automatic evaluations are based on procedures that diverge from human evaluation. Specifically, human evaluation derives rankings by aggregating sentence-level relative evaluation results, e.g., pairwise comparisons, using a rating algorithm, whereas automatic evaluation averages sentence-level absolute scores to obtain corpus-level scores, which are then sorted to determine rankings. In this study, we propose an aggregation method for existing automatic evaluation metrics which aligns with human evaluation methods to bridge this gap. We conducted experiments using various metrics, including edit-based metrics, n-gram based metrics, and sentence-level metrics, and show that resolving the gap improves results for the most of metrics on the SEEDA benchmark. We also found that even BERT-based metrics sometimes outperform the metrics of GPT-4. The proposed ranking method is integrated gec-metrics.
中文摘要:本研究提出了一种与人工评估方法一致的自动语法纠错指标聚合方法,在SEEDA基准测试中显示该方法提升了多数指标的性能,并发现基于BERT的指标有时能超越GPT-4指标。
English Summary: This study introduces an aggregation method for automatic grammatical error correction metrics that aligns with human evaluation procedures, demonstrating improved performance across most metrics on the SEEDA benchmark and showing that BERT-based metrics can sometimes surpass GPT-4 metrics.

Authors:Yiqi Chen, Holger Boche, Tobias J. Oechtering, Mikael Skoglund
Title: Integrated Sensing and Communication with Distributed Rate-Limited Helpers
Abstract:
This paper studies integrated sensing and communication (ISAC) systems with two rate-limited helpers who observe the channel state sequence and the feedback sequence, respectively. Depending on the timing of compressing and using the state information, our proposed coding scheme gives an inner bound of the capacity-compression-distortion tradeoff region. The tradeoff is realized by sending part of the state information at the beginning of the transmission to facilitate the communication and compressing the remaining part together with the feedback signal. The inner bound becomes tight bounds in several special cases.
Chinese: 本文提出了一种编码方案,通过策略性地压缩和利用信道状态与反馈信息,实现了集成感知与通信系统中容量-压缩-失真权衡的内界,并在特定情况下收紧为紧界。
English: This paper proposes a coding scheme that achieves an inner bound for the capacity-compression-distortion tradeoff in integrated sensing and communication systems by strategically compressing and utilizing channel state and feedback information, which tightens to exact bounds in specific cases.

Authors:Wenhui Ma, Wenhao Li, Bo Jin, Changhong Lu, Xiangfeng Wang
Title: SkyRover: A Modular Simulator for Cross-Domain Pathfinding
Abstract:
Unmanned Aerial Vehicles (UAVs) and Automated Guided Vehicles (AGVs) increasingly collaborate in logistics, surveillance, inspection tasks and etc. However, existing simulators often focus on a single domain, limiting cross-domain study. This paper presents the SkyRover, a modular simulator for UAV-AGV multi-agent pathfinding (MAPF). SkyRover supports realistic agent dynamics, configurable 3D environments, and convenient APIs for external solvers and learning methods. By unifying ground and aerial operations, it facilitates cross-domain algorithm design, testing, and benchmarking. Experiments highlight SkyRover's capacity for efficient pathfinding and high-fidelity simulations in UAV-AGV coordination. Project is available at https://sites.google.com/view/mapf3d/home.
Chinese Summary: 本文提出SkyRover模块化模拟器,通过整合真实动力学、可配置环境及外部工具,实现无人机与自动导引车的跨域多智能体路径规划,支持算法开发与测试。
English Summary: This paper introduces SkyRover, a modular simulator that enables cross-domain multi-agent pathfinding for UAVs and AGVs by integrating realistic dynamics, configurable environments, and external tools to support algorithm development and testing.

Authors:Takahiro Yonemaru, Weiwei Wan, Tatsuki Nishimura, Kensuke Harada
Title: Learning to Push, Group, and Grasp: A Diffusion Policy Approach for Multi-Object Delivery
Abstract:
Simultaneously grasping and delivering multiple objects can significantly enhance robotic work efficiency and has been a key research focus for decades. The primary challenge lies in determining how to push objects, group them, and execute simultaneous grasping for respective groups while considering object distribution and the hardware constraints of the robot. Traditional rule-based methods struggle to flexibly adapt to diverse scenarios. To address this challenge, this paper proposes an imitation learning-based approach. We collect a series of expert demonstrations through teleoperation and train a diffusion policy network, enabling the robot to dynamically generate action sequences for pushing, grouping, and grasping, thereby facilitating efficient multi-object grasping and delivery. We conducted experiments to evaluate the method under different training dataset sizes, varying object quantities, and real-world object scenarios. The results demonstrate that the proposed approach can effectively and adaptively generate multi-object grouping and grasping strategies. With the support of more training data, imitation learning is expected to be an effective approach for solving the multi-object grasping problem.
中文摘要:本文提出一种基于模仿学习的方法,通过扩散策略网络使机器人能够动态生成推送、分组和抓取的动作序列,在不同场景下实现高效自适应的多物体抓取与运送。
English Summary: This paper introduces an imitation learning-based method using a diffusion policy network to enable robots to dynamically generate action sequences for pushing, grouping, and grasping multiple objects, demonstrating effective and adaptive multi-object handling across various scenarios.

Authors:Hao Chen, Takuya Kiyokawa, Weiwei Wan, Kensuke Harada
Title: Adaptive Grasping of Moving Objects in Dense Clutter via Global-to-Local Detection and Static-to-Dynamic Planning
Abstract:
Robotic grasping is facing a variety of real-world uncertainties caused by non-static object states, unknown object properties, and cluttered object arrangements. The difficulty of grasping increases with the presence of more uncertainties, where commonly used learning-based approaches struggle to perform consistently across varying conditions. In this study, we integrate the idea of similarity matching to tackle the challenge of grasping novel objects that are simultaneously in motion and densely cluttered using a single RGBD camera, where multiple uncertainties coexist. We achieve this by shifting visual detection from global to local states and operating grasp planning from static to dynamic scenes. Notably, we introduce optimization methods to enhance planning efficiency for this time-sensitive task. Our proposed system can adapt to various object types, arrangements and movement speeds without the need for extensive training, as demonstrated by real-world experiments. Videos are available at https://youtu.be/sdC50dx-xp8?si=27oVr4dhG0rqN_tT.
中文: 本研究采用相似性匹配方法,通过单目RGBD相机提升了机器人在动态密集环境中抓取新物体的能力,无需大量训练即可适应不同物体类型和运动速度。
English: This study introduces a similarity matching approach to enhance robotic grasping of moving and cluttered novel objects using a single RGBD camera, improving adaptability across various conditions without extensive training.

Authors:Yangguang He, Wenhao Li, Minzhe Li, Juan Zhang, Xiangfeng Wang, Bo Jin
Title: TrackDiffuser: Nearly Model-Free Bayesian Filtering with Diffusion Model
Abstract:
State estimation remains a fundamental challenge across numerous domains, from autonomous driving, aircraft tracking to quantum system control. Although Bayesian filtering has been the cornerstone solution, its classical model-based paradigm faces two major limitations: it struggles with inaccurate state space model (SSM) and requires extensive prior knowledge of noise characteristics. We present TrackDiffuser, a generative framework addressing both challenges by reformulating Bayesian filtering as a conditional diffusion model. Our approach implicitly learns system dynamics from data to mitigate the effects of inaccurate SSM, while simultaneously circumventing the need for explicit measurement models and noise priors by establishing a direct relationship between measurements and states. Through an implicit predict-and-update mechanism, TrackDiffuser preserves the interpretability advantage of traditional model-based filtering methods. Extensive experiments demonstrate that our framework substantially outperforms both classical and contemporary hybrid methods, especially in challenging non-linear scenarios involving non-Gaussian noises. Notably, TrackDiffuser exhibits remarkable robustness to SSM inaccuracies, offering a practical solution for real-world state estimation problems where perfect models and prior knowledge are unavailable.
中文摘要:TrackDiffuser作为一种生成式框架,将贝叶斯滤波重构为条件扩散模型,通过从数据中学习系统动态来克服状态空间模型不准确和噪声先验知识缺失的局限,同时保持传统方法的可解释性优势。
English Summary: TrackDiffuser is a generative framework that reformulates Bayesian filtering as a conditional diffusion model, overcoming limitations of inaccurate state space models and noise priors by learning system dynamics directly from data while maintaining interpretability.

Authors:Yongcheng Zeng, Xinyu Cui, Xuanfa Jin, Guoqing Liu, Zexu Sun, Dong Li, Ning Yang, Jianye Hao, Haifeng Zhang, Jun Wang
Title: Evolving LLMs' Self-Refinement Capability via Iterative Preference Optimization
Abstract:
While large language models (LLMs) have demonstrated remarkable general performance, enabling smaller models to achieve capabilities comparable to their larger counterparts remains a critical challenge. For humans, iterative refinement of problem analysis and responses is a common strategy to enhance answer quality. However, we observe that existing LLMs exhibit limited ability to refine their outputs for quality improvement. In this paper, we first investigate mechanisms to unlock and progressively enhance self-refinement ability in smaller models within an iterative preference optimization framework, aiming to bridge the performance gap with larger models. To this end, we propose EVOLVE, a novel post-training and inference framework that iteratively integrates preference training with self-refinement-driven data collection. During training, EVOLVE strengthens the model's direct question-answering ability while simultaneously unlocking its self-refinement potential. At inference, the framework leverages this capability to generate progressively refined responses, which are filtered to construct datasets for subsequent rounds of preference training. Experiments demonstrate EVOLVE's exceptional performance: when applied to Llama-3.1-8B base model and under the self-refinement setting, it surpasses state-of-the-art models including Llama-3.1-405B-Instruct and GPT-4o, achieving a 62.3% length-controlled win rate and 63.3% raw win rate on AlpacaEval 2, along with a 50.3% win rate on Arena-Hard. Furthermore, EVOLVE consistently enhances performance on mathematical reasoning tasks like GSM8K and MATH.
中文: 本文提出EVOLVE框架,通过迭代偏好优化增强小语言模型的自我优化能力,使其在多项基准测试中超越GPT-4o等大型模型。
English: This paper introduces EVOLVE, a framework that enhances small language models' self-refinement capabilities through iterative preference optimization, enabling them to surpass larger models like GPT-4o in performance benchmarks.

Authors:Sara Saeidian, Tobias J. Oechtering, Mikael Skoglund
Title: Evaluating Differential Privacy on Correlated Datasets Using Pointwise Maximal Leakage
Abstract:
Data-driven advancements significantly contribute to societal progress, yet they also pose substantial risks to privacy. In this landscape, differential privacy (DP) has become a cornerstone in privacy preservation efforts. However, the adequacy of DP in scenarios involving correlated datasets has sometimes been questioned and multiple studies have hinted at potential vulnerabilities. In this work, we delve into the nuances of applying DP to correlated datasets by leveraging the concept of pointwise maximal leakage (PML) for a quantitative assessment of information leakage. Our investigation reveals that DP's guarantees can be arbitrarily weak for correlated databases when assessed through the lens of PML. More precisely, we prove the existence of a pure DP mechanism with PML levels arbitrarily close to that of a mechanism which releases individual entries from a database without any perturbation. By shedding light on the limitations of DP on correlated datasets, our work aims to foster a deeper understanding of subtle privacy risks and highlight the need for the development of more effective privacy-preserving mechanisms tailored to diverse scenarios.
Chinese: 差分隐私在关联数据集上的保护效果有限,点态最大泄露分析揭示其信息泄露程度接近未加扰的数据发布。
English: Differential privacy offers weak protection for correlated datasets, as demonstrated by pointwise maximal leakage analysis showing information leakage comparable to unperturbed data release.

Authors:Guanxu Chen, Dongrui Liu, Tao Luo, Lijie Hu, Jing Shao
Title: Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring
Abstract:
Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making process remain unclear. Chain-of-thoughts (CoTs) have been commonly utilized to monitor LLMs, but this strategy fails to accurately reflect LLMs' thinking process. Techniques based on LLMs' hidden representations provide an inner perspective to monitor their latent thinking. However, previous methods only try to develop external monitors instead of making LLMs themselves easier to monitor. In this paper, we propose a novel method TELLME, improving the transparency of LLMs and helping monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase the applications of TELLME on trustworthiness tasks (\eg, safety risks monitoring tasks and detoxification tasks), where LLMs achieve consistent improvement in transparency and task performance. More crucially, we theoretically analyze the improvement of TELLME on LLMs' generalization ability through optimal transport theory.
中文:TELLME方法通过提升大语言模型的透明度,有效辅助监测其不当行为,在可信任务中显著提高性能,并通过最优传输理论验证了其泛化能力的增强。
English: The TELLME method enhances large language models' transparency and monitoring effectiveness, demonstrating improved performance in trustworthiness tasks and generalization ability through theoretical analysis.

Authors:Neetu R. R, Ozan Alp Topal, Özlem Tuğfe Demir, Emil Björnson, Cicek Cavdar, Gourab Ghatak, Vivek Ashok Bohara
Title: UAV-Based Cell-Free Massive MIMO: Joint Placement and Power Optimization under Fronthaul Capacity Limitations
Abstract:
We consider a cell-free massive multiple-input multiple-output (mMIMO) network, where unmanned aerial vehicles (UAVs) equipped with multiple antennas serve as distributed UAV-access points (UAV-APs). These UAV-APs provide seamless coverage by jointly serving user equipments (UEs) with out predefined cell boundaries. However, high-capacity wireless networks face significant challenges due to fronthaul limitations in UAV-assisted architectures. This letter proposes a novel UAV-based cell-free mMIMO framework that leverages distributed UAV-APs to serve UEs while addressing the capacity constraints of wireless fronthaul links. We evaluate functional split Options 7.2 and 8 for the fronthaul links, aiming to maximize the minimum signal-to-interference-plus-noise ratio (SINR) among the UEs and minimize the power consumption by optimizing the transmit powers of UAV-APs and selectively activating them. Our analysis compares sub-6 GHz and millimeter wave (mmWave) bands for the fronthaul, showing that mmWave achieves superior SINR with lower power consumption, particularly under Option 8. Additionally, we determine the minimum fronthaul bandwidth required to activate a single UAV-AP under different split options.
中文: 本文提出了一种基于无人机的无蜂窝大规模MIMO新框架,通过分布式无人机接入点服务用户并解决无线前传容量限制,研究表明毫米波频段结合选项8能比6GHz以下频段实现更优信干噪比和更低功耗。
English: This letter introduces a novel cell-free massive MIMO framework using distributed UAV access points to serve users while overcoming wireless fronthaul capacity constraints, demonstrating that millimeter wave bands with Option 8 achieve better SINR and lower power consumption than sub-6 GHz bands.

Authors:Tobias Dietz, Brian B. Moser, Tobias Nauen, Federico Raue, Stanislav Frolov, Andreas Dengel
Title: A Study in Dataset Distillation for Image Super-Resolution
Abstract:
Dataset distillation is the concept of condensing large datasets into smaller but highly representative synthetic samples. While previous research has primarily focused on image classification, its application to image Super-Resolution (SR) remains underexplored. This exploratory work studies multiple dataset distillation techniques applied to SR, including pixel- and latent-space approaches under different aspects. Our experiments demonstrate that a 91.12% dataset size reduction can be achieved while maintaining comparable SR performance to the full dataset. We further analyze initialization strategies and distillation methods to optimize memory efficiency and computational costs. Our findings provide new insights into dataset distillation for SR and set the stage for future advancements.
中文: 数据集蒸馏可将图像超分辨率任务的数据集规模缩减91.12%且保持性能,通过像素与潜在空间方法及初始化策略优化了内存与计算效率。
English: Dataset distillation effectively reduces dataset size by over 91% for image super-resolution while preserving performance, exploring various techniques and initialization strategies to enhance efficiency.

Authors:Yunlong Yuan, Yuanfan Guo, Chunwei Wang, Wei Zhang, Hang Xu, Li Zhang
Title: FreqPrior: Improving Video Diffusion Models with Frequency Filtering Gaussian Noise
Abstract:
Text-driven video generation has advanced significantly due to developments in diffusion models. Beyond the training and sampling phases, recent studies have investigated noise priors of diffusion models, as improved noise priors yield better generation results. One recent approach employs the Fourier transform to manipulate noise, marking the initial exploration of frequency operations in this context. However, it often generates videos that lack motion dynamics and imaging details. In this work, we provide a comprehensive theoretical analysis of the variance decay issue present in existing methods, contributing to the loss of details and motion dynamics. Recognizing the critical impact of noise distribution on generation quality, we introduce FreqPrior, a novel noise initialization strategy that refines noise in the frequency domain. Our method features a novel filtering technique designed to address different frequency signals while maintaining the noise prior distribution that closely approximates a standard Gaussian distribution. Additionally, we propose a partial sampling process by perturbing the latent at an intermediate timestep during finding the noise prior, significantly reducing inference time without compromising quality. Extensive experiments on VBench demonstrate that our method achieves the highest scores in both quality and semantic assessments, resulting in the best overall total score. These results highlight the superiority of our proposed noise prior.
中文摘要:本文提出FreqPrior,一种新颖的噪声初始化策略,通过在频域优化噪声来解决现有方法中的方差衰减问题,从而提升视频生成的运动动态和成像细节,同时显著减少推理时间。
English Summary: This paper introduces FreqPrior, a novel noise initialization strategy that enhances text-driven video generation by refining noise in the frequency domain, addressing variance decay issues to improve motion dynamics and imaging details while reducing inference time.

Authors:Evan Chen, Frank Po-Chen Lin, Dong-Jun Han, Christopher G. Brinton
Title: Differentially-Private Multi-Tier Federated Learning: A Formal Analysis and Evaluation
Abstract:
While federated learning (FL) eliminates the transmission of raw data over a network, it is still vulnerable to privacy breaches from the communicated model parameters. Differential privacy (DP) is often employed to address such issues. However, the impact of DP on FL in multi-tier networks -- where hierarchical aggregations couple noise injection decisions at different tiers, and trust models are heterogeneous across subnetworks -- is not well understood. To fill this gap, we develop \underline{M}ulti-Tier \underline{F}ederated Learning with \underline{M}ulti-Tier \underline{D}ifferential \underline{P}rivacy ({\tt M$^2$FDP}), a DP-enhanced FL methodology for jointly optimizing privacy and performance over such networks. One of the key principles of {\tt M$^2$FDP} is to adapt DP noise injection across the established edge/fog computing hierarchy (e.g., edge devices, intermediate nodes, and other tiers up to cloud servers) according to the trust models in different subnetworks. We conduct a comprehensive analysis of the convergence behavior of {\tt M$^2$FDP} under non-convex problem settings, revealing conditions on parameter tuning under which the training process converges sublinearly to a finite stationarity gap that depends on the network hierarchy, trust model, and target privacy level. We show how these relationships can be employed to develop an adaptive control algorithm for {\tt M$^2$FDP} that tunes properties of local model training to minimize energy, latency, and the stationarity gap while meeting desired convergence and privacy criterion. Subsequent numerical evaluations demonstrate that {\tt M$^2$FDP} obtains substantial improvements in these metrics over baselines for different privacy budgets and system configurations.
中文: 提出的M²FDP框架通过在多层级网络中自适应地应用差分隐私,优化隐私保护与性能,在非凸问题设置下确保收敛,同时显著提升能效、延迟和收敛性指标。
English: The proposed M²FDP framework enhances federated learning in multi-tier networks by adaptively applying differential privacy across hierarchical levels, optimizing privacy, performance, and resource efficiency while ensuring convergence under non-convex settings.

Authors:Keyi Zhu, Jiajia Li, Kaixiang Zhang, Chaaran Arunachalam, Siddhartha Bhattacharya, Renfu Lu, Zhaojian Li
Title: Foundation Model-Based Apple Ripeness and Size Estimation for Selective Harvesting
Abstract:
Harvesting is a critical task in the tree fruit industry, demanding extensive manual labor and substantial costs, and exposing workers to potential hazards. Recent advances in automated harvesting offer a promising solution by enabling efficient, cost-effective, and ergonomic fruit picking within tight harvesting windows. However, existing harvesting technologies often indiscriminately harvest all visible and accessible fruits, including those that are unripe or undersized. This study introduces a novel foundation model-based framework for efficient apple ripeness and size estimation. Specifically, we curated two public RGBD-based Fuji apple image datasets, integrating expanded annotations for ripeness ("Ripe" vs. "Unripe") based on fruit color and image capture dates. The resulting comprehensive dataset, Fuji-Ripeness-Size Dataset, includes 4,027 images and 16,257 annotated apples with ripeness and size labels. Using Grounding-DINO, a language-model-based object detector, we achieved robust apple detection and ripeness classification, outperforming other state-of-the-art models. Additionally, we developed and evaluated six size estimation algorithms, selecting the one with the lowest error and variation for optimal performance. The Fuji-Ripeness-Size Dataset and the apple detection and size estimation algorithms are made publicly available, which provides valuable benchmarks for future studies in automated and selective harvesting.
中文摘要:本研究提出了一种基于基础模型的框架,通过RGBD图像实现苹果成熟度和尺寸的精准评估,在检测与分类性能上表现优异,同时公开数据集和算法以推动选择性自动化采摘技术的发展。
English Summary: This study presents a foundation model-based framework for accurate apple ripeness and size estimation using RGBD images, achieving superior detection and classification performance while making the dataset and algorithms publicly available to advance selective automated harvesting.

Authors:David D. Baek, Ziming Liu, Riya Tyagi, Max Tegmark
Title: Harmonic Loss Trains Interpretable AI Models
Abstract:
In this paper, we introduce harmonic loss as an alternative supervisory signal for training neural networks and large language models (LLMs). Harmonic loss differs from standard cross-entropy loss by (a) replacing the usual SoftMax normalization with a scale-invariant HarMax function and (b) computing logits via Euclidean distance rather than a dot product. Harmonic loss enables improved interpretability and faster convergence, owing to its scale invariance and finite convergence point by design, which can be interpreted as a class center. We first validate the performance of harmonic models across algorithmic, vision, and language datasets. Through extensive experiments, we demonstrate that models trained with harmonic loss perform better than standard models by: (a) enhancing interpretability, (b) requiring less data for generalization, and (c) reducing grokking. Moreover, we compare a GPT-2 model trained with harmonic loss to the standard GPT-2, illustrating that the harmonic model develops more interpretable representations. Looking forward, we believe harmonic loss may become a valuable tool in domains with limited data availability or in high-stakes applications where interpretability and reliability are paramount, paving the way for more robust and efficient neural network models.
中文: 本文提出谐波损失作为神经网络和大语言模型的新型训练方法,通过采用尺度不变的HarMax归一化和基于欧氏距离的logits,增强了模型可解释性、加速收敛,并以更少数据实现更好性能。
English: This paper proposes harmonic loss as a novel training method for neural networks and LLMs, which enhances interpretability, accelerates convergence, and improves performance with less data by using scale-invariant HarMax normalization and Euclidean distance-based logits.

Authors:Wenhao Li, Yue Lin, Xiangfeng Wang, Bo Jin, Hongyuan Zha, Baoxiang Wang
Title: Verbalized Bayesian Persuasion
Abstract:
Information design (ID) explores how a sender influence the optimal behavior of receivers to achieve specific objectives. While ID originates from everyday human communication, existing game-theoretic and machine learning methods often model information structures as numbers, which limits many applications to toy games. This work leverages LLMs and proposes a verbalized framework in Bayesian persuasion (BP), which extends classic BP to real-world games involving human dialogues for the first time. Specifically, we map the BP to a verbalized mediator-augmented extensive-form game, where LLMs instantiate the sender and receiver. To efficiently solve the verbalized game, we propose a generalized equilibrium-finding algorithm combining LLM and game solver. The algorithm is reinforced with techniques including verbalized commitment assumptions, verbalized obedience constraints, and information obfuscation. Numerical experiments in dialogue scenarios, such as recommendation letters, courtroom interactions, and law enforcement, validate that our framework can both reproduce theoretical results in classic BP and discover effective persuasion strategies in more complex natural language and multi-stage scenarios.
中文: 本研究提出了一种基于大语言模型的口头化贝叶斯劝说框架,将博弈论方法扩展到现实世界的人类对话中,并通过推荐信和法庭互动等场景验证了其有效性。
English: This study introduces a verbalized Bayesian persuasion framework using LLMs to extend game-theoretic approaches to real-world human dialogues, validated through scenarios like recommendation letters and courtroom interactions.

Authors:Zhizhen Zhang, Lei Zhu, Zhen Fang, Zi Huang, Yadan Luo
Title: Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents
Abstract:
Pre-training vision-language representations on human action videos has emerged as a promising approach to reduce reliance on large-scale expert demonstrations for training embodied agents. However, prior methods often employ time contrastive learning based on goal-reaching heuristics, progressively aligning language instructions from the initial to the final frame. This overemphasis on future frames can result in erroneous vision-language associations, as actions may terminate early or include irrelevant moments in the end. To address this issue, we propose Action Temporal Coherence Learning (AcTOL) to learn ordered and continuous vision-language representations without rigid goal-based constraint. AcTOL treats a video as a continuous trajectory where it (1) contrasts semantic differences between frames to reflect their natural ordering, and (2) imposes a local Brownian bridge constraint to ensure smooth transitions across intermediate frames. Extensive imitation learning experiments on both simulated and real robots show that the pretrained features significantly enhance downstream manipulation tasks with high robustness to different linguistic styles of instructions, offering a viable pathway toward generalized embodied agents.
中文摘要:提出的动作时序连贯学习(AcTOL)方法通过帧间对比和局部约束学习有序连续的视觉语言表征,有效解决了先前方法依赖终点帧的局限,显著提升了机器人操作任务在不同语言指令下的性能。
English Summary: The proposed Action Temporal Coherence Learning (AcTOL) method addresses limitations in prior vision-language pre-training by learning ordered and continuous representations through frame contrast and local constraints, significantly improving robotic manipulation tasks across varied instructions.

Authors:Jamshid Mozafari, Bhawna Piryani, Abdelrahman Abdallah, Adam Jatowt
Title: HintEval: A Comprehensive Framework for Hint Generation and Evaluation for Questions
Abstract:
Large Language Models (LLMs) are transforming how people find information, and many users turn nowadays to chatbots to obtain answers to their questions. Despite the instant access to abundant information that LLMs offer, it is still important to promote critical thinking and problem-solving skills. Automatic hint generation is a new task that aims to support humans in answering questions by themselves by creating hints that guide users toward answers without directly revealing them. In this context, hint evaluation focuses on measuring the quality of hints, helping to improve the hint generation approaches. However, resources for hint research are currently spanning different formats and datasets, while the evaluation tools are missing or incompatible, making it hard for researchers to compare and test their models. To overcome these challenges, we introduce HintEval, a Python library that makes it easy to access diverse datasets and provides multiple approaches to generate and evaluate hints. HintEval aggregates the scattered resources into a single toolkit that supports a range of research goals and enables a clear, multi-faceted, and reliable evaluation. The proposed library also includes detailed online documentation, helping users quickly explore its features and get started. By reducing barriers to entry and encouraging consistent evaluation practices, HintEval offers a major step forward for facilitating hint generation and analysis research within the NLP/IR community.
中文: 大型语言模型(LLMs)正在改变信息获取方式,但培养批判性思维仍至关重要,为此推出了HintEval这一Python库,它整合了提示生成与评估的资源及工具,以支持自主解决问题。
English: Large Language Models (LLMs) are revolutionizing information access, but fostering critical thinking remains essential, leading to the development of HintEval, a Python library that consolidates resources and tools for generating and evaluating hints to support independent problem-solving.

Authors:Junpeng Wang, Chin-Chia Michael Yeh, Uday Singh Saini, Mahashweta Das
Title: Visual Attention Exploration in Vision-Based Mamba Models
Abstract:
State space models (SSMs) have emerged as an efficient alternative to transformer-based models, offering linear complexity that scales better than transformers. One of the latest advances in SSMs, Mamba, introduces a selective scan mechanism that assigns trainable weights to input tokens, effectively mimicking the attention mechanism. Mamba has also been successfully extended to the vision domain by decomposing 2D images into smaller patches and arranging them as 1D sequences. However, it remains unclear how these patches interact with (or attend to) each other in relation to their original 2D spatial location. Additionally, the order used to arrange the patches into a sequence also significantly impacts their attention distribution. To better understand the attention between patches and explore the attention patterns, we introduce a visual analytics tool specifically designed for vision-based Mamba models. This tool enables a deeper understanding of how attention is distributed across patches in different Mamba blocks and how it evolves throughout a Mamba model. Using the tool, we also investigate the impact of different patch-ordering strategies on the learned attention, offering further insights into the model's behavior.
中文: Mamba作为一种选择性扫描的状态空间模型,已应用于视觉领域,但图像块间的交互与排序影响尚不明确,为此开发了可视化分析工具来探究注意力分布及其演变规律。
English: Mamba, a state space model with selective scanning, has been adapted for vision tasks but lacks clarity on patch interactions and ordering effects, prompting the development of a visual analytics tool to analyze attention patterns and strategies.

Authors:Hu Wang, Ibrahim Almakky, Congbo Ma, Numan Saeed, Mohammad Yaqub
Title: In-Model Merging for Enhancing the Robustness of Medical Imaging Classification Models
Abstract:
Model merging is an effective strategy to merge multiple models for enhancing model performances, and more efficient than ensemble learning as it will not introduce extra computation into inference. However, limited research explores if the merging process can occur within one model and enhance the model's robustness, which is particularly critical in the medical image domain. In the paper, we are the first to propose in-model merging (InMerge), a novel approach that enhances the model's robustness by selectively merging similar convolutional kernels in the deep layers of a single convolutional neural network (CNN) during the training process for classification. We also analytically reveal important characteristics that affect how in-model merging should be performed, serving as an insightful reference for the community. We demonstrate the feasibility and effectiveness of this technique for different CNN architectures on 4 prevalent datasets. The proposed InMerge-trained model surpasses the typically-trained model by a substantial margin. The code will be made public.
Chinese: 本文提出了一种新颖的模型内融合方法InMerge,通过在训练过程中选择性合并单个卷积神经网络中的相似卷积核来增强模型鲁棒性,并在多种架构和数据集上证明了其显著性能提升。
English: This paper introduces InMerge, a novel in-model merging technique that enhances the robustness of a single CNN by selectively merging similar convolutional kernels during training, demonstrating significant performance improvements across various architectures and datasets.

Authors:Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen
Title: Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
Abstract:
Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a ``token'' is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a $k\times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20$\times$ faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2$\times$ faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.
中文: xAR框架通过将预测单元扩展为灵活的实体并采用噪声上下文学习来缓解曝光偏差,在图像生成任务中实现了最先进的性能且推理速度显著提升。
English: The xAR framework generalizes autoregressive modeling by allowing flexible prediction units called entities and mitigates exposure bias through noisy context learning, achieving state-of-the-art image generation performance with significantly faster inference.

Authors:Yiwei Li, Ji Zhang, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
Title: Revisiting Self-Consistency from Dynamic Distributional Alignment Perspective on Answer Aggregation
Abstract:
Self-consistency improves reasoning by aggregating diverse stochastic samples, yet the dynamics behind its efficacy remain underexplored. We reframe self-consistency as a dynamic distributional alignment problem, revealing that decoding temperature not only governs sampling randomness but also actively shapes the latent answer distribution. Given that high temperatures require prohibitively large sample sizes to stabilize, while low temperatures risk amplifying biases, we propose a confidence-driven mechanism that dynamically calibrates temperature: sharpening the sampling distribution under uncertainty to align with high-probability modes, and promoting exploration when confidence is high. Experiments on mathematical reasoning tasks show this approach outperforms fixed-diversity baselines under limited samples, improving both average and best-case performance across varying initial temperatures without additional data or modules. This establishes self-consistency as a synchronization challenge between sampling dynamics and evolving answer distributions.
Chinese: 本研究将自一致性重新定义为动态分布对齐问题,提出基于置信度的温度校准机制——在不确定时锐化采样分布以对齐高概率模式,在置信度高时促进探索,在数学推理任务中仅用有限样本就超越了固定多样性基线,且无需额外数据或模块。
English: This research reframes self-consistency as a dynamic distributional alignment challenge and introduces a confidence-driven temperature calibration mechanism that sharpens sampling under uncertainty while promoting exploration when confident, achieving superior performance in mathematical reasoning tasks with limited samples without extra data or modules.

Authors:Zhihao Shi, Jie Wang, Zhiwei Zhuang, Xize Liang, Bin Li, Feng Wu
Title: Accurate and Scalable Graph Neural Networks via Message Invariance
Abstract:
Message passing-based graph neural networks (GNNs) have achieved great success in many real-world applications. For a sampled mini-batch of target nodes, the message passing process is divided into two parts: message passing between nodes within the batch (MP-IB) and message passing from nodes outside the batch to those within it (MP-OB). However, MP-OB recursively relies on higher-order out-of-batch neighbors, leading to an exponentially growing computational cost with respect to the number of layers. Due to the neighbor explosion, the whole message passing stores most nodes and edges on the GPU such that many GNNs are infeasible to large-scale graphs. To address this challenge, we propose an accurate and fast mini-batch approach for large graph transductive learning, namely topological compensation (TOP), which obtains the outputs of the whole message passing solely through MP-IB, without the costly MP-OB. The major pillar of TOP is a novel concept of message invariance, which defines message-invariant transformations to convert costly MP-OB into fast MP-IB. This ensures that the modified MP-IB has the same output as the whole message passing. Experiments demonstrate that TOP is significantly faster than existing mini-batch methods by order of magnitude on vast graphs (millions of nodes and billions of edges) with limited accuracy degradation.
Chinese: 所提出的TOP方法通过消息不变性变换将代价高昂的批外消息传递转化为快速的批内处理,实现了大规模图上半监督学习的高效扩展,且精度损失极小。
English: The proposed TOP method efficiently converts costly out-of-batch message passing into fast in-batch processing using message-invariant transformations, enabling scalable transductive learning on large graphs with minimal accuracy loss.

Authors:Lei Zhao, Sizhou Chen, Linfeng Feng, Jichao Zhang, Xiao-Lei Zhang, Chi Zhang, Xuelong Li
Title: DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model
Abstract:
Text-to-audio (TTA), which generates audio signals from textual descriptions, has received huge attention in recent years. However, recent works focused on text to monaural audio only. As we know, spatial audio provides more immersive auditory experience than monaural audio, e.g. in virtual reality. To address this issue, we propose a text-to-spatial-audio (TTSA) generation framework named DualSpec. Specifically, it first trains variational autoencoders (VAEs) for extracting the latent acoustic representations from sound event audio. Then, given text that describes sound events and event directions, the proposed method uses the encoder of a pretrained large language model to transform the text into text features. Finally, it trains a diffusion model from the latent acoustic representations and text features for the spatial audio generation. In the inference stage, only the text description is needed to generate spatial audio. Particularly, to improve the synthesis quality and azimuth accuracy of the spatial sound events simultaneously, we propose to use two kinds of acoustic features. One is the Mel spectrograms which is good for improving the synthesis quality, and the other is the short-time Fourier transform spectrograms which is good at improving the azimuth accuracy. We provide a pipeline of constructing spatial audio dataset with text prompts, for the training of the VAEs and diffusion model. We also introduce new spatial-aware evaluation metrics to quantify the azimuth errors of the generated spatial audio recordings. Experimental results demonstrate that the proposed method can generate spatial audio with high directional and event consistency.
中文摘要:提出的DualSpec框架通过结合变分自编码器和扩散模型,填补了文本到空间音频生成的空白,同时利用梅尔谱和短时傅里叶变换谱来提升音频质量与方位精度。
English Summary: The proposed DualSpec framework addresses the gap in text-to-spatial-audio generation by combining variational autoencoders with a diffusion model, utilizing both Mel and STFT spectrograms to enhance audio quality and directional accuracy.

Authors:Weiyan Wang, Xingjian Shi, Ruiqi Shu, Yuan Gao, Rui Ray Chen, Kun Wang, Fan Xu, Jinbao Xue, Shuaipeng Li, Yangyu Tao, Di Wang, Hao Wu, Xiaomeng Huang
Title: BeamVQ: Beam Search with Vector Quantization to Mitigate Data Scarcity in Physical Spatiotemporal Forecasting
Abstract:
In practice, physical spatiotemporal forecasting can suffer from data scarcity, because collecting large-scale data is non-trivial, especially for extreme events. Hence, we propose \method{}, a novel probabilistic framework to realize iterative self-training with new self-ensemble strategies, achieving better physical consistency and generalization on extreme events. Following any base forecasting model, we can encode its deterministic outputs into a latent space and retrieve multiple codebook entries to generate probabilistic outputs. Then BeamVQ extends the beam search from discrete spaces to the continuous state spaces in this field. We can further employ domain-specific metrics (e.g., Critical Success Index for extreme events) to filter out the top-k candidates and develop the new self-ensemble strategy by combining the high-quality candidates. The self-ensemble can not only improve the inference quality and robustness but also iteratively augment the training datasets during continuous self-training. Consequently, BeamVQ realizes the exploration of rare but critical phenomena beyond the original dataset. Comprehensive experiments on different benchmarks and backbones show that BeamVQ consistently reduces forecasting MSE (up to 39%), enhancing extreme events detection and proving its effectiveness in handling data scarcity.
中文:BeamVQ框架通过新颖的自集成策略实现迭代自训练,有效应对时空预测中的数据稀缺问题,在提高物理一致性的同时将预测误差降低高达39%,并显著增强了极端事件的检测能力。
English: The proposed BeamVQ framework addresses data scarcity in spatiotemporal forecasting through iterative self-training with novel self-ensemble strategies, improving physical consistency and reducing forecasting errors by up to 39% while enhancing extreme event detection.

Authors:Hongye Jin, Pei Chen, Jingfeng Yang, Zhengyang Wang, Meng Jiang, Yifan Gao, Binxuan Huang, Xinyang Zhang, Zheng Li, Tianyi Liu, Huasheng Li, Bing Yin
Title: END: Early Noise Dropping for Efficient and Effective Context Denoising
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences that degrades output quality. This problem affects both long- and short-context scenarios, such as retrieval-augmented generation, table question-answering, and in-context learning. We reveal that LLMs can implicitly identify whether input sequences contain useful information at early layers, prior to token generation. Leveraging this insight, we introduce Early Noise Dropping (\textsc{END}), a novel approach to mitigate this issue without requiring fine-tuning the LLMs. \textsc{END} segments input sequences into chunks and employs a linear prober on the early layers of LLMs to differentiate between informative and noisy chunks. By discarding noisy chunks early in the process, \textsc{END} preserves critical information, reduces distraction, and lowers computational overhead. Extensive experiments demonstrate that \textsc{END} significantly improves both performance and efficiency across different LLMs on multiple evaluation datasets. Furthermore, by investigating LLMs' implicit understanding to the input with the prober, this work also deepens understanding of how LLMs do reasoning with contexts internally.
Chinese: 本研究提出了早期噪声丢弃(END)方法,利用大语言模型的早期层识别并剔除无关输入片段,无需微调即可提升模型性能和效率。
English: The study introduces Early Noise Dropping (END), a method that identifies and discards irrelevant input chunks using early LLM layers, enhancing performance and efficiency without fine-tuning.

Authors:Xiaohua Wu, Xiaohui Tao, Wenjie Wu, Yuefeng Li, Lin Li
Title: Random Forest-of-Thoughts: Uncertainty-aware Reasoning for Computational Social Science
Abstract:
Social surveys in computational social science are well-designed by elaborate domain theories that can effectively reflect the interviewee's deep thoughts without concealing their true feelings. The candidate questionnaire options highly depend on the interviewee's previous answer, which results in the complexity of social survey analysis, the time, and the expertise required. The ability of large language models (LLMs) to perform complex reasoning is well-enhanced by prompting learning such as Chain-of-thought (CoT) but still confined to left-to-right decision-making processes or limited paths during inference. This means they can fall short in problems that require exploration and uncertainty searching. In response, a novel large language model prompting method, called Random Forest of Thoughts (RFoT), is proposed for generating uncertainty reasoning to fit the area of computational social science. The RFoT allows LLMs to perform deliberate decision-making by generating diverse thought space and randomly selecting the sub-thoughts to build the forest of thoughts. It can extend the exploration and prediction of overall performance, benefiting from the extensive research space of response. The method is applied to optimize computational social science analysis on two datasets covering a spectrum of social survey analysis problems. Our experiments show that RFoT significantly enhances language models' abilities on two novel social survey analysis problems requiring non-trivial reasoning.
中文摘要:提出的“思维随机森林”(RFoT)方法通过生成多样化思维路径,增强大语言模型在计算社会科学调查中的推理能力,实现更优的探索与不确定性处理。
English Summary: The proposed Random Forest of Thoughts (RFoT) method enhances large language models' reasoning capabilities by generating diverse thought paths for improved exploration and uncertainty handling in computational social science surveys.

Authors:Chengkun Cai, Haoliang Liu, Xu Zhao, Zhongyu Jiang, Tianfang Zhang, Zongkai Wu, John Lee, Jenq-Neng Hwang, Lei Li
Title: Bayesian Optimization for Controlled Image Editing via LLMs
Abstract:
In the rapidly evolving field of image generation, achieving precise control over generated content and maintaining semantic consistency remain significant limitations, particularly concerning grounding techniques and the necessity for model fine-tuning. To address these challenges, we propose BayesGenie, an off-the-shelf approach that integrates Large Language Models (LLMs) with Bayesian Optimization to facilitate precise and user-friendly image editing. Our method enables users to modify images through natural language descriptions without manual area marking, while preserving the original image's semantic integrity. Unlike existing techniques that require extensive pre-training or fine-tuning, our approach demonstrates remarkable adaptability across various LLMs through its model-agnostic design. BayesGenie employs an adapted Bayesian optimization strategy to automatically refine the inference process parameters, achieving high-precision image editing with minimal user intervention. Through extensive experiments across diverse scenarios, we demonstrate that our framework significantly outperforms existing methods in both editing accuracy and semantic preservation, as validated using different LLMs including Claude3 and GPT-4.
中文: BayesGenie提出了一种与模型无关的方法,通过将大型语言模型与贝叶斯优化相结合,无需微调即可实现基于自然语言的精确图像编辑,在不同场景下显著提升了编辑准确性和语义保持能力。
English: BayesGenie introduces a model-agnostic approach that combines Large Language Models with Bayesian Optimization for precise, natural language-based image editing without requiring fine-tuning, significantly enhancing accuracy and semantic consistency across various scenarios.

Authors:Xiaoqing Zhang, Yuhan Liu, Flood Sung, Xiuying Chen, Shuo Shang, Rui Yan
Title: Thinking Before Running! Efficient Code Generation with Thorough Exploration and Optimal Refinement
Abstract:
Code generation is crucial in software engineering for automating the coding process efficiently. While test-time computation methods show promise, they suffer from high latency due to multiple computation rounds. To overcome this, we introduce \textbf{ThinkCoder}, a framework that combines thorough exploration with optimal refinement. The exploration phase diversifies the solution space by searching for potential solutions, followed by a refinement phase that enhances precision. This approach allows us to select the best solution through careful consideration before taking action, avoiding excessive trial and error. To further minimize test-time computation overhead, we introduce preference-driven optimization with Reinforced Self-Training (ReST), which uses exploration trajectories from ThinkCoder to guide LLM's evolution. This approach enhances LLM's exploration efficiency via preference learning, cutting costs while maintaining accuracy. ThinkCoder boosts the performance with a single LLM, excelling on benchmarks like HumanEval and MBPP. Compared to SOTA models, it improves Pass@1 by 3.0\% over MapCoder with just 6.4\% of the computation cost. Against AgentCoder, ThinkCoder achieves a 0.5\% higher Pass@1 after 2 rounds, outperforming AgentCoder's 5 rounds. Additionally, ReST with success trajectories enhances efficiency, allowing models like LLaMA2-7B to achieve competitive results using only 20\% of the computational resources. These results highlight the framework's effectiveness and scalability.
中文: ThinkCoder通过结合探索与优化阶段,在降低计算成本的同时显著提升了代码生成的效率和准确性,在基准测试中以最少的资源消耗实现了卓越性能。
English: ThinkCoder enhances code generation by combining exploration and refinement phases, significantly improving efficiency and accuracy while reducing computational costs, as demonstrated by superior performance on benchmarks with minimal resource usage.

Authors:Sadia Qureshi, Thanveer Shaik, Xiaohui Tao, Haoran Xie, Lin Li, Jianming Yong, Xiaohua Jia
Title: Exploring Incremental Unlearning: Techniques, Challenges, and Future Directions
Abstract:
The growing demand for data privacy in Machine Learning (ML) applications has seen Machine Unlearning (MU) emerge as a critical area of research. As the `right to be forgotten' becomes regulated globally, it is increasingly important to develop mechanisms that delete user data from AI systems while maintaining performance and scalability of these systems. Incremental Unlearning (IU) is a promising MU solution to address the challenges of efficiently removing specific data from ML models without the need for expensive and time-consuming full retraining. This paper presents the various techniques and approaches to IU. It explores the challenges faced in designing and implementing IU mechanisms. Datasets and metrics for evaluating the performance of unlearning techniques are discussed as well. Finally, potential solutions to the IU challenges alongside future research directions are offered. This survey provides valuable insights for researchers and practitioners seeking to understand the current landscape of IU and its potential for enhancing privacy-preserving intelligent systems.
中文: 本文综述了增量遗忘技术,旨在解决如何在遵守隐私法规的同时,高效地从机器学习模型中删除特定数据,并保持系统性能与可扩展性。
English: This paper surveys Incremental Unlearning (IU) techniques, addressing the need to efficiently remove specific data from machine learning models to comply with privacy regulations while maintaining system performance and scalability.

Authors:Mansour Al Ghanim, Jiaqi Xue, Rochana Prih Hastuti, Mengxin Zheng, Yan Solihin, Qian Lou
Title: Evaluating the Robustness and Accuracy of Text Watermarking Under Real-World Cross-Lingual Manipulations
Abstract:
We present a study to benchmark representative watermarking methods in cross-lingual settings. The current literature mainly focuses on the evaluation of watermarking methods for the English language. However, the literature for evaluating watermarking in cross-lingual settings is scarce. This results in overlooking important adversary scenarios in which a cross-lingual adversary could be in, leading to a gray area of practicality over cross-lingual watermarking. In this paper, we evaluate four watermarking methods in four different and vocabulary rich languages. Our experiments investigate the quality of text under different watermarking procedure and the detectability of watermarks with practical translation attack scenarios. Specifically, we investigate practical scenarios that an adversary with cross-lingual knowledge could take, and evaluate whether current watermarking methods are suitable for such scenarios. Finally, from our findings, we draw key insights about watermarking in cross-lingual settings.
中文: 本研究对四种水印方法在多种语言中进行基准测试,评估其在跨语言场景下的表现,填补了该领域现有评估的空白,并探讨了涉及翻译的实际对抗攻击。
English: This study benchmarks four watermarking methods across multiple languages to assess their performance in cross-lingual scenarios, addressing the current lack of evaluation in this area and exploring practical adversary attacks involving translation.

Authors:Yuyi Huang, Runzhe Zhan, Derek F. Wong, Lidia S. Chao, Ailin Tao
Title: Intrinsic Model Weaknesses: How Priming Attacks Unveil Vulnerabilities in Large Language Models
Abstract:
Large language models (LLMs) have significantly influenced various industries but suffer from a critical flaw, the potential sensitivity of generating harmful content, which poses severe societal risks. We developed and tested novel attack strategies on popular LLMs to expose their vulnerabilities in generating inappropriate content. These strategies, inspired by psychological phenomena such as the "Priming Effect", "Safe Attention Shift", and "Cognitive Dissonance", effectively attack the models' guarding mechanisms. Our experiments achieved an attack success rate (ASR) of 100% on various open-source models, including Meta's Llama-3.2, Google's Gemma-2, Mistral's Mistral-NeMo, Falcon's Falcon-mamba, Apple's DCLM, Microsoft's Phi3, and Qwen's Qwen2.5, among others. Similarly, for closed-source models such as OpenAI's GPT-4o, Google's Gemini-1.5, and Claude-3.5, we observed an ASR of at least 95% on the AdvBench dataset, which represents the current state-of-the-art. This study underscores the urgent need to reassess the use of generative models in critical applications to mitigate potential adverse societal impacts.
中文摘要:本研究通过展示新型攻击策略,在开源和闭源大语言模型上均实现了近乎完美的安全机制绕过成功率,揭示了模型存在的严重漏洞,强调亟需重新评估其在敏感应用中的部署。
English Summary: This study reveals critical vulnerabilities in large language models by demonstrating novel attack strategies that achieve near-perfect success rates in bypassing safety mechanisms across both open-source and closed-source models, highlighting the urgent need to reassess their deployment in sensitive applications.

Authors:Junhao Du, Chuqin Zhou, Ning Cao, Gang Chen, Yunuo Chen, Zhengxue Cheng, Li Song, Guo Lu, Wenjun Zhang
Title: Large Language Model for Lossless Image Compression with Visual Prompts
Abstract:
Recent advancements in deep learning have driven significant progress in lossless image compression. With the emergence of Large Language Models (LLMs), preliminary attempts have been made to leverage the extensive prior knowledge embedded in these pretrained models to enhance lossless image compression, particularly by improving the entropy model. However, a significant challenge remains in bridging the gap between the textual prior knowledge within LLMs and lossless image compression. To tackle this challenge and unlock the potential of LLMs, this paper introduces a novel paradigm for lossless image compression that incorporates LLMs with visual prompts. Specifically, we first generate a lossy reconstruction of the input image as visual prompts, from which we extract features to serve as visual embeddings for the LLM. The residual between the original image and the lossy reconstruction is then fed into the LLM along with these visual embeddings, enabling the LLM to function as an entropy model to predict the probability distribution of the residual. Extensive experiments on multiple benchmark datasets demonstrate our method achieves state-of-the-art compression performance, surpassing both traditional and learning-based lossless image codecs. Furthermore, our approach can be easily extended to images from other domains, such as medical and screen content images, achieving impressive performance. These results highlight the potential of LLMs for lossless image compression and may inspire further research in related directions.
中文摘要:本文提出了一种融合大语言模型与视觉提示的新型无损图像压缩方法,通过将LLM作为残差概率预测的熵模型,实现了最先进的压缩性能并展示了跨领域应用的潜力。
English Summary: This paper introduces a novel lossless image compression method that integrates Large Language Models with visual prompts, achieving state-of-the-art performance by using LLMs as entropy models for residual probability prediction.

Authors:Xueran Han, Yuhan Liu, Mingzhe Li, Wei Liu, Sen Hu, Rui Yan, Zhiqiang Xu, Xiuying Chen
Title: Pastiche Novel Generation Creating: Fan Fiction You Love in Your Favorite Author's Style
Abstract:
Great novels create immersive worlds with rich character arcs, well-structured plots, and nuanced writing styles. However, current novel generation methods often rely on brief, simplistic story outlines and generate details using plain, generic language. To bridge this gap, we introduce the task of Pastiche Novel Generation, which requires the generated novels to imitate the distinctive features of the original work, including understanding character profiles, predicting plausible plot developments, and writing concrete details using vivid, expressive language. To achieve this, we propose WriterAgent, a novel generation system designed to master the core aspects of literary pastiche. WriterAgent is trained through a curriculum learning paradigm, progressing from low-level stylistic mastery to high-level narrative coherence. Its key tasks include language style learning, character modeling, plot planning, and stylish writing, ensuring comprehensive narrative control. To support this, WriterAgent leverages the WriterLoRA framework, an extension of LoRA with hierarchical and cumulative task-specific modules, each specializing in a different narrative aspect. We evaluate WriterAgent on multilingual classics like Harry Potter and Dream of the Red Chamber, demonstrating its superiority over baselines in capturing the target author's settings, character dynamics, and writing style to produce coherent, faithful narratives.
中文摘要:本文提出仿作小说生成任务,要求生成的小说模仿原著的独特特征,并设计了通过课程学习训练的WriterAgent系统,其分层模块在《哈利·波特》《红楼梦》等多语言经典作品中,比基线模型更能准确还原作者设定、人物关系和叙事连贯性。
English Summary: The paper introduces Pastiche Novel Generation, a task requiring novels to imitate original works' distinctive features, and proposes WriterAgent—a system trained through curriculum learning with specialized modules—which outperforms baselines in capturing authorial style, character dynamics, and narrative coherence across multilingual classics.

Authors:Weilan Wang, Yu Mao, Dongdong Tang, Hongchao Du, Nan Guan, Chun Jason Xue
Title: When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models
Abstract:
Large language models (LLMs) exhibit excellent performance in various tasks. However, the memory requirements of LLMs present a great challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a framework to compress LLM after quantization further, achieving about 2.2x compression ratio. A compression-aware quantization is first proposed to enhance model weight compressibility by re-scaling the model parameters before quantization, followed by a pruning method to improve further. Upon this, we notice that decompression can be a bottleneck during practical scenarios. We then give a detailed analysis of the trade-off between memory usage and latency brought by the proposed method. A speed-adaptive method is proposed to overcome it. The experimental results show inference with the compressed model can achieve a 40% reduction in memory size with negligible loss in accuracy and inference speed.
中文: 本文提出了一种压缩框架,通过采用压缩感知量化和剪枝技术,进一步减少量化后大语言模型的内存占用,实现了40%的内存降低,同时对精度和速度影响甚微。
English: This paper introduces a compression framework that further reduces the memory footprint of quantized large language models by employing compression-aware quantization and pruning, achieving a 40% memory reduction with minimal impact on accuracy and speed.

Authors:Chengyan Ma, Ruidong Han, Jieke Shi, Ye Liu, Yuqing Niu, Di Lu, Chuang Tian, Jianfeng Ma, Debin Gao, David Lo
Title: DITING: A Static Analyzer for Identifying Bad Partitioning Issues in TEE Applications
Abstract:
Trusted Execution Environment (TEE) enhances the security of mobile applications and cloud services by isolating sensitive code in the secure world from the non-secure normal world. However, TEE applications are still confronted with vulnerabilities stemming from bad partitioning. Bad partitioning can lead to critical security problems of TEE, such as leaking sensitive data to the normal world or being adversely affected by malicious inputs from the normal world. To address this, we propose an approach to detect partitioning issues in TEE applications. First, we conducted a survey of TEE vulnerabilities caused by bad partitioning and found that the parameters exchanged between the secure and normal worlds often contain insecure usage with bad partitioning implementation. Second, we developed a tool named DITING that can analyze data-flows of these parameters and identify their violations of security rules we defined to find bad partitioning issues. Different from existing research that only focuses on malicious input to TEE, we assess the partitioning issues more comprehensively through input/output and shared memory. Finally, we created the first benchmark targeting bad partitioning, consisting of 110 test cases. Experiments demonstrate that DITING achieves an F1 score of 0.90 in identifying bad partitioning issues.
Chinese: 可信执行环境(TEE)因错误分区面临敏感数据泄露或恶意输入的安全风险,为此开发了DITING工具,能高效检测此类问题,F1分数高达0.90。
English: The Trusted Execution Environment (TEE) faces security risks from bad partitioning, which can leak sensitive data or allow malicious inputs, prompting the development of DITING, a tool that effectively detects these issues with a high F1 score of 0.90.

Authors:Leena Mathur, Marian Qian, Paul Pu Liang, Louis-Philippe Morency
Title: Social Genome: Grounded Social Reasoning Abilities of Multimodal Models
Abstract:
Social reasoning abilities are crucial for AI systems to effectively interpret and respond to multimodal human communication and interaction within social contexts. We introduce SOCIAL GENOME, the first benchmark for fine-grained, grounded social reasoning abilities of multimodal models. SOCIAL GENOME contains 272 videos of interactions and 1,486 human-annotated reasoning traces related to inferences about these interactions. These traces contain 5,777 reasoning steps that reference evidence from visual cues, verbal cues, vocal cues, and external knowledge (contextual knowledge external to videos). SOCIAL GENOME is also the first modeling challenge to study external knowledge in social reasoning. SOCIAL GENOME computes metrics to holistically evaluate semantic and structural qualities of model-generated social reasoning traces. We demonstrate the utility of SOCIAL GENOME through experiments with state-of-the-art models, identifying performance gaps and opportunities for future research to improve the grounded social reasoning abilities of multimodal models.
Chinese: SOCIAL GENOME是首个通过272个视频和1,486条人工标注推理轨迹来评估多模态模型细粒度社交推理能力的基准,其综合评估指标揭示了现有模型的性能差距与改进方向。
English: SOCIAL GENOME is the first benchmark for evaluating multimodal models' fine-grained social reasoning abilities using 272 videos and 1,486 human-annotated reasoning traces, revealing performance gaps and research opportunities through holistic evaluation metrics.

Authors:Shangyu Wu, Hongchao Du, Ying Xiong, Shuai Chen, Tei-Wei Kuo, Nan Guan, Chun Jason Xue
Title: EvoP: Robust LLM Inference via Evolutionary Pruning
Abstract:
Large Language Models (LLMs) have achieved remarkable success in natural language processing tasks, but their massive size and computational demands hinder their deployment in resource-constrained environments. Existing model pruning methods address this issue by removing redundant structures (e.g., elements, channels, layers) from the model. However, these methods employ a heuristic pruning strategy, which leads to suboptimal performance. Besides, they also ignore the data characteristics when pruning the model. To overcome these limitations, we propose EvoP, an evolutionary pruning framework for robust LLM inference. EvoP first presents a cluster-based calibration dataset sampling (CCDS) strategy for creating a more diverse calibration dataset. EvoP then introduces an evolutionary pruning pattern searching (EPPS) method to find the optimal pruning pattern. Compared to existing model pruning techniques, EvoP achieves the best performance while maintaining the best efficiency. Experiments across different LLMs and different downstream tasks validate the effectiveness of the proposed EvoP, making it a practical and scalable solution for deploying LLMs in real-world applications.
中文: 提出的EvoP框架通过进化剪枝模式搜索和多样化校准数据克服了启发式剪枝的局限性,在资源受限环境中实现了大语言模型部署的最佳性能和效率。
English: The proposed EvoP framework overcomes limitations of heuristic pruning by using evolutionary pattern searching and diverse calibration data, achieving optimal performance and efficiency for deploying large language models in resource-constrained environments.

Authors:Mingni Tang, Jiajia Li, Lu Yang, Zhiqiang Zhang, Jinghao Tian, Zuchao Li, Lefei Zhang, Ping Wang
Title: NOTA: Multimodal Music Notation Understanding for Visual Large Language Model
Abstract:
Symbolic music is represented in two distinct forms: two-dimensional, visually intuitive score images, and one-dimensional, standardized text annotation sequences. While large language models have shown extraordinary potential in music, current research has primarily focused on unimodal symbol sequence text. Existing general-domain visual language models still lack the ability of music notation understanding. Recognizing this gap, we propose NOTA, the first large-scale comprehensive multimodal music notation dataset. It consists of 1,019,237 records, from 3 regions of the world, and contains 3 tasks. Based on the dataset, we trained NotaGPT, a music notation visual large language model. Specifically, we involve a pre-alignment training phase for cross-modal alignment between the musical notes depicted in music score images and their textual representation in ABC notation. Subsequent training phases focus on foundational music information extraction, followed by training on music notation analysis. Experimental results demonstrate that our NotaGPT-7B achieves significant improvement on music understanding, showcasing the effectiveness of NOTA and the training pipeline. Our datasets are open-sourced at https://huggingface.co/datasets/MYTH-Lab/NOTA-dataset.
中文摘要:本研究提出了首个大规模多模态音乐符号数据集NOTA,并基于此训练了NotaGPT模型,旨在弥合乐谱图像与文本标注之间的鸿沟,显著提升了音乐理解能力。
English Summary: The study introduces NOTA, a large multimodal dataset for music notation, and NotaGPT, a model trained on it to bridge the gap between visual score images and text annotations, achieving enhanced music understanding.

Authors:Pengfei He, Yupin Lin, Shen Dong, Han Xu, Yue Xing, Hui Liu
Title: Red-Teaming LLM Multi-Agent Systems via Communication Attacks
Abstract:
Large Language Model-based Multi-Agent Systems (LLM-MAS) have revolutionized complex problem-solving capability by enabling sophisticated agent collaboration through message-based communications. While the communication framework is crucial for agent coordination, it also introduces a critical yet unexplored security vulnerability. In this work, we introduce Agent-in-the-Middle (AiTM), a novel attack that exploits the fundamental communication mechanisms in LLM-MAS by intercepting and manipulating inter-agent messages. Unlike existing attacks that compromise individual agents, AiTM demonstrates how an adversary can compromise entire multi-agent systems by only manipulating the messages passing between agents. To enable the attack under the challenges of limited control and role-restricted communication format, we develop an LLM-powered adversarial agent with a reflection mechanism that generates contextually-aware malicious instructions. Our comprehensive evaluation across various frameworks, communication structures, and real-world applications demonstrates that LLM-MAS is vulnerable to communication-based attacks, highlighting the need for robust security measures in multi-agent systems.
Chinese: 本研究提出了“中间代理”(AiTM)攻击,通过拦截和操纵大型语言模型多智能体系统(LLM-MAS)中的通信消息来利用其安全漏洞,能在不直接攻击单个智能体的情况下危及整个系统安全。
English: The study introduces Agent-in-the-Middle (AiTM), a novel attack that exploits communication vulnerabilities in Large Language Model-based Multi-Agent Systems (LLM-MAS) by intercepting and manipulating messages, compromising entire systems without directly targeting individual agents.

Authors:Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub
Title: FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis
Abstract:
Foundation models are becoming increasingly effective in the medical domain, offering pre-trained models on large datasets that can be readily adapted for downstream tasks. Despite progress, fetal ultrasound images remain a challenging domain for foundation models due to their inherent complexity, often requiring substantial additional training and facing limitations due to the scarcity of paired multimodal data. To overcome these challenges, here we introduce FetalCLIP, a vision-language foundation model capable of generating universal representation of fetal ultrasound images. FetalCLIP was pre-trained using a multimodal learning approach on a diverse dataset of 210,035 fetal ultrasound images paired with text. This represents the largest paired dataset of its kind used for foundation model development to date. This unique training approach allows FetalCLIP to effectively learn the intricate anatomical features present in fetal ultrasound images, resulting in robust representations that can be used for a variety of downstream applications. In extensive benchmarking across a range of key fetal ultrasound applications, including classification, gestational age estimation, congenital heart defect (CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all baselines while demonstrating remarkable generalizability and strong performance even with limited labeled data. We plan to release the FetalCLIP model publicly for the benefit of the broader scientific community.
Chinese: FetalCLIP是一种基于210,035张胎儿超声图像与文本配对数据训练的多模态基础模型,它克服了胎儿超声图像复杂性带来的挑战,在分类、孕周估算和先心病检测等多项任务中均优于现有基准模型,并展现出卓越的泛化能力。
English: FetalCLIP is a vision-language foundation model pre-trained on a large dataset of 210,035 fetal ultrasound images with text, which overcomes the challenges of complex fetal ultrasound data and outperforms all baselines in various applications like classification and CHD detection while demonstrating strong generalizability.

Authors:Juraj Vladika, Ivana Hacajová, Florian Matthes
Title: Step-by-Step Fact Verification System for Medical Claims with Explainable Reasoning
Abstract:
Fact verification (FV) aims to assess the veracity of a claim based on relevant evidence. The traditional approach for automated FV includes a three-part pipeline relying on short evidence snippets and encoder-only inference models. More recent approaches leverage the multi-turn nature of LLMs to address FV as a step-by-step problem where questions inquiring additional context are generated and answered until there is enough information to make a decision. This iterative method makes the verification process rational and explainable. While these methods have been tested for encyclopedic claims, exploration on domain-specific and realistic claims is missing. In this work, we apply an iterative FV system on three medical fact-checking datasets and evaluate it with multiple settings, including different LLMs, external web search, and structured reasoning using logic predicates. We demonstrate improvements in the final performance over traditional approaches and the high potential of step-by-step FV systems for domain-specific claims.
中文: 本研究将迭代式事实核查系统应用于医学数据集,相比传统方法展现出性能提升,并凸显了逐步验证方法在领域特定声明中的潜力。
English: This study applies an iterative fact verification system to medical datasets, demonstrating improved performance over traditional methods and highlighting the potential of step-by-step approaches for domain-specific claims.

Authors:Juraj Vladika, Florian Matthes
Title: On the Influence of Context Size and Model Choice in Retrieval-Augmented Generation Systems
Abstract:
Retrieval-augmented generation (RAG) has emerged as an approach to augment large language models (LLMs) by reducing their reliance on static knowledge and improving answer factuality. RAG retrieves relevant context snippets and generates an answer based on them. Despite its increasing industrial adoption, systematic exploration of RAG components is lacking, particularly regarding the ideal size of provided context, and the choice of base LLM and retrieval method. To help guide development of robust RAG systems, we evaluate various context sizes, BM25 and semantic search as retrievers, and eight base LLMs. Moving away from the usual RAG evaluation with short answers, we explore the more challenging long-form question answering in two domains, where a good answer has to utilize the entire context. Our findings indicate that final QA performance improves steadily with up to 15 snippets but stagnates or declines beyond that. Finally, we show that different general-purpose LLMs excel in the biomedical domain than the encyclopedic one, and that open-domain evidence retrieval in large corpora is challenging.
中文: 检索增强生成(RAG)通过引入检索到的上下文片段来增强大语言模型的答案准确性,研究发现使用最多15个片段效果最佳,且不同领域和检索方法的表现存在显著差异。
English: Retrieval-augmented generation (RAG) enhances large language models by incorporating retrieved context to improve factual accuracy, with optimal performance achieved using up to 15 context snippets and varying effectiveness across domains and retrieval methods.

Authors:Yifu Chen, Shengpeng Ji, Haoxiao Wang, Ziqing Wang, Siyu Chen, Jinzheng He, Jin Xu, Zhou Zhao
Title: WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models
Abstract:
Retrieval Augmented Generation (RAG) has gained widespread adoption owing to its capacity to empower large language models (LLMs) to integrate external knowledge. However, existing RAG frameworks are primarily designed for text-based LLMs and rely on Automatic Speech Recognition to process speech input, which discards crucial audio information, risks transcription errors, and increases computational overhead. Therefore, we introduce WavRAG, the first retrieval augmented generation framework with native, end-to-end audio support. WavRAG offers two key features: 1) Bypassing ASR, WavRAG directly processes raw audio for both embedding and retrieval. 2) WavRAG integrates audio and text into a unified knowledge representation. Specifically, we propose the WavRetriever to facilitate the retrieval from a text-audio hybrid knowledge base, and further enhance the in-context capabilities of spoken dialogue models through the integration of chain-of-thought reasoning. In comparison to state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval performance while delivering a 10x acceleration. Furthermore, WavRAG's unique text-audio hybrid retrieval capability extends the boundaries of RAG to the audio modality.
中文摘要:WavRAG是首个原生支持端到端音频处理的检索增强生成框架,通过直接处理原始音频实现了10倍加速,在保持同等检索性能的同时将RAG能力扩展至音频模态。
English Summary: WavRAG is the first native audio retrieval augmented generation framework that bypasses ASR to process raw audio directly, achieving 10x acceleration while maintaining comparable retrieval performance and extending RAG capabilities to audio modality.

Authors:Shenglai Zeng, Pengfei He, Kai Guo, Tianqi Zheng, Hanqing Lu, Yue Xing, Hui Liu
Title: Towards Context-Robust LLMs: A Gated Representation Fine-tuning Approach
Abstract:
Large Language Models (LLMs) enhanced with external contexts, such as through retrieval-augmented generation (RAG), often face challenges in handling imperfect evidence. They tend to over-rely on external knowledge, making them vulnerable to misleading and unhelpful contexts. To address this, we propose the concept of context-robust LLMs, which can effectively balance internal knowledge with external context, similar to human cognitive processes. Specifically, context-robust LLMs should rely on external context only when lacking internal knowledge, identify contradictions between internal and external knowledge, and disregard unhelpful contexts. To achieve this goal, we introduce Grft, a lightweight and plug-and-play gated representation fine-tuning approach. Grft consists of two key components: a gating mechanism to detect and filter problematic inputs, and low-rank representation adapters to adjust hidden representations. By training a lightweight intervention function with only 0.0004\% of model size on fewer than 200 examples, Grft can effectively adapt LLMs towards context-robust behaviors.
中文: 本研究提出Grft轻量微调方法,通过门控机制和表征适配器使大语言模型能选择性利用外部语境,仅需少量训练即可增强其抗干扰能力。
English: This study introduces Grft, a lightweight fine-tuning method that enhances large language models' robustness by enabling them to selectively use external contexts through gating mechanisms and representation adapters, requiring minimal training data.

Authors:Peiwen Yuan, Yueqi Zhang, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
Title: Beyond One-Size-Fits-All: Tailored Benchmarks for Efficient Evaluation
Abstract:
Evaluating models on large benchmarks is very resource-intensive, especially during the period of rapid model evolution. Existing efficient evaluation methods estimate the performance of target models by testing them only on a small and static coreset of the benchmark, which is derived from the publicly available evaluation results of source models. These methods rely on the assumption that target models have high prediction consistency with source models. However, we demonstrate that it doesn't generalize well in practice. To alleviate the inconsistency issue, we present TailoredBench, a method that conducts customized evaluation tailored to each target model. Specifically, a Global-coreset is first constructed as a probe to identify the most consistent source models for each target model with an adaptive source model selection strategy. Afterwards, a scalable K-Medoids clustering algorithm is proposed to extend the Global-coreset to a tailored Native-coreset for each target model. According to the predictions on Native-coresets, we obtain the performance of target models on the whole benchmark with a calibrated estimation strategy. Comprehensive experiments on 5 benchmarks across over 300 models demonstrate that compared to best performing baselines, TailoredBench achieves an average reduction of 31.4% in MAE of accuracy estimates under the same inference budgets, showcasing strong effectiveness and generalizability.
中文: TailoredBench通过自适应源模型选择和聚类为每个目标模型定制专属评估集,在同等计算预算下将性能评估误差平均降低31.4%,有效解决了传统静态核心集评估的局限性。
English: TailoredBench addresses the limitations of static coreset evaluation by creating customized Native-coresets for each target model through adaptive source model selection and clustering, achieving a 31.4% reduction in estimation error under equivalent inference budgets.

Authors:Peiwen Yuan, Chuyi Tan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Boyuan Pan, Yao Hu, Kan Li
Title: From Sub-Ability Diagnosis to Human-Aligned Generation: Bridging the Gap for Text Length Control via MARKERGEN
Abstract:
Despite the rapid progress of large language models (LLMs), their length-controllable text generation (LCTG) ability remains below expectations, posing a major limitation for practical applications. Existing methods mainly focus on end-to-end training to reinforce adherence to length constraints. However, the lack of decomposition and targeted enhancement of LCTG sub-abilities restricts further progress. To bridge this gap, we conduct a bottom-up decomposition of LCTG sub-abilities with human patterns as reference and perform a detailed error analysis. On this basis, we propose MarkerGen, a simple-yet-effective plug-and-play approach that:(1) mitigates LLM fundamental deficiencies via external tool integration;(2) conducts explicit length modeling with dynamically inserted markers;(3) employs a three-stage generation scheme to better align length constraints while maintaining content quality. Comprehensive experiments demonstrate that MarkerGen significantly improves LCTG across various settings, exhibiting outstanding effectiveness and generalizability.
中文: MarkerGen是一种创新的即插即用方法,通过整合外部工具、使用动态标记进行显式长度建模,并实施三阶段生成过程,显著提升了大型语言模型在各种设置下的长度可控文本生成能力。
English: MarkerGen is an innovative plug-and-play method that enhances length-controllable text generation in LLMs by integrating external tools, using dynamic markers for explicit length modeling, and implementing a three-stage generation process, significantly improving performance across various settings.

Authors:Chaofan Li, Zheng Liu, Jianlyv Chen, Defu Lian, Yingxia Shao
Title: Reinforced Information Retrieval
Abstract:
While retrieval techniques are widely used in practice, they still face significant challenges in cross-domain scenarios. Recently, generation-augmented methods have emerged as a promising solution to this problem. These methods enhance raw queries by incorporating additional information from an LLM-based generator, facilitating more direct retrieval of relevant documents. However, existing methods struggle with highly specialized situations that require extensive domain expertise. To address this problem, we present \textbf{Reinforced-IR}, a novel approach that jointly adapts a pre-trained retriever and generator for precise cross-domain retrieval. A key innovation of Reinforced-IR is its \textbf{Self-Boosting} framework, which enables retriever and generator to learn from each other's feedback. Specifically, the generator is reinforced to generate query augmentations that enhance the retriever's performance, while the retriever is trained to better discriminate the relevant documents identified by the generator. This iterative process allows the end-to-end retrieval performance to be progressively optimized using an unlabeled corpus from the target domain. In our experiment, Reinforced-IR outperforms existing domain adaptation methods by a large margin, leading to substantial improvements in retrieval quality across a wide range of application scenarios.
中文: Reinforced-IR通过其自增强框架使检索器和生成器在相互反馈中协同适应,利用目标领域的无标注数据逐步优化跨领域检索性能。
English: Reinforced-IR introduces a Self-Boosting framework that jointly adapts a retriever and generator through mutual feedback, enabling progressive optimization of cross-domain retrieval performance using unlabeled target domain data.

Authors:Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
Title: UniCBE: An Uniformity-driven Comparing Based Evaluation Framework with Unified Multi-Objective Optimization
Abstract:
Human preference plays a significant role in measuring large language models and guiding them to align with human values. Unfortunately, current comparing-based evaluation (CBE) methods typically focus on a single optimization objective, failing to effectively utilize scarce yet valuable preference signals. To address this, we delve into key factors that can enhance the accuracy, convergence, and scalability of CBE: suppressing sampling bias, balancing descending process of uncertainty, and mitigating updating uncertainty. Following the derived guidelines, we propose UniCBE, a unified uniformity-driven CBE framework which simultaneously optimize these core objectives by constructing and integrating three decoupled sampling probability matrices, each designed to ensure uniformity in specific aspects. We further ablate the optimal tuple sampling and preference aggregation strategies to achieve efficient CBE. On the AlpacaEval benchmark, UniCBE saves over 17% of evaluation budgets while achieving a Pearson correlation with ground truth exceeding 0.995, demonstrating excellent accuracy and convergence. In scenarios where new models are continuously introduced, UniCBE can even save over 50% of evaluation costs, highlighting its improved scalability.
中文:UniCBE作为一种统一的评估框架,通过优化抑制偏差和管理不确定性等关键因素,提升了准确性、收敛性和可扩展性,在显著降低评估成本的同时实现了与真实结果的高度相关性。
English: UniCBE is a unified evaluation framework that enhances accuracy, convergence, and scalability by optimizing key factors like bias suppression and uncertainty management, achieving high correlation with ground truth while significantly reducing evaluation costs.

Authors:Ze Liu, Zhengyang Liang, Junjie Zhou, Zheng Liu, Defu Lian
Title: Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval
Abstract:
With the popularity of multimodal techniques, it receives growing interests to acquire useful information in visual forms. In this work, we formally define an emerging IR paradigm called \textit{Visualized Information Retrieval}, or \textbf{Vis-IR}, where multimodal information, such as texts, images, tables and charts, is jointly represented by a unified visual format called \textbf{Screenshots}, for various retrieval applications. We further make three key contributions for Vis-IR. First, we create \textbf{VIRA} (Vis-IR Aggregation), a large-scale dataset comprising a vast collection of screenshots from diverse sources, carefully curated into captioned and question-answer formats. Second, we develop \textbf{UniSE} (Universal Screenshot Embeddings), a family of retrieval models that enable screenshots to query or be queried across arbitrary data modalities. Finally, we construct \textbf{MVRB} (Massive Visualized IR Benchmark), a comprehensive benchmark covering a variety of task forms and application scenarios. Through extensive evaluations on MVRB, we highlight the deficiency from existing multimodal retrievers and the substantial improvements made by UniSE. Our work will be shared with the community, laying a solid foundation for this emerging field.
中文: 本文提出可视化信息检索新范式,通过截图统一多模态数据表示,并贡献VIRA数据集、UniSE检索模型和MVRB基准测试,实验证明其相较现有方法取得显著提升。
English: This paper introduces Visualized Information Retrieval (Vis-IR), a paradigm using screenshots to unify multimodal data, and presents three key contributions: the VIRA dataset, UniSE retrieval models, and the MVRB benchmark, demonstrating significant advancements over existing methods.

Authors:Jiayi Shi, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Huan Ren, Yao Hu, Kan Li
Title: InsBank: Evolving Instruction Subset for Ongoing Alignment
Abstract:
Large language models (LLMs) typically undergo instruction tuning to enhance alignment. Recent studies emphasize that quality and diversity of instruction data are more crucial than quantity, highlighting the need to select diverse, high-quality subsets to reduce training costs. However, how to evolve these selected subsets alongside the development of new instruction data remains insufficiently explored. To achieve LLMs' ongoing alignment, we introduce Instruction Bank (\textbf{InsBank}), a continuously updated repository that integrates the latest valuable instruction data. We further propose Progressive Instruction Bank Evolution (\textbf{PIBE}), a novel framework designed to evolve InsBank effectively and efficiently over time. PIBE employs a gradual data selection strategy to maintain long-term efficiency, leveraging a representation-based diversity score to capture relationships between data points and retain historical information for comprehensive diversity evaluation. This also allows for flexible combination of diversity and quality scores during data selection and ranking. Extensive experiments demonstrate that PIBE significantly outperforms baselines in InsBank evolution and is able to extract budget-specific subsets, demonstrating its effectiveness and adaptability.
中文摘要:本研究提出了持续更新的指令数据存储库InsBank及PIBE框架,该框架通过渐进式数据选择和多样性评分,在保持历史上下文的同时灵活结合质量指标,有效推进大语言模型的持续对齐。
English Summary: The study introduces InsBank, a continuously updated instruction data repository, and the PIBE framework that uses progressive data selection with diversity scoring to efficiently evolve LLM alignment while maintaining historical context and flexibility in combining quality metrics.

Authors:Xin Gu, Yaojie Shen, Chenxi Luo, Tiejian Luo, Yan Huang, Yuewei Lin, Heng Fan, Libo Zhang
Title: Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding
Abstract:
Transformer has attracted increasing interest in STVG, owing to its end-to-end pipeline and promising result. Existing Transformer-based STVG approaches often leverage a set of object queries, which are initialized simply using zeros and then gradually learn target position information via iterative interactions with multimodal features, for spatial and temporal localization. Despite simplicity, these zero object queries, due to lacking target-specific cues, are hard to learn discriminative target information from interactions with multimodal features in complicated scenarios (\e.g., with distractors or occlusion), resulting in degradation. Addressing this, we introduce a novel Target-Aware Transformer for STVG (TA-STVG), which seeks to adaptively generate object queries via exploring target-specific cues from the given video-text pair, for improving STVG. The key lies in two simple yet effective modules, comprising text-guided temporal sampling (TTS) and attribute-aware spatial activation (ASA), working in a cascade. The former focuses on selecting target-relevant temporal cues from a video utilizing holistic text information, while the latter aims at further exploiting the fine-grained visual attribute information of the object from previous target-aware temporal cues, which is applied for object query initialization. Compared to existing methods leveraging zero-initialized queries, object queries in our TA-STVG, directly generated from a given video-text pair, naturally carry target-specific cues, making them adaptive and better interact with multimodal features for learning more discriminative information to improve STVG. In our experiments on three benchmarks, TA-STVG achieves state-of-the-art performance and significantly outperforms the baseline, validating its efficacy.
Chinese: 提出的TA-STVG模型通过文本引导时序采样和属性感知空间激活模块生成目标相关的对象查询,相比传统零初始化方法能更有效地学习判别性信息,在三个基准测试中实现了最先进的视频定位性能。
English: The proposed Target-Aware Transformer for STVG (TA-STVG) introduces text-guided temporal sampling and attribute-aware spatial activation modules to generate target-specific object queries, significantly improving video grounding performance by enhancing discriminative learning over traditional zero-initialized queries.

Authors:Thibaud Gloaguen, Nikola Jovanović, Robin Staab, Martin Vechev
Title: Towards Watermarking of Open-Source LLMs
Abstract:
While watermarks for closed LLMs have matured and have been included in large-scale deployments, these methods are not applicable to open-source models, which allow users full control over the decoding process. This setting is understudied yet critical, given the rising performance of open-source models. In this work, we lay the foundation for systematic study of open-source LLM watermarking. For the first time, we explicitly formulate key requirements, including durability against common model modifications such as model merging, quantization, or finetuning, and propose a concrete evaluation setup. Given the prevalence of these modifications, durability is crucial for an open-source watermark to be effective. We survey and evaluate existing methods, showing that they are not durable. We also discuss potential ways to improve their durability and highlight remaining challenges. We hope our work enables future progress on this important problem.
中文: 本研究为开源大语言模型水印技术建立了系统性研究框架,强调水印需具备抗模型合并与量化等修改的鲁棒性,在揭示现有方法不足的同时指明了改进方向。
English: This study establishes a framework for watermarking open-source large language models, emphasizing the need for durability against modifications like merging and quantization, and reveals current methods' shortcomings while suggesting future improvements.

Authors:Youngwon Lee, Seung-won Hwang, Ruofan Wu, Feng Yan, Danmei Xu, Moutasem Akkad, Zhewei Yao, Yuxiong He
Title: Agentic Verification for Ambiguous Query Disambiguation
Abstract:
In this work, we tackle the challenge of disambiguating queries in retrieval-augmented generation (RAG) to diverse yet answerable interpretations. State-of-the-arts follow a Diversify-then-Verify (DtV) pipeline, where diverse interpretations are generated by an LLM, later used as search queries to retrieve supporting passages. Such a process may introduce noise in either interpretations or retrieval, particularly in enterprise settings, where LLMs -- trained on static data -- may struggle with domain-specific disambiguations. Thus, a post-hoc verification phase is introduced to prune noises. Our distinction is to unify diversification with verification by incorporating feedback from retriever and generator early on. This joint approach improves both efficiency and robustness by reducing reliance on multiple retrieval and inference steps, which are susceptible to cascading errors. We validate the efficiency and effectiveness of our method, Verified-Diversification with Consolidation (VERDICT), on the widely adopted ASQA benchmark to achieve diverse yet verifiable interpretations. Empirical results show that VERDICT improves grounding-aware F1 score by an average of 23% over the strongest baseline across different backbone LLMs.
Chinese: 本研究提出VERDICT方法,在检索增强生成中整合多样化与验证,通过减少级联错误提高效率和鲁棒性,在ASQA基准测试中实现接地感知F1分数平均提升23%。
English: This study introduces VERDICT, a method that integrates diversification with verification in retrieval-augmented generation to enhance efficiency and robustness by reducing cascading errors, achieving a 23% average improvement in grounding-aware F1 score on the ASQA benchmark.

Authors:Zhuming Wang, Yihao Zheng, Jiarui Li, Yaofei Wu, Yan Huang, Zun Li, Lifang Wu, Liang Wang
Title: VicKAM: Visual Conceptual Knowledge Guided Action Map for Weakly Supervised Group Activity Recognition
Abstract:
Existing weakly supervised group activity recognition methods rely on object detectors or attention mechanisms to capture key areas automatically. However, they overlook the semantic information associated with captured areas, which may adversely affect the recognition performance. In this paper, we propose a novel framework named Visual Conceptual Knowledge Guided Action Map (VicKAM) which effectively captures the locations of individual actions and integrates them with action semantics for weakly supervised group activity recognition.It generates individual action prototypes from training set as visual conceptual knowledge to bridge action semantics and visual representations. Guided by this knowledge, VicKAM produces action maps that indicate the likelihood of each action occurring at various locations, based on image correlation theorem. It further augments individual action maps using group activity related statistical information, representing individual action distribution under different group activities, to establish connections between action maps and specific group activities. The augmented action map is incorporated with action semantic representations for group activity recognition.Extensive experiments on two public benchmarks, the Volleyball and the NBA datasets, demonstrate the effectiveness of our proposed method, even in cases of limited training data. The code will be released later.
中文摘要:提出的VicKAM框架通过生成融合视觉表征与动作语义的动作原型,并利用统计分析生成增强动作图谱,将个体动作与群体活动相关联,从而提升了弱监督群体活动识别的性能。
English Summary: The proposed VicKAM framework enhances weakly supervised group activity recognition by generating action prototypes that integrate visual representations with action semantics, then producing augmented action maps through statistical analysis to connect individual actions with group activities.

Authors:Zhihan Zhang, Shiyang Li, Zixuan Zhang, Xin Liu, Haoming Jiang, Xianfeng Tang, Yifan Gao, Zheng Li, Haodong Wang, Zhaoxuan Tan, Yichuan Li, Qingyu Yin, Bing Yin, Meng Jiang
Title: IHEval: Evaluating Language Models on Following the Instruction Hierarchy
Abstract:
The instruction hierarchy, which establishes a priority order from system messages to user messages, conversation history, and tool outputs, is essential for ensuring consistent and safe behavior in language models (LMs). Despite its importance, this topic receives limited attention, and there is a lack of comprehensive benchmarks for evaluating models' ability to follow the instruction hierarchy. We bridge this gap by introducing IHEval, a novel benchmark comprising 3,538 examples across nine tasks, covering cases where instructions in different priorities either align or conflict. Our evaluation of popular LMs highlights their struggle to recognize instruction priorities. All evaluated models experience a sharp performance decline when facing conflicting instructions, compared to their original instruction-following performance. Moreover, the most competitive open-source model only achieves 48% accuracy in resolving such conflicts. Our results underscore the need for targeted optimization in the future development of LMs.
中文摘要:本研究提出了IHEval基准来评估语言模型对指令层级的遵循能力,发现模型在处理冲突指令时表现显著下降,凸显了未来发展中针对性优化的必要性。
English Summary: The study introduces IHEval, a benchmark to evaluate language models' adherence to the instruction hierarchy, revealing their significant difficulty in prioritizing conflicting instructions and highlighting the need for targeted improvements.

Authors:Ruiran Yan, Zheng Liu, Defu Lian
Title: O1 Embedder: Let Retrievers Think Before Action
Abstract:
The growing power of large language models (LLMs) has revolutionized how people access and utilize information. Notably, the LLMs excel at performing fine-grained data representation, which facilitates precise retrieval of information. They also generate high-quality answers based on external references, enabling the production of useful knowledge. The recent introduction of reasoning models, like OpenAI O1 and DeepSeek R1, marks another leap forward, highlighting LLMs' ability to think progressively before delivering final answers. This breakthrough significantly improves the ability to address complex tasks, e.g., coding and math proofs. Inspired by this progress, we aim to develop similar capabilities for retrieval models, which hold great promise for tackling critical challenges in the field, including multi-task retrieval, zero-shot retrieval, and tasks requiring intensive reasoning of complex relationships. With this motivation, we propose a novel approach called O1 Embedder, which generates useful thoughts for the input query before making retrieval for the target documents. To realize this objective, we conquer two technical difficulties. First, we design a data synthesis workflow, creating training signals for O1 Embedder by generating initial thoughts from an LLM-expert and subsequently refining them using a retrieval committee. Second, we optimize the training process, enabling a pre-trained model to be jointly fine-tuned to generate retrieval thoughts via behavior cloning and perform dense retrieval through contrastive learning. Our approach is evaluated by comprehensive experiments, where substantial improvements are achieved across 12 popular datasets, spanning both in-domain and out-of-domain scenarios. These results highlight O1 Embedder's remarkable accuracy and generalizability, paving the way for the development of next-generation IR foundation models.
中文: 大型语言模型提升了信息检索与生成能力,催生了O1 Embedder方法,该方法通过在检索前生成初步思考来优化准确性及泛化性,并在多数据集实验中展现出显著改进效果。
English: Large language models have advanced information retrieval and generation, leading to the development of O1 Embedder, which enhances retrieval accuracy and generalizability by generating preliminary thoughts before document retrieval, as demonstrated by significant improvements across diverse datasets.

Authors:Wenhui Lei, Hanyu Chen, Zitian Zhang, Luyang Luo, Qiong Xiao, Yannian Gu, Peng Gao, Yankai Jiang, Ci Wang, Guangtao Wu, Tongjia Xu, Yingjie Zhang, Xiaofan Zhang, Pranav Rajpurkar, Shaoting Zhang, Zhenning Wang
Title: A Data-Efficient Pan-Tumor Foundation Model for Oncology CT Interpretation
Abstract:
Artificial intelligence-assisted imaging analysis has made substantial strides in tumor diagnosis and management. Here we present PASTA, a pan-tumor CT foundation model that achieves state-of-the-art performance on 45 of 46 representative oncology tasks -- including lesion segmentation, tumor detection in plain CT, tumor staging, survival prediction, structured report generation, and cross-modality transfer learning, significantly outperforming the second-best models on 35 tasks. This remarkable advancement is driven by our development of PASTA-Gen, an innovative synthetic tumor generation framework that produces a comprehensive dataset of 30,000 CT scans with pixel-level annotated lesions and paired structured reports, encompassing malignancies across ten organs and five benign lesion types. By leveraging this rich, high-quality synthetic data, we overcome a longstanding bottleneck in the development of CT foundation models -- specifically, the scarcity of publicly available, high-quality annotated datasets due to privacy constraints and the substantial labor required for scaling precise data annotation. Encouragingly, PASTA demonstrates exceptional data efficiency with promising practical value, markedly improving performance on various tasks with only a small amount of real-world data. The open release of both the synthetic dataset and PASTA foundation model effectively addresses the challenge of data scarcity, thereby advancing oncological research and clinical translation.
中文: PASTA是一种先进的泛肿瘤CT基础模型,在45项肿瘤学任务中表现卓越,其核心是创新的PASTA-Gen合成数据框架,通过生成3万份带标注的CT扫描解决了数据稀缺问题,模型和数据集均已开源以推动癌症研究发展。
English: PASTA is a state-of-the-art pan-tumor CT foundation model that excels in 45 oncology tasks, driven by the innovative PASTA-Gen synthetic data framework which overcomes data scarcity by generating 30,000 annotated CT scans, with both the model and dataset openly released to advance cancer research.

Authors:Jiayi Luo, Qingyun Sun, Haonan Yuan, Xingcheng Fu, Jianxin Li
Title: Robust Graph Learning Against Adversarial Evasion Attacks via Prior-Free Diffusion-Based Structure Purification
Abstract:
Adversarial evasion attacks pose significant threats to graph learning, with lines of studies that have improved the robustness of Graph Neural Networks (GNNs). However, existing works rely on priors about clean graphs or attacking strategies, which are often heuristic and inconsistent. To achieve robust graph learning over different types of evasion attacks and diverse datasets, we investigate this problem from a prior-free structure purification perspective. Specifically, we propose a novel Diffusion-based Structure Purification framework named DiffSP, which creatively incorporates the graph diffusion model to learn intrinsic distributions of clean graphs and purify the perturbed structures by removing adversaries under the direction of the captured predictive patterns without relying on priors. DiffSP is divided into the forward diffusion process and the reverse denoising process, during which structure purification is achieved. To avoid valuable information loss during the forward process, we propose an LID-driven nonisotropic diffusion mechanism to selectively inject noise anisotropically. To promote semantic alignment between the clean graph and the purified graph generated during the reverse process, we reduce the generation uncertainty by the proposed graph transfer entropy guided denoising mechanism. Extensive experiments demonstrate the superior robustness of DiffSP against evasion attacks.
中文摘要:提出的DiffSP框架采用基于扩散的结构净化方法,通过各向异性噪声注入和熵引导去噪机制,在不依赖先验知识的情况下增强图神经网络对规避攻击的鲁棒性,同时保持图结构的完整性。
English Summary: The proposed DiffSP framework utilizes a diffusion-based structure purification approach to enhance Graph Neural Networks' robustness against evasion attacks without relying on prior knowledge, employing anisotropic noise injection and entropy-guided denoising to maintain graph integrity.

Authors:Qijun Gan, Yi Ren, Chen Zhang, Zhenhui Ye, Pan Xie, Xiang Yin, Zehuan Yuan, Bingyue Peng, Jianke Zhu
Title: HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation
Abstract:
Human motion video generation has advanced significantly, while existing methods still struggle with accurately rendering detailed body parts like hands and faces, especially in long sequences and intricate motions. Current approaches also rely on fixed resolution and struggle to maintain visual consistency. To address these limitations, we propose HumanDiT, a pose-guided Diffusion Transformer (DiT)-based framework trained on a large and wild dataset containing 14,000 hours of high-quality video to produce high-fidelity videos with fine-grained body rendering. Specifically, (i) HumanDiT, built on DiT, supports numerous video resolutions and variable sequence lengths, facilitating learning for long-sequence video generation; (ii) we introduce a prefix-latent reference strategy to maintain personalized characteristics across extended sequences. Furthermore, during inference, HumanDiT leverages Keypoint-DiT to generate subsequent pose sequences, facilitating video continuation from static images or existing videos. It also utilizes a Pose Adapter to enable pose transfer with given sequences. Extensive experiments demonstrate its superior performance in generating long-form, pose-accurate videos across diverse scenarios.
中文:HumanDiT是一种基于姿态引导扩散Transformer的框架,通过支持多种分辨率和采用前缀潜在参考策略,有效解决了长视频序列中身体细节渲染和视觉一致性的问题,并在14,000小时视频数据上训练实现高质量生成。
English: HumanDiT is a pose-guided Diffusion Transformer framework that overcomes limitations in rendering detailed body parts and maintaining visual consistency in long video sequences by supporting multiple resolutions and using a prefix-latent reference strategy, trained on 14,000 hours of video data.

Authors:Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Alessandro Santirocchi, Roberto Atzeni, Matteo Cinelli, Vincenzo Cestari, Clelia Rossi-Arnaud, Walter Quattrociocchi
Title: Decoding AI Judgment: How LLMs Assess News Credibility and Bias
Abstract:
Large Language Models (LLMs) are increasingly embedded in workflows that involve evaluative processes. This raises the need to examine how such evaluations are built, what assumptions they rely on, and how their strategies diverge from those of humans. We benchmark six LLMs against expert ratings--NewsGuard and Media Bias/Fact Check (MBFC)--and against human judgments collected through a controlled experiment. To enable direct comparison, we implement a structured agentic framework in which both models and non-expert participants follow the same evaluation procedure: selecting criteria, retrieving content, and producing justifications. Despite output alignment, LLMs rely on different mechanisms: lexical associations and statistical priors replace contextual reasoning. This reliance produces systematic effects: political asymmetries, opaque justifications, and a tendency to confuse linguistic form with epistemic validity. Delegating judgment to such systems does not merely automate evaluation--it redefines it, shifting from normative reasoning to pattern-based approximation.
中文摘要:大型语言模型越来越多地应用于评估流程,但其依赖词汇联想和统计模式而非情境推理,导致系统性偏见,使评估从规范性判断转向基于模式的近似处理。
English Summary: Large Language Models (LLMs) are increasingly used in evaluative workflows, but they rely on lexical associations and statistical patterns instead of contextual reasoning, leading to systematic biases and a shift from normative judgment to pattern-based approximation.

Authors:Juraj Vladika, Stephen Meisenbacher, Florian Matthes
Title: Lexical Substitution is not Synonym Substitution: On the Importance of Producing Contextually Relevant Word Substitutes
Abstract:
Lexical Substitution is the task of replacing a single word in a sentence with a similar one. This should ideally be one that is not necessarily only synonymous, but also fits well into the surrounding context of the target word, while preserving the sentence's grammatical structure. Recent advances in Lexical Substitution have leveraged the masked token prediction task of Pre-trained Language Models to generate replacements for a given word in a sentence. With this technique, we introduce ConCat, a simple augmented approach which utilizes the original sentence to bolster contextual information sent to the model. Compared to existing approaches, it proves to be very effective in guiding the model to make contextually relevant predictions for the target word. Our study includes a quantitative evaluation, measured via sentence similarity and task performance. In addition, we conduct a qualitative human analysis to validate that users prefer the substitutions proposed by our method, as opposed to previous methods. Finally, we test our approach on the prevailing benchmark for Lexical Substitution, CoInCo, revealing potential pitfalls of the benchmark. These insights serve as the foundation for a critical discussion on the way in which Lexical Substitution is evaluated.
中文: 本研究提出ConCat方法,通过增强预训练语言模型对原句上下文信息的利用,在词汇替换任务中生成更符合语境的替代词,在定量评估和人工偏好方面均优于现有方法,并对当前评估基准提出了批判性反思。
English: The study introduces ConCat, an enhanced lexical substitution method that leverages pre-trained language models and original sentence context to generate more contextually appropriate word replacements, outperforming existing approaches in both quantitative evaluations and human preference while also critiquing current evaluation benchmarks.

Authors:Lei Zhao, Linfeng Feng, Dongxu Ge, Rujin Chen, Fangqiu Yi, Chi Zhang, Xiao-Lei Zhang, Xuelong Li
Title: UniForm: A Unified Multi-Task Diffusion Transformer for Audio-Video Generation
Abstract:
With the rise of diffusion models, audio-video generation has been revolutionized. However, most existing methods rely on separate modules for each modality, with limited exploration of unified generative architectures. In addition, many are confined to a single task and small-scale datasets. To overcome these limitations, we introduce UniForm, a unified multi-task diffusion transformer that generates both audio and visual modalities in a shared latent space. By using a unified denoising network, UniForm captures the inherent correlations between sound and vision. Additionally, we propose task-specific noise schemes and task tokens, enabling the model to support multiple tasks with a single set of parameters, including video-to-audio, audio-to-video and text-to-audio-video generation. Furthermore, by leveraging large language models and a large-scale text-audio-video combined dataset, UniForm achieves greater generative diversity than prior approaches. Experiments show that UniForm achieves performance close to the state-of-the-art single-task models across three generation tasks, with generated content that is not only highly aligned with real-world data distributions but also enables more diverse and fine-grained generation.
UniForm is a unified multi-task diffusion transformer that generates audio and visual content in a shared latent space, achieving state-of-the-art performance across multiple generation tasks while enabling more diverse and fine-grained outputs.
English Summary:

Authors:Xuan Li, Chang Yu, Wenxin Du, Ying Jiang, Tianyi Xie, Yunuo Chen, Yin Yang, Chenfanfu Jiang
Title: Dress-1-to-3: Single Image to Simulation-Ready 3D Outfit with Diffusion Prior and Differentiable Physics
Abstract:
Recent advances in large models have significantly advanced image-to-3D reconstruction. However, the generated models are often fused into a single piece, limiting their applicability in downstream tasks. This paper focuses on 3D garment generation, a key area for applications like virtual try-on with dynamic garment animations, which require garments to be separable and simulation-ready. We introduce Dress-1-to-3, a novel pipeline that reconstructs physics-plausible, simulation-ready separated garments with sewing patterns and humans from an in-the-wild image. Starting with the image, our approach combines a pre-trained image-to-sewing pattern generation model for creating coarse sewing patterns with a pre-trained multi-view diffusion model to produce multi-view images. The sewing pattern is further refined using a differentiable garment simulator based on the generated multi-view images. Versatile experiments demonstrate that our optimization approach substantially enhances the geometric alignment of the reconstructed 3D garments and humans with the input image. Furthermore, by integrating a texture generation module and a human motion generation module, we produce customized physics-plausible and realistic dynamic garment demonstrations. Project page: https://dress-1-to-3.github.io/
中文: 本文提出Dress-1-to-3创新流程,通过单张图像重建可分离、具备缝纫图案且适用于仿真的3D服装,显著提升了几何对齐效果,并结合纹理与运动生成模块实现了逼真的动态服装演示。
English: This paper introduces Dress-1-to-3, a novel pipeline that reconstructs separable, simulation-ready 3D garments with sewing patterns from a single image, significantly improving geometric alignment and enabling realistic dynamic demonstrations through integrated texture and motion generation.

Authors:Xueqing Deng, Qihang Yu, Ali Athar, Chenglin Yang, Linjie Yang, Xiaojie Jin, Xiaohui Shen, Liang-Chieh Chen
Title: COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation
Abstract:
This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.
Chinese: 本文介绍了COCONut-PanCap数据集,它通过提供与全景分割掩码关联的细粒度区域级标注,显著提升了全景分割和基于图像的描述任务性能,为视觉语言模型的理解和生成能力设立了新基准。
English: This paper presents the COCONut-PanCap dataset, which enhances panoptic segmentation and grounded image captioning by providing fine-grained, region-level captions tied to panoptic masks, thereby improving vision-language model performance in both understanding and generation tasks.

Authors:Yuhao Qing, Guichao Zhu, Fanxin Li, Lintian Lei, Zekai Sun, Xiuxian Guan, Shixiong Zhao, Xusheng Chen, Dong Huang, Sen Wang, Heming Cui
Title: Hecate: Unlocking Efficient Sparse Model Training via Fully Sharded Sparse Data Parallelism
Abstract:
Mixture-of-Experts (MoE) has emerged as a promising sparse paradigm for scaling up pre-trained models (PTMs) with remarkable cost-effectiveness. However, the dynamic nature of MoE leads to rapid fluctuations and imbalances in expert loads during training, resulting in significant straggler effects that hinder training performance when using expert parallelism (EP). Existing MoE training systems attempt to mitigate these effects through expert rearrangement strategies, but they face challenges in terms of memory efficiency and timeliness of rearrangement. This paper proposes Fully Sharded Sparse Data Parallelism (FSSDP), an innovative approach that tackles the parallelization of MoE layers and potential straggler effects caused by imbalanced expert loads from a new perspective. FSSDP fully shards the parameters and optimizer states of MoE layers across devices and sparsely materializes MoE parameters from scratch in each iteration with two sparse collectives SparseAllGather and SparseReduceScatter. We build Hecate, a high-performance MoE training system that incorporates FSSDP to fully unlock its potential. Hecate introduces heterogeneous sharding, sparse materialization, and re-materialization techniques to construct flexible and efficient expert placements with low memory and communication overhead. Our evaluation reveals that Hecate achieves up to 3.54x speedup compared over state-of-the-art MoE training systems and consistently demonstrates improvements across model architectures and hardware environments.
中文: 本文提出全分片稀疏数据并行方法,通过跨设备分片专家混合层参数并采用稀疏集合操作来解决训练中的负载不均衡问题,其实现的Hecate系统相比现有方法最高可获得3.54倍加速效果。
English: This paper introduces Fully Sharded Sparse Data Parallelism (FSSDP), a novel approach that addresses the straggler effects in Mixture-of-Experts training by sharding parameters across devices and employing sparse collectives, implemented in the Hecate system which achieves significant speed improvements.

Authors:Onur Celik, Zechu Li, Denis Blessing, Ge Li, Daniel Palenicek, Jan Peters, Georgia Chalvatzaki, Gerhard Neumann
Title: DIME:Diffusion-Based Maximum Entropy Reinforcement Learning
Abstract:
Maximum entropy reinforcement learning (MaxEnt-RL) has become the standard approach to RL due to its beneficial exploration properties. Traditionally, policies are parameterized using Gaussian distributions, which significantly limits their representational capacity. Diffusion-based policies offer a more expressive alternative, yet integrating them into MaxEnt-RL poses challenges-primarily due to the intractability of computing their marginal entropy. To overcome this, we propose Diffusion-Based Maximum Entropy RL (DIME). \emph{DIME} leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Additionally, we propose a policy iteration scheme that provably converges to the optimal diffusion policy. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of MaxEnt-RL, significantly outperforming other diffusion-based methods on challenging high-dimensional control benchmarks. It is also competitive with state-of-the-art non-diffusion based RL methods while requiring fewer algorithmic design choices and smaller update-to-data ratios, reducing computational complexity.
中文: 本文提出DIME方法,通过推导最大熵目标的下界并设计可证明收敛的策略迭代方案,解决了扩散策略中边缘熵计算难题,在高维控制任务中显著优于现有扩散方法,同时降低计算复杂度。
English: The paper introduces DIME, a diffusion-based maximum entropy reinforcement learning method that overcomes the intractability of computing marginal entropy in diffusion policies by deriving a lower bound on the maximum entropy objective and proposing a provably convergent policy iteration scheme, achieving superior performance on high-dimensional control tasks while reducing computational complexity.

Authors:Javier Rando, Jie Zhang, Nicholas Carlini, Florian Tramèr
Title: Adversarial ML Problems Are Getting Harder to Solve and to Evaluate
Abstract:
In the past decade, considerable research effort has been devoted to securing machine learning (ML) models that operate in adversarial settings. Yet, progress has been slow even for simple "toy" problems (e.g., robustness to small adversarial perturbations) and is often hindered by non-rigorous evaluations. Today, adversarial ML research has shifted towards studying larger, general-purpose language models. In this position paper, we argue that the situation is now even worse: in the era of LLMs, the field of adversarial ML studies problems that are (1) less clearly defined, (2) harder to solve, and (3) even more challenging to evaluate. As a result, we caution that yet another decade of work on adversarial ML may fail to produce meaningful progress.
中文: 过去十年,对抗性机器学习研究进展缓慢且评估不严谨,随着转向大型语言模型,该领域面临的问题更模糊、评估更困难,可能再十年也难以取得实质性进展。
English: Over the past decade, adversarial machine learning research has struggled with slow progress and non-rigorous evaluations, and with the shift to large language models, the field now faces even more ill-defined problems and evaluation challenges, risking another decade of minimal meaningful advancement.

Authors:Haibo Tong, Zhaoyang Wang, Zhaorun Chen, Haonian Ji, Shi Qiu, Siwei Han, Kexin Geng, Zhongkai Xue, Yiyang Zhou, Peng Xia, Mingyu Ding, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
Title: MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation
Abstract:
Recent advancements in video generation have significantly improved the ability to synthesize videos from text instructions. However, existing models still struggle with key challenges such as instruction misalignment, content hallucination, safety concerns, and bias. Addressing these limitations, we introduce MJ-BENCH-VIDEO, a large-scale video preference benchmark designed to evaluate video generation across five critical aspects: Alignment, Safety, Fineness, Coherence & Consistency, and Bias & Fairness. This benchmark incorporates 28 fine-grained criteria to provide a comprehensive evaluation of video preference. Building upon this dataset, we propose MJ-VIDEO, a Mixture-of-Experts (MoE)-based video reward model designed to deliver fine-grained reward. MJ-VIDEO can dynamically select relevant experts to accurately judge the preference based on the input text-video pair. This architecture enables more precise and adaptable preference judgments. Through extensive benchmarking on MJ-BENCH-VIDEO, we analyze the limitations of existing video reward models and demonstrate the superior performance of MJ-VIDEO in video preference assessment, achieving 17.58% and 15.87% improvements in overall and fine-grained preference judgments, respectively. Additionally, introducing MJ-VIDEO for preference tuning in video generation enhances the alignment performance. All our code, data, and models are available at https://aiming-lab.github.io/MJ-VIDEO.github.io/.
中文: 研究者提出了MJ-BENCH-VIDEO视频偏好基准和基于专家混合的MJ-VIDEO奖励模型,显著提升了视频偏好评估与生成对齐的性能。
English: The authors introduce MJ-BENCH-VIDEO, a comprehensive video preference benchmark, and MJ-VIDEO, a Mixture-of-Experts-based reward model that significantly improves video preference assessment and generation alignment.

Authors:Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
Title: LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient
Abstract:
The rapid advancement of large language models (LLMs) has led to a surge in both model supply and application demands. To facilitate effective matching between them, reliable, generic and efficient benchmark generators are widely needed. However, human annotators are constrained by inefficiency, and current LLM benchmark generators not only lack generalizability but also struggle with limited reliability, as they lack a comprehensive evaluation framework for validation and optimization. To fill this gap, we first propose an automated and unbiased evaluation framework, structured around four dimensions and ten criteria. Under this framework, we carefully analyze the advantages and weaknesses of directly prompting LLMs as generic benchmark generators. To enhance the reliability, we introduce a series of methods to address the identified weaknesses and integrate them as BenchMaker. Experiments across multiple LLMs and tasks confirm that BenchMaker achieves superior or comparable performance to human-annotated benchmarks on all metrics, highlighting its generalizability and reliability. More importantly, it delivers highly consistent evaluation results across 12 LLMs (0.967 Pearson correlation against MMLU-Pro), while taking only $0.005 and 0.38 minutes per sample.
Chinese: 本研究提出BenchMaker自动化基准生成器,通过采用全面的评估框架克服了人工标注和现有基于大语言模型的生成器的局限性,在多项任务和模型中展现出卓越的效率、可靠性和泛化能力。
English: This study introduces BenchMaker, an automated benchmark generator that overcomes the limitations of human annotation and existing LLM-based generators by employing a comprehensive evaluation framework, demonstrating superior efficiency, reliability, and generalizability across multiple tasks and models.

Authors:Maike Scherer, Lukas Brand, Louis Wolf, Teena tom Dieck, Maximilian Schäfer, Sebastian Lotter, Andreas Burkovski, Heinrich Sticht, Robert Schober, Kathrin Castiglione
Title: Closed-Loop Long-Term Experimental Molecular Communication System
Abstract:
We present a fluid-based experimental molecular communication (MC) testbed which uses media modulation. Motivated by the natural human cardiovascular system, the testbed operates in a closed-loop tube system. The proposed system is designed to be biocompatible, resource-efficient, and controllable from outside the tube. As signaling molecule, the testbed employs the green fluorescent protein variant "Dreiklang" (GFPD). GFPDs can be reversibly switched via light of different wavelengths between a bright fluorescent state and a less fluorescent state. GFPDs in solution are filled into the testbed prior to the start of information transmission and remain there for an entire experiment. For information transmission, an optical transmitter (TX) and an optical eraser (EX), which are located outside the tube, are used to write and erase the information encoded in the state of the GFPDs, respectively. At the receiver (RX), the state of the GFPDs is read out by fluorescence detection. In our testbed, due to the closed-loop setup, we observe new forms of inter-symbol interferences (ISI), which do not occur in short experiments and open-loop systems. For the testbed, we developed a communication scheme, which includes blind transmission start detection, symbol-by-symbol synchronization, and adaptive threshold detection. We comprehensively analyze our MC experiments using different performance metrics. Moreover, we experimentally demonstrate the error-free transmission of 5370 bit at a data rate of 36 $\textrm{bit}\, \textrm{min}^{\boldsymbol{-1}}$ using 8-ary modulation and the error-free binary transmission of around 90000 bit at a data rate of 12 $\textrm{bit}\, \textrm{min}^{\boldsymbol{-1}}$. For the latter experiment, data was transmitted for a period of 125 hours. All signals recorded and parts of the evaluation code are publicly available on Zenodo and Github, respectively.
中文: 本研究提出了一种基于流体的生物兼容分子通信测试平台,利用光控荧光蛋白进行信息编码,在闭环系统中实现了长时间无差错数据传输,并解决了新型符号间干扰问题。
English: This study introduces a biocompatible, fluid-based molecular communication testbed using light-controlled fluorescent proteins for information encoding, achieving error-free data transmission over extended periods with novel interference management in a closed-loop system.

Authors:Xuyin Qi, Zeyu Zhang, Huazhan Zheng, Mingxi Chen, Numan Kutaiba, Ruth Lim, Cherie Chiang, Zi En Tham, Xuan Ren, Wenxin Zhang, Lei Zhang, Hao Zhang, Wenbing Lv, Guangzhen Yao, Renda Han, Kangsheng Wang, Mingyuan Li, Hongtao Mao, Yu Li, Zhibin Liao, Yang Zhao, Minh-Son To
Title: MedConv: Convolutions Beat Transformers on Long-Tailed Bone Density Prediction
Abstract:
Bone density prediction via CT scans to estimate T-scores is crucial, providing a more precise assessment of bone health compared to traditional methods like X-ray bone density tests, which lack spatial resolution and the ability to detect localized changes. However, CT-based prediction faces two major challenges: the high computational complexity of transformer-based architectures, which limits their deployment in portable and clinical settings, and the imbalanced, long-tailed distribution of real-world hospital data that skews predictions. To address these issues, we introduce MedConv, a convolutional model for bone density prediction that outperforms transformer models with lower computational demands. We also adapt Bal-CE loss and post-hoc logit adjustment to improve class balance. Extensive experiments on our AustinSpine dataset shows that our approach achieves up to 21% improvement in accuracy and 20% in ROC AUC over previous state-of-the-art methods.
中文:MedConv卷积模型通过解决计算复杂性和数据不平衡问题,在骨密度预测中显著提升了准确率和ROC AUC指标,优于现有最先进方法。
English: MedConv, a convolutional model for bone density prediction, overcomes the computational limitations of transformers and data imbalance issues, achieving significant improvements in accuracy and ROC AUC on the AustinSpine dataset.

Authors:Jocelyn Shen, Audrey Lee, Sharifa Alghowinem, River Adkins, Cynthia Breazeal, Hae Won Park
Title: Social Robots as Social Proxies for Fostering Connection and Empathy Towards Humanity
Abstract:
Despite living in an increasingly connected world, social isolation is a prevalent issue today. While social robots have been explored as tools to enhance social connection through companionship, their potential as asynchronous social platforms for fostering connection towards humanity has received less attention. In this work, we introduce the design of a social support companion that facilitates the exchange of emotionally relevant stories and scaffolds reflection to enhance feelings of connection via five design dimensions. We investigate how social robots can serve as "social proxies" facilitating human stories, passing stories from other human narrators to the user. To this end, we conduct a real-world deployment of 40 robot stations in users' homes over the course of two weeks. Through thematic analysis of user interviews, we find that social proxy robots can foster connection towards other people's experiences via mechanisms such as identifying connections across stories or offering diverse perspectives. We present design guidelines from our study insights on the use of social robot systems that serve as social platforms to enhance human empathy and connection.
Chinese: 本研究探讨了社交机器人作为异步平台分享情感故事的作用,发现通过充当促进故事交流和反思的社交代理,它们能够增强人类的同理心和连接感。
English: This study explores social robots as asynchronous platforms for sharing emotional stories, finding they can enhance human empathy and connection by serving as social proxies that facilitate story exchange and reflection.

Authors:Nesryne Mejri, Enjie Ghorbel, Anis Kacem, Pavel Chernakov, Niki Foteinopoulou, Djamila Aouada
Title: When Unsupervised Domain Adaptation meets One-class Anomaly Detection: Addressing the Two-fold Unsupervised Curse by Leveraging Anomaly Scarcity
Abstract:
This paper introduces the first fully unsupervised domain adaptation (UDA) framework for unsupervised anomaly detection (UAD). The performance of UAD techniques degrades significantly in the presence of a domain shift, difficult to avoid in a real-world setting. While UDA has contributed to solving this issue in binary and multi-class classification, such a strategy is ill-posed in UAD. This might be explained by the unsupervised nature of the two tasks, namely, domain adaptation and anomaly detection. Herein, we first formulate this problem that we call the two-fold unsupervised curse. Then, we propose a pioneering solution to this curse, considered intractable so far, by assuming that anomalies are rare. Specifically, we leverage clustering techniques to identify a dominant cluster in the target feature space. Posed as the normal cluster, the latter is aligned with the source normal features. Concretely, given a one-class source set and an unlabeled target set composed mostly of normal data and some anomalies, we fit the source features within a hypersphere while jointly aligning them with the features of the dominant cluster from the target set. The paper provides extensive experiments and analysis on common adaptation benchmarks for anomaly detection, demonstrating the relevance of both the newly introduced paradigm and the proposed approach. The code will be made publicly available.
中文: 本文提出了首个完全无监督的异常检测领域自适应框架,通过假设异常稀少并利用聚类技术识别目标域中的主导簇,将其与源域正常特征对齐,从而解决领域偏移导致的性能下降问题。
English: This paper presents the first fully unsupervised domain adaptation framework for anomaly detection, addressing performance degradation from domain shifts by aligning source normal features with the target's dominant cluster under the assumption that anomalies are rare.

Authors:David Isele, Alexandre Miranda Anon, Faizan M. Tariq, Goro Yeh, Avinash Singh, Sangjae Bae
Title: Delayed-Decision Motion Planning in the Presence of Multiple Predictions
Abstract:
Reliable automated driving technology is challenged by various sources of uncertainties, in particular, behavioral uncertainties of traffic agents. It is common for traffic agents to have intentions that are unknown to others, leaving an automated driving car to reason over multiple possible behaviors. This paper formalizes a behavior planning scheme in the presence of multiple possible futures with corresponding probabilities. We present a maximum entropy formulation and show how, under certain assumptions, this allows delayed decision-making to improve safety. The general formulation is then turned into a model predictive control formulation, which is solved as a quadratic program or a set of quadratic programs. We discuss implementation details for improving computation and verify operation in simulation and on a mobile robot.
中文: 本文提出了一种针对自动驾驶车辆的行为规划方案,通过采用最大熵方法和模型预测控制来处理交通参与者意图的不确定性,从而通过延迟决策来提高安全性。
English: This paper introduces a behavior planning scheme for automated vehicles that addresses uncertainties in traffic agent intentions by employing a maximum entropy approach and model predictive control to enhance safety through delayed decision-making.

Authors:Yujie Feng, Liming Zhan, Zexin Lu, Yongxin Xu, Xu Chu, Yasha Wang, Jiannong Cao, Philip S. Yu, Xiao-Ming Wu
Title: GeoEdit: Geometric Knowledge Editing for Large Language Models
Abstract:
Regular updates are essential for maintaining up-to-date knowledge in large language models (LLMs). Consequently, various model editing methods have been developed to update specific knowledge within LLMs. However, training-based approaches often struggle to effectively incorporate new knowledge while preserving unrelated general knowledge. To address this challenge, we propose a novel framework called Geometric Knowledge Editing (GeoEdit). GeoEdit utilizes the geometric relationships of parameter updates from fine-tuning to differentiate between neurons associated with new knowledge updates and those related to general knowledge perturbations. By employing a direction-aware knowledge identification method, we avoid updating neurons with directions approximately orthogonal to existing knowledge, thus preserving the model's generalization ability. For the remaining neurons, we integrate both old and new knowledge for aligned directions and apply a "forget-then-learn" editing strategy for opposite directions. Additionally, we introduce an importance-guided task vector fusion technique that filters out redundant information and provides adaptive neuron-level weighting, further enhancing model editing performance. Extensive experiments on two publicly available datasets demonstrate the superiority of GeoEdit over existing state-of-the-art methods.
中文摘要:提出的几何知识编辑(GeoEdit)框架通过基于几何关系识别和选择性编辑神经元,有效更新大型语言模型中的特定知识,在保留通用知识的同时,通过自适应融合技术提升模型性能。
English Summary: The proposed Geometric Knowledge Editing (GeoEdit) framework effectively updates specific knowledge in large language models by identifying and selectively editing neurons based on geometric relationships, preserving general knowledge while enhancing performance through adaptive fusion techniques.

Authors:Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang
Title: Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in Product QA Agents
Abstract:
The deployment of Large Language Models (LLMs) in customer support is constrained by hallucination (generating false information) and the high cost of proprietary models. To address these challenges, we propose a retrieval-augmented question-answering (QA) pipeline and explore how to balance human input and automation. Using a dataset of questions about a Samsung Smart TV user manual, we demonstrate that synthetic data generated by LLMs outperforms crowdsourced data in reducing hallucination in finetuned models. We also compare self-training (fine-tuning models on their own outputs) and knowledge distillation (fine-tuning on stronger models' outputs, e.g., GPT-4o), and find that self-training achieves comparable hallucination reduction. We conjecture that this surprising finding can be attributed to increased exposure bias issues in the knowledge distillation case and support this conjecture with post hoc analysis. We also improve robustness to unanswerable questions and retrieval failures with contextualized "I don't know" responses. These findings show that scalable, cost-efficient QA systems can be built using synthetic data and self-training with open-source models, reducing reliance on proprietary tools or costly human annotations.
中文: 研究表明,利用大语言模型生成的合成数据和开源模型的自训练能有效减少幻觉并构建经济高效的问答系统,同时增强对无法回答问题的鲁棒性。
English: This study demonstrates that using synthetic data from LLMs and self-training with open-source models can effectively reduce hallucination and build cost-efficient QA systems, while also improving robustness to unanswerable questions.

Authors:Zhe Wang, Shaocong Xu, Xucai Zhuang, Tongda Xu, Yan Wang, Jingjing Liu, Yilun Chen, Ya-Qin Zhang
Title: CoopDETR: A Unified Cooperative Perception Framework for 3D Detection via Object Query
Abstract:
Cooperative perception enhances the individual perception capabilities of autonomous vehicles (AVs) by providing a comprehensive view of the environment. However, balancing perception performance and transmission costs remains a significant challenge. Current approaches that transmit region-level features across agents are limited in interpretability and demand substantial bandwidth, making them unsuitable for practical applications. In this work, we propose CoopDETR, a novel cooperative perception framework that introduces object-level feature cooperation via object query. Our framework consists of two key modules: single-agent query generation, which efficiently encodes raw sensor data into object queries, reducing transmission cost while preserving essential information for detection; and cross-agent query fusion, which includes Spatial Query Matching (SQM) and Object Query Aggregation (OQA) to enable effective interaction between queries. Our experiments on the OPV2V and V2XSet datasets demonstrate that CoopDETR achieves state-of-the-art performance and significantly reduces transmission costs to 1/782 of previous methods.
中文:CoopDETR提出了一种基于对象查询的物体级协同感知框架,在将传输成本降至先前方法1/782的同时,显著提升了检测性能。
English: CoopDETR introduces an object-level cooperative perception framework using object queries to enhance detection performance while drastically cutting transmission costs to 1/782 of prior methods.

Authors:Orion Weller, Kathryn Ricci, Eugene Yang, Andrew Yates, Dawn Lawrie, Benjamin Van Durme
Title: Rank1: Test-Time Compute for Reranking in Information Retrieval
Abstract:
We introduce Rank1, the first reranking model trained to take advantage of test-time compute. Rank1 demonstrates the applicability within retrieval of using a reasoning language model (i.e. OpenAI's o1, Deepseek's R1, etc.) for distillation in order to rapidly improve the performance of a smaller model. We gather and open-source a dataset of more than 600,000 examples of R1 reasoning traces from queries and passages in MS MARCO. Models trained on this dataset show: (1) state-of-the-art performance on advanced reasoning and instruction following datasets; (2) work remarkably well out of distribution due to the ability to respond to user-input prompts; and (3) have explainable reasoning chains that can be given to users or RAG-based systems. Further, we demonstrate that quantized versions of these models retain strong performance while using less compute/memory. Overall, Rank1 shows that test-time compute allows for a fundamentally new type of explainable and performant reranker model for search.
中文摘要:Rank1是首个利用测试时计算的重排序模型,通过推理语言模型进行蒸馏以提升小模型性能,在先进推理任务中表现卓越,具有可解释的推理链和高效的量化版本。
English Summary: Rank1 is the first reranking model leveraging test-time compute, using reasoning language models for distillation to enhance smaller models' performance, achieving state-of-the-art results with explainable reasoning and efficient quantized versions.

Authors:Zihao Lin, Samyadeep Basu, Mohammad Beigi, Varun Manjunatha, Ryan A. Rossi, Zichao Wang, Yufan Zhou, Sriram Balasubramanian, Arman Zarei, Keivan Rezaei, Ying Shen, Barry Menglong Yao, Zhiyang Xu, Qin Liu, Yuxiang Zhang, Yan Sun, Shilong Liu, Li Shen, Hongxuan Li, Soheil Feizi, Lifu Huang
Title: A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models
Abstract:
The rise of foundation models has transformed machine learning research, prompting efforts to uncover their inner workings and develop more efficient and reliable applications for better control. While significant progress has been made in interpreting Large Language Models (LLMs), multimodal foundation models (MMFMs) - such as contrastive vision-language models, generative vision-language models, and text-to-image models - pose unique interpretability challenges beyond unimodal frameworks. Despite initial studies, a substantial gap remains between the interpretability of LLMs and MMFMs. This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) understanding the mechanistic differences between unimodal language models and crossmodal systems. By systematically reviewing current MMFM analysis techniques, we propose a structured taxonomy of interpretability methods, compare insights across unimodal and multimodal architectures, and highlight critical research gaps.
中文: 本综述探讨了如何将大型语言模型的解释方法应用于多模态基础模型,同时分析单模态与跨模态系统的根本差异,提出了结构化分类法并指出了关键研究空白。
English: This survey examines how interpretability methods from large language models can be adapted for multimodal foundation models while analyzing fundamental differences between unimodal and crossmodal systems, proposing a structured taxonomy and identifying key research gaps.

Authors:Yujie Feng, Xujia Wang, Zexin Lu, Shenghong Fu, Guangyuan Shi, Yongxin Xu, Yasha Wang, Philip S. Yu, Xu Chu, Xiao-Ming Wu
Title: Recurrent Knowledge Identification and Fusion for Language Model Continual Learning
Abstract:
Continual learning (CL) is crucial for deploying large language models (LLMs) in dynamic real-world environments without costly retraining. While recent model ensemble and model merging methods guided by parameter importance have gained popularity, they often struggle to balance knowledge transfer and forgetting, mainly due to the reliance on static importance estimates during sequential training. In this paper, we present Recurrent-KIF, a novel CL framework for Recurrent Knowledge Identification and Fusion, which enables dynamic estimation of parameter importance distributions to enhance knowledge transfer. Inspired by human continual learning, Recurrent-KIF employs an inner loop that rapidly adapts to new tasks while identifying important parameters, coupled with an outer loop that globally manages the fusion of new and historical knowledge through redundant knowledge pruning and key knowledge merging. These inner-outer loops iteratively perform multiple rounds of fusion, allowing Recurrent-KIF to leverage intermediate training information and adaptively adjust fusion strategies based on evolving importance distributions. Extensive experiments on two CL benchmarks with various model sizes (from 770M to 13B) demonstrate that Recurrent-KIF effectively mitigates catastrophic forgetting and enhances knowledge transfer.
中文:Recurrent-KIF提出了一种新颖的持续学习框架,通过内外循环交互动态评估参数重要性,在不同规模模型中有效平衡知识迁移并显著缓解灾难性遗忘问题。
English: Recurrent-KIF introduces a novel continual learning framework that dynamically estimates parameter importance through inner-outer loop interactions, effectively balancing knowledge transfer and mitigating catastrophic forgetting across various model sizes.

Authors:Bálint Tóth, Dominik Senti, Thorir Mar Ingolfsson, Jeffrey Zweidler, Alexandre Elsig, Luca Benini, Yawei Li
Title: Finetuning and Quantization of EEG-Based Foundational BioSignal Models on ECG and PPG Data for Blood Pressure Estimation
Abstract:
Blood pressure (BP) is a key indicator of cardiovascular health. As hypertension remains a global cause of morbidity and mortality, accurate, continuous, and non-invasive BP monitoring is therefore of paramount importance. Photoplethysmography (PPG) and electrocardiography (ECG) can potentially enable continuous BP monitoring, yet training accurate and robust machine learning (ML) models remains challenging due to variability in data quality and patient-specific factors. Recently, multiple research groups explored Electroencephalographic (EEG)--based foundation models and demonstrated their exceptional ability to learn rich temporal resolution. Considering the morphological similarities between different biosignals, the question arises of whether a model pre-trained on one modality can effectively be exploited to improve the accuracy of a different signal type. In this work, we take an initial step towards generalized biosignal foundation models by investigating whether model representations learned from abundant EEG data can effectively be transferred to ECG/PPG data solely with fine-tuning, without the need for large-scale additional pre-training, for the BP estimation task. Evaluations on the MIMIC-III and VitalDB datasets demonstrate that our approach achieves near state-of-the-art accuracy for diastolic BP (mean absolute error of 1.57 mmHg) and surpasses by 1.5x the accuracy of prior works for systolic BP (mean absolute error 2.72 mmHg). Additionally, we perform dynamic INT8 quantization, reducing the smallest model size by over 3.5x (from 13.73 MB down to 3.83 MB) while preserving performance, thereby enabling unobtrusive, real-time BP monitoring on resource-constrained wearable devices.
中文: 本研究探索将预训练的脑电基础模型迁移至心电/光电容积信号进行血压估计,通过模型量化在保持精度的同时实现可穿戴设备上的实时监测,达到了接近最优的准确性。
English: This study explores transferring pre-trained EEG foundation models to ECG/PPG signals for blood pressure estimation, achieving near state-of-the-art accuracy with quantized models that enable real-time monitoring on wearable devices.

Authors:Miaomiao Cai, Guanjie Wang, Wei Li, Zhijun Tu, Hanting Chen, Shaohui Lin, Jie Hu
Title: Autoregressive Image Generation with Vision Full-view Prompt
Abstract:
In autoregressive (AR) image generation, models based on the 'next-token prediction' paradigm of LLMs have shown comparable performance to diffusion models by reducing inductive biases. However, directly applying LLMs to complex image generation can struggle with reconstructing the image's structure and details, impacting the generation's accuracy and stability. Additionally, the 'next-token prediction' paradigm in the AR model does not align with the contextual scanning and logical reasoning processes involved in human visual perception, limiting effective image generation. Prompt engineering, as a key technique for guiding LLMs, leverages specifically designed prompts to improve model performance on complex natural language processing (NLP) tasks, enhancing accuracy and stability of generation while maintaining contextual coherence and logical consistency, similar to human reasoning. Inspired by prompt engineering from the field of NLP, we propose Vision Full-view prompt (VF prompt) to enhance autoregressive image generation. Specifically, we design specialized image-related VF prompts for AR image generation to simulate the process of human image creation. This enhances contextual logic ability by allowing the model to first perceive overall distribution information before generating the image, and improve generation stability by increasing the inference steps. Compared to the AR method without VF prompts, our method shows outstanding performance and achieves an approximate improvement of 20%.
中文摘要:所提出的全景视觉提示通过模拟人类视觉感知过程来增强自回归图像生成,提升了上下文逻辑能力和生成稳定性,相比无提示方法性能提高约20%。
English Summary: The proposed Vision Full-view prompt enhances autoregressive image generation by simulating human visual perception, improving contextual logic and stability with a 20% performance gain over standard methods.

Authors:Haoran Li, Zicheng Zhang, Wang Luo, Congying Han, Jiayu Lv, Tiande Guo, Yudong Hu
Title: Towards Optimal Adversarial Robust Reinforcement Learning with Infinity Measurement Error
Abstract:
Ensuring the robustness of deep reinforcement learning (DRL) agents against adversarial attacks is critical for their trustworthy deployment. Recent research highlights the challenges of achieving state-adversarial robustness and suggests that an optimal robust policy (ORP) does not always exist, complicating the enforcement of strict robustness constraints. In this paper, we further explore the concept of ORP. We first introduce the Intrinsic State-adversarial Markov Decision Process (ISA-MDP), a novel formulation where adversaries cannot fundamentally alter the intrinsic nature of state observations. ISA-MDP, supported by empirical and theoretical evidence, universally characterizes decision-making under state-adversarial paradigms. We rigorously prove that within ISA-MDP, a deterministic and stationary ORP exists, aligning with the Bellman optimal policy. Our findings theoretically reveal that improving DRL robustness does not necessarily compromise performance in natural environments. Furthermore, we demonstrate the necessity of infinity measurement error (IME) in both $Q$-function and probability spaces to achieve ORP, unveiling vulnerabilities of previous DRL algorithms that rely on $1$-measurement errors. Motivated by these insights, we develop the Consistent Adversarial Robust Reinforcement Learning (CAR-RL) framework, which optimizes surrogates of IME. We apply CAR-RL to both value-based and policy-based DRL algorithms, achieving superior performance and validating our theoretical analysis.
中文摘要:本文提出内在状态对抗马尔可夫决策过程(ISA-MDP)框架,证明其能实现确定性最优鲁棒策略且不牺牲性能,并开发了通过优化无限测量误差来超越现有方法的CAR-RL算法。
English Summary: The paper introduces the Intrinsic State-adversarial MDP (ISA-MDP) framework, proving it enables deterministic optimal robust policies without performance trade-offs, and develops the CAR-RL method that outperforms prior approaches by optimizing infinite measurement errors.

Authors:Yuying Tang, Haotian Li, Minghe Lan, Xiaojuan Ma, Huamin Qu
Title: Understanding Screenwriters' Practices, Attitudes, and Future Expectations in Human-AI Co-Creation
Abstract:
With the rise of AI technologies and their growing influence in the screenwriting field, understanding the opportunities and concerns related to AI's role in screenwriting is essential for enhancing human-AI co-creation. Through semi-structured interviews with 23 screenwriters, we explored their creative practices, attitudes, and expectations in collaborating with AI for screenwriting. Based on participants' responses, we identified the key stages in which they commonly integrated AI, including story structure & plot development, screenplay text, goal & idea generation, and dialogue. Then, we examined how different attitudes toward AI integration influence screenwriters' practices across various workflow stages and their broader impact on the industry. Additionally, we categorized their expected assistance using four distinct roles of AI: actor, audience, expert, and executor. Our findings provide insights into AI's impact on screenwriting practices and offer suggestions on how AI can benefit the future of screenwriting.
中文: 本研究通过采访23位编剧,探讨了AI在剧本创作中的整合应用,识别出关键合作阶段及AI作为演员、观众、专家和执行者四种角色,以促进人机协同创作的发展。
English: This study investigates AI's integration in screenwriting through interviews with 23 professionals, identifying key collaboration stages and AI's roles as actor, audience, expert, and executor to enhance human-AI co-creation.

Authors:Yanna Lin, Leni Yang, Haotian Li, Huamin Qu, Dominik Moritz
Title: InterLink: Linking Text with Code and Output in Computational Notebooks
Abstract:
Computational notebooks, widely used for ad-hoc analysis and often shared with others, can be difficult to understand because the standard linear layout is not optimized for reading. In particular, related text, code, and outputs may be spread across the UI making it difficult to draw connections. In response, we introduce InterLink, a plugin designed to present the relationships between text, code, and outputs, thereby making notebooks easier to understand. In a formative study, we identify pain points and derive design requirements for identifying and navigating relationships among various pieces of information within notebooks. Based on these requirements, InterLink features a new layout that separates text from code and outputs into two columns. It uses visual links to signal relationships between text and associated code and outputs and offers interactions for navigating related pieces of information. In a user study with 12 participants, those using InterLink were 13.6% more accurate at finding and integrating information from complex analyses in computational notebooks. These results show the potential of notebook layouts that make them easier to understand.
Chinese: InterLink插件通过双栏布局和视觉链接将相关文本、代码和输出关联起来,解决了计算笔记本难以理解的问题,使用户在查找和整合信息时的准确率提高了13.6%。
English: The InterLink plugin addresses the difficulty in understanding computational notebooks by introducing a two-column layout with visual links to connect related text, code, and outputs, which improved users' accuracy in finding and integrating information by 13.6%.

Authors:Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu
Title: Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders
Abstract:
Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures, and refining their capabilities. Although sparse autoencoders (SAEs) have shown promise for interpreting LLM internal representations, limited research has explored how to better explain SAE features, i.e., understanding the semantic meaning of features learned by SAE. Our theoretical analysis reveals that existing explanation methods suffer from the frequency bias issue, where they emphasize linguistic patterns over semantic concepts, while the latter is more critical to steer LLM behaviors. To address this, we propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective, aiming to better capture the semantic meaning behind these features. We further propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations. Empirical results show that, compared to baselines, our method provides more discourse-level explanations and effectively steers LLM behaviors to defend against jailbreak attacks. These findings highlight the value of explanations for steering LLM behaviors in downstream applications. We will release our code and data once accepted.
中文: 本研究针对大型语言模型中稀疏自编码器特征解释的频率偏差问题,提出了基于固定词汇和互信息目标的方法,能提供更具语义的解释并有效引导模型防御越狱攻击。
English: This study addresses the frequency bias in existing methods for interpreting sparse autoencoder features in large language models by introducing a fixed-vocabulary approach and mutual information-based objective, which yield more semantic explanations and enhance model steering against jailbreak attacks.

Authors:Tianyi Ma, Yiyue Qian, Shinan Zhang, Chuxu Zhang, Yanfang Ye
Title: Adaptive Expansion for Hypergraph Learning
Abstract:
Hypergraph, with its powerful ability to capture higher-order relationships, has gained significant attention recently. Consequently, many hypergraph representation learning methods have emerged to model the complex relationships among hypergraphs. In general, these methods leverage classic expansion methods to convert hypergraphs into weighted or bipartite graphs, and further employ message passing mechanisms to model the complex structures within hypergraphs. However, classical expansion methods are designed in straightforward manners with fixed edge weights, resulting in information loss or redundancy. In light of this, we design a novel clique expansion-based Adaptive Expansion method called AdE to adaptively expand hypergraphs into weighted graphs that preserve the higher-order structure information. Specifically, we introduce a novel Global Simulation Network to select two representative nodes for adaptively symbolizing each hyperedge and connect the rest of the nodes within the same hyperedge to the corresponding selected nodes. Afterward, we design a distance-aware kernel function, dynamically adjusting edge weights to ensure similar nodes within a hyperedge are connected with larger weights. Extensive theoretical justifications and empirical experiments over seven benchmark hypergraph datasets demonstrate that AdE has excellent rationality, generalization, and effectiveness compared to classic expansion models.
中文: 提出的AdE方法通过全局模拟网络和距离感知核函数自适应地将超图扩展为加权图,有效保留高阶结构信息,在理论和实证评估中均优于传统扩展模型。
English: The proposed AdE method adaptively expands hypergraphs into weighted graphs using a Global Simulation Network and a distance-aware kernel function to preserve higher-order structural information, outperforming traditional expansion models in both theoretical and empirical evaluations.

Authors:Hao Huang, Shuaihang Yuan, Yu Hao, Congcong Wen, Yi Fang
Title: A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models
Abstract:
A large-scale vision and language model that has been pretrained on massive data encodes visual and linguistic prior, which makes it easier to generate images and language that are more natural and realistic. Despite this, there is still a significant domain gap between the modalities of vision and language, especially when training data is scarce in few-shot settings, where only very limited data are available for training. In order to mitigate this issue, a multi-modal meta-learning framework has been proposed to bridge the gap between two frozen pretrained large vision and language models by introducing a tunable prompt connecting these two large models. For few-shot image captioning, the existing multi-model meta-learning framework utilizes a one-step prompting scheme to accumulate the visual features of input images to guide the language model, which struggles to generate accurate image descriptions with only a few training samples. Instead, we propose a chain-of-thought (CoT) meta-learning scheme as a multi-step image captioning procedure to better imitate how humans describe images. In addition, we further propose to learn different meta-parameters of the model corresponding to each CoT step in distinct subspaces to avoid interference. We evaluated our method on three commonly used image captioning datasets, i.e., MSCOCO, Flickr8k, and Flickr30k, under few-shot settings. The results of our experiments indicate that our chain-of-thought subspace meta-learning strategy is superior to the baselines in terms of performance across different datasets measured by different metrics.
中文: 本文提出了一种思维链元学习策略,通过模拟人类多步骤推理过程并在不同子空间中学习各步骤参数,有效弥合了少样本图像描述中的视觉与语言鸿沟,在多个数据集上优于现有基线方法。
English: A chain-of-thought meta-learning approach is introduced to bridge the vision-language gap in few-shot image captioning by simulating human-like multi-step reasoning and learning distinct subspaces for each step, outperforming existing methods across multiple datasets.

Authors:Pengxiang Lan, Haoyu Xu, Enneng Yang, Yuliang Liang, Guibing Guo, Jianzhe Zhao, Xingwei Wang
Title: Efficient and Effective Prompt Tuning via Prompt Decomposition and Compressed Outer Product
Abstract:
Prompt tuning (PT) offers a cost-effective alternative to fine-tuning large-scale pre-trained language models (PLMs), requiring only a few parameters in soft prompt tokens added before the input text. However, existing PT approaches face two significant issues: (i) They overlook intrinsic semantic associations between soft prompt tokens, leading to high discreteness and limited interactions, thus reducing the model's comprehension and effectiveness in complex tasks. (ii) Due to the complexity of downstream tasks, long soft prompt is necessitated to improve performance, but prompt length correlates positively with memory usage and computational costs. Achieving high efficiency and performance remains an ongoing challenge. To address these issues, we propose a novel Low-parameters prompt tuning (LAMP) method, which leverages prompt decomposition and compressed outer product. Specifically, the prompt decomposition module employs Truncated SVD to reduce training parameters and significantly lower the dimensionality of the soft prompt parameter space. It then utilizes a compressed outer product module to facilitate multiple interactions among prompt tokens, exploring their intrinsic associations to enhance knowledge representation. Finally, LAMP uses average pooling to reduce memory usage and training/inference time. Extensive experiments across six architectures and eight datasets demonstrate that LAMP outperforms state-of-the-art PT-based and LoRA-based methods in performance and efficiency.
中文: 提示调优是微调大型语言模型的轻量级替代方案,但存在提示词交互不足和长提示计算成本高的问题;LAMP方法通过分解提示和增强交互来解决这些问题,从而提升效率和性能。
English: Prompt tuning is a lightweight alternative to fine-tuning large language models but suffers from poor token interaction and high computational costs with long prompts, which the proposed LAMP method addresses by decomposing prompts and enhancing interactions to improve efficiency and performance.

Authors:Chaoyue Song, Jianfeng Zhang, Xiu Li, Fan Yang, Yiwen Chen, Zhongcong Xu, Jun Hao Liew, Xiaoyang Guo, Fayao Liu, Jiashi Feng, Guosheng Lin
Title: MagicArticulate: Make Your 3D Models Articulation-Ready
Abstract:
With the explosive growth of 3D content creation, there is an increasing demand for automatically converting static 3D models into articulation-ready versions that support realistic animation. Traditional approaches rely heavily on manual annotation, which is both time-consuming and labor-intensive. Moreover, the lack of large-scale benchmarks has hindered the development of learning-based solutions. In this work, we present MagicArticulate, an effective framework that automatically transforms static 3D models into articulation-ready assets. Our key contributions are threefold. First, we introduce Articulation-XL, a large-scale benchmark containing over 33k 3D models with high-quality articulation annotations, carefully curated from Objaverse-XL. Second, we propose a novel skeleton generation method that formulates the task as a sequence modeling problem, leveraging an auto-regressive transformer to naturally handle varying numbers of bones or joints within skeletons and their inherent dependencies across different 3D models. Third, we predict skinning weights using a functional diffusion process that incorporates volumetric geodesic distance priors between vertices and joints. Extensive experiments demonstrate that MagicArticulate significantly outperforms existing methods across diverse object categories, achieving high-quality articulation that enables realistic animation. Project page: https://chaoyuesong.github.io/MagicArticulate.
中文:MagicArticulate是一个创新框架,通过构建大规模标注数据集、采用基于Transformer的自回归骨架生成方法和结合体积测地距离先验的功能扩散过程,自动将静态3D模型转换为支持逼真动画的关节化资源,在各类物体上实现卓越的动画效果。
English: MagicArticulate is an innovative framework that automatically converts static 3D models into articulation-ready assets using a large-scale benchmark, a transformer-based skeleton generation method, and functional diffusion for skinning weights, achieving superior animation quality across diverse categories.

Authors:Zhenfang Chen, Delin Chen, Rui Sun, Wenjun Liu, Chuang Gan
Title: Scaling Autonomous Agents via Automatic Reward Modeling And Planning
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across a range of text-generation tasks. However, LLMs still struggle with problems requiring multi-step decision-making and environmental feedback, such as online shopping, scientific reasoning, and mathematical problem-solving. Unlike pure text data, collecting large-scale decision-making data is challenging. Moreover, many powerful LLMs are only accessible through APIs, which hinders their fine-tuning for agent tasks due to cost and complexity. To address LLM agents' limitations, we propose a framework that can automatically learn a reward model from the environment without human annotations. This model can be used to evaluate the action trajectories of LLM agents and provide heuristics for task planning. Specifically, our approach involves employing one LLM-based agent to navigate an environment randomly, generating diverse action trajectories. Subsequently, a separate LLM is leveraged to assign a task intent and synthesize a negative response alongside the correct response for each trajectory. These triplets (task intent, positive response, and negative response) are then utilized as training data to optimize a reward model capable of scoring action trajectories. The effectiveness and generalizability of our framework are demonstrated through evaluations conducted on different agent benchmarks. In conclusion, our proposed framework represents a significant advancement in enhancing LLM agents' decision-making capabilities. By automating the learning of reward models, we overcome the challenges of data scarcity and API limitations, potentially revolutionizing the application of LLMs in complex and interactive environments. This research paves the way for more sophisticated AI agents capable of tackling a wide range of real-world problems requiring multi-step decision-making.
中文: 本研究提出一种框架,通过从环境反馈中自动学习奖励模型来增强大语言模型代理的决策能力,无需人工标注即可解决数据稀缺和API限制问题。
English: This study introduces a framework that automatically learns a reward model from environmental feedback to enhance LLM agents' decision-making, addressing data scarcity and API limitations without human annotations.

Authors:Renhao Pei, Yihong Liu, Peiqin Lin, François Yvon, Hinrich Schütze
Title: Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu
Abstract:
In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT, as it can readily take advantage of linguistic resources such as grammar books and dictionaries. Such resources are usually selectively integrated into the prompt so that LLMs can directly perform translation without any specific training, via their in-context learning capability (ICL). However, the relative importance of each type of resource, e.g., dictionary, grammar book, and retrieved parallel examples, is not entirely clear. To address this gap, this study systematically investigates how each resource and its quality affect the translation performance, with the Manchu language as our case study. To remove any prior knowledge of Manchu encoded in the LLM parameters and single out the effect of ICL, we also experiment with an enciphered version of Manchu texts. Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help. In a follow-up study, we showcase a promising application of in-context MT: parallel data augmentation as a way to bootstrap a conventional MT model. When monolingual data abound, generating synthetic parallel data through in-context MT offers a pathway to mitigate data scarcity and build effective and efficient low-resource neural MT systems.
中文: 基于大语言模型的上下文机器翻译能有效利用词典和平行例句等语言资源提升低资源翻译效果,其中高质量词典和例句作用显著而语法书帮助甚微,该方法还能生成合成平行数据来增强传统机器翻译系统。
English: In-context machine translation with large language models effectively utilizes linguistic resources like dictionaries and parallel examples to improve low-resource translation, with high-quality dictionaries and examples proving most beneficial while grammars offer little help, and it can also generate synthetic parallel data to enhance conventional MT systems.

Authors:Alireza Nik, Michael A. Riegler, PÃ¥l Halvorsen
Title: Energy-Conscious LLM Decoding: Impact of Text Generation Strategies on GPU Energy Consumption
Abstract:
Decoding strategies significantly influence the quality and diversity of the generated texts in large language models (LLMs), yet their impact on computational resource consumption, particularly GPU energy usage, is insufficiently studied. This paper investigates the relationship between text generation decoding methods and energy efficiency, focusing on the trade-off between generation quality and GPU energy consumption across diverse tasks and decoding configurations. By benchmarking multiple strategies across different text generation tasks, such as Translation, Code Summarization, and Math Problem Solving, we reveal how selecting appropriate decoding techniques with their tuned hyperparameters affects text quality and has measurable implications for resource utilization, emphasizing the need for balanced optimization. To the best of our knowledge, this study is among the first to explore decoding strategies in LLMs through the lens of energy consumption, offering actionable insights for designing resource-aware applications that maintain high-quality text generation.
中文: 本研究探讨了大型语言模型中不同解码策略对生成文本质量与GPU能耗的影响,强调需通过平衡优化在保证高质量输出的同时降低资源消耗。
English: This study examines how different decoding strategies in large language models affect both the quality of generated text and GPU energy consumption, highlighting the need for balanced optimization to maintain high performance while reducing resource use.

Authors:Jiazhao Liang, Hao Huang, Yu Hao, Geeta Chandra Raju Bethala, Congcong Wen, John-Ross Rizzo, Yi Fang
Title: Integrating Retrospective Framework in Multi-Robot Collaboration
Abstract:
Recent advancements in Large Language Models (LLMs) have demonstrated substantial capabilities in enhancing communication and coordination in multi-robot systems. However, existing methods often struggle to achieve efficient collaboration and decision-making in dynamic and uncertain environments, which are common in real-world multi-robot scenarios. To address these challenges, we propose a novel retrospective actor-critic framework for multi-robot collaboration. This framework integrates two key components: (1) an actor that performs real-time decision-making based on observations and task directives, and (2) a critic that retrospectively evaluates the outcomes to provide feedback for continuous refinement, such that the proposed framework can adapt effectively to dynamic conditions. Extensive experiments conducted in simulated environments validate the effectiveness of our approach, demonstrating significant improvements in task performance and adaptability. This work offers a robust solution to persistent challenges in robotic collaboration.
中文摘要:本文提出了一种用于多机器人协作的回顾性行动者-评论家框架,通过实时决策与结果评估相结合来提升动态环境中的适应性,实验验证了该方法的有效性。
English Summary: This paper introduces a retrospective actor-critic framework for multi-robot collaboration that combines real-time decision-making with outcome evaluation to enhance adaptability in dynamic environments, with experimental results confirming its effectiveness.

Authors:Yuanfei Wang, Xiaojie Zhang, Ruihai Wu, Yu Li, Yan Shen, Mingdong Wu, Zhaofeng He, Yizhou Wang, Hao Dong
Title: AdaManip: Adaptive Articulated Object Manipulation Environments and Policy Learning
Abstract:
Articulated object manipulation is a critical capability for robots to perform various tasks in real-world scenarios. Composed of multiple parts connected by joints, articulated objects are endowed with diverse functional mechanisms through complex relative motions. For example, a safe consists of a door, a handle, and a lock, where the door can only be opened when the latch is unlocked. The internal structure, such as the state of a lock or joint angle constraints, cannot be directly observed from visual observation. Consequently, successful manipulation of these objects requires adaptive adjustment based on trial and error rather than a one-time visual inference. However, previous datasets and simulation environments for articulated objects have primarily focused on simple manipulation mechanisms where the complete manipulation process can be inferred from the object's appearance. To enhance the diversity and complexity of adaptive manipulation mechanisms, we build a novel articulated object manipulation environment and equip it with 9 categories of objects. Based on the environment and objects, we further propose an adaptive demonstration collection and 3D visual diffusion-based imitation learning pipeline that learns the adaptive manipulation policy. The effectiveness of our designs and proposed method is validated through both simulation and real-world experiments. Our project page is available at: https://adamanip.github.io
中文摘要:本研究构建了包含九类物体的新型关节物体操作环境,并提出基于自适应演示收集与三维视觉扩散的模仿学习方法,使机器人能够通过试错学习操作策略,并通过仿真和真实实验验证了有效性。
English Summary: This study introduces a novel articulated object manipulation environment with nine object categories and an adaptive imitation learning pipeline that enables robots to learn manipulation policies through trial and error, validated by both simulation and real-world experiments.

Authors:Yasir Ghunaim, Hasan Abed Al Kader Hammoud, Bernard Ghanem
Title: Towards Data-Efficient Pretraining for Atomic Property Prediction
Abstract:
This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected, task-relevant dataset can match or even surpass large-scale pretraining, while using as little as 1/24th of the computational cost. We introduce the Chemical Similarity Index (CSI), a novel metric inspired by computer vision's Fréchet Inception Distance, for molecular graphs which quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the most relevant dataset with minimal CSI distance, we show that models pretrained on a smaller, focused dataset consistently outperform those pretrained on massive, mixed datasets such as JMP, even when those larger datasets include the relevant dataset. Counterintuitively, we also find that indiscriminately adding more data can degrade model performance when the additional data poorly aligns with the task at hand. Our findings highlight that quality often outperforms quantity in pretraining for atomic property prediction.
中文: 本研究证明,通过新型化学相似性指数筛选的小型任务相关数据集进行预训练,仅需极低计算成本即可超越大规模预训练效果,揭示了原子性质预测中质量优于数量的规律。
English: This study demonstrates that pretraining on a small, task-relevant dataset using a novel Chemical Similarity Index can outperform large-scale pretraining with only a fraction of computational cost, emphasizing that quality trumps quantity in atomic property prediction.

Authors:Dawid Malarz, Artur Kasymov, Maciej Zięba, Jacek Tabor, Przemysław Spurek
Title: Classifier-free Guidance with Adaptive Scaling
Abstract:
Classifier-free guidance (CFG) is an essential mechanism in contemporary text-driven diffusion models. In practice, in controlling the impact of guidance we can see the trade-off between the quality of the generated images and correspondence to the prompt. When we use strong guidance, generated images fit the conditioned text perfectly but at the cost of their quality. Dually, we can use small guidance to generate high-quality results, but the generated images do not suit our prompt. In this paper, we present $β$-CFG ($β$-adaptive scaling in Classifier-Free Guidance), which controls the impact of guidance during generation to solve the above trade-off. First, $β$-CFG stabilizes the effects of guiding by gradient-based adaptive normalization. Second, $β$-CFG uses the family of single-modal ($β$-distribution), time-dependent curves to dynamically adapt the trade-off between prompt matching and the quality of samples during the diffusion denoising process. Our model obtained better FID scores, maintaining the text-to-image CLIP similarity scores at a level similar to that of the reference CFG.
无分类器引导(CFG)在文本驱动扩散模型中存在图像质量与提示匹配之间的权衡,而提出的$β$-CFG方法通过生成过程中自适应调整引导强度,有效平衡了这两方面因素。
Classifier-free guidance (CFG) in text-driven diffusion models faces a trade-off between image quality and prompt alignment, which the proposed $β$-CFG method addresses by adaptively scaling guidance during generation to balance these factors effectively.

Authors:Jiuyu Liu, Chunmei Xu, Yi Ma, Rahim Tafazolli, Ahmed Elzanaty
Title: ELAA-ISAC: Environmental Mapping Utilizing the LoS State of Communication Channel
Abstract:
In this paper, a novel environmental mapping method is proposed to outline the indoor layout utilizing the line-of-sight (LoS) state information of extremely large aperture array (ELAA) channels. It leverages the spatial resolution provided by ELAA and the mobile terminal (MT)'s mobility to infer the presence and location of obstacles in the environment. The LoS state estimation is formulated as a binary hypothesis testing problem, and the optimal decision rule is derived based on the likelihood ratio test. Subsequently, the theoretical error probability of LoS estimation is derived, showing close alignment with simulation results. Then, an environmental mapping method is proposed, which progressively outlines the layout by combining LoS state information from multiple MT locations. It is demonstrated that the proposed method can accurately outline the environment layout, with the mapping accuracy improving as the number of service-antennas and MT locations increases. This paper also investigates the impact of channel estimation error and non-LoS (NLoS) components on the quality of environmental mapping. The proposed method exhibits particularly promising performance in LoS dominated wireless environments characterized by high Rician K-factor. Specifically, it achieves an average intersection over union (IoU) exceeding 80% when utilizing 256 service antennas and 18 MT locations.
Chinese: 本文提出了一种新型环境映射方法,利用极大孔径阵列信道的视距状态信息和移动终端流动性来精确勾勒室内布局,在使用256个服务天线和18个移动终端位置时,平均交并比超过80%。
English: This paper introduces a novel environmental mapping method that uses line-of-sight state information from extremely large aperture array channels and mobile terminal mobility to accurately outline indoor layouts, achieving over 80% average intersection over union with sufficient antennas and locations.

Authors:Yaming Yang, Zhe Wang, Ziyu Guan, Wei Zhao, Xinyan Huang, Xiaofei He
Title: Unsupervised Entity Alignment Based on Personalized Discriminative Rooted Tree
Abstract:
Entity Alignment (EA) is to link potential equivalent entities across different knowledge graphs (KGs). Most existing EA methods are supervised as they require the supervision of seed alignments, i.e., manually specified aligned entity pairs. Very recently, several EA studies have made some attempts to get rid of seed alignments. Despite achieving preliminary progress, they still suffer two limitations: (1) The entity embeddings produced by their GNN-like encoders lack personalization since some of the aggregation subpaths are shared between different entities. (2) They cannot fully alleviate the distribution distortion issue between candidate KGs due to the absence of the supervised signal. In this work, we propose a novel unsupervised entity alignment approach called UNEA to address the above two issues. First, we parametrically sample a tree neighborhood rooted at each entity, and accordingly develop a tree attention aggregation mechanism to extract a personalized embedding for each entity. Second, we introduce an auxiliary task of maximizing the mutual information between the input and the output of the KG encoder, to regularize the model and prevent the distribution distortion. Extensive experiments show that our UNEA achieves a new state-of-the-art for the unsupervised EA task, and can even outperform many existing supervised EA baselines.
中文摘要:提出的UNEA方法通过个性化树注意力嵌入和互信息最大化解决无监督实体对齐中的分布失真问题,实现了最先进的性能。
English Summary: The proposed UNEA method addresses limitations in unsupervised entity alignment by using personalized tree attention embeddings and mutual information maximization to prevent distribution distortion, achieving state-of-the-art performance.

Authors:Yaqian Chen, Hanxue Gu, Yuwen Chen, Jichen Yang, Haoyu Dong, Joseph Y. Cao, Adrian Camarena, Christopher Mantyh, Roy Colglazier, Maciej A. Mazurowski
Title: Automated Muscle and Fat Segmentation in Computed Tomography for Comprehensive Body Composition Analysis
Abstract:
Body composition assessment using CT images can potentially be used for a number of clinical applications, including the prognostication of cardiovascular outcomes, evaluation of metabolic health, monitoring of disease progression, assessment of nutritional status, prediction of treatment response in oncology, and risk stratification for surgical and critical care outcomes. While multiple groups have developed in-house segmentation tools for this analysis, there are very limited publicly available tools that could be consistently used across different applications. To mitigate this gap, we present a publicly accessible, end-to-end segmentation and feature calculation model specifically for CT body composition analysis. Our model performs segmentation of skeletal muscle, subcutaneous adipose tissue (SAT), and visceral adipose tissue (VAT) across the chest, abdomen, and pelvis area in axial CT images. It also provides various body composition metrics, including muscle density, visceral-to-subcutaneous fat (VAT/SAT) ratio, muscle area/volume, and skeletal muscle index (SMI), supporting both 2D and 3D assessments. To evaluate the model, the segmentation was applied to both internal and external datasets, with body composition metrics analyzed across different age, sex, and race groups. The model achieved high dice coefficients on both internal and external datasets, exceeding 89% for skeletal muscle, SAT, and VAT segmentation. The model outperforms the benchmark by 2.40% on skeletal muscle and 10.26% on SAT compared to the manual annotations given by the publicly available dataset. Body composition metrics show mean relative absolute errors (MRAEs) under 10% for all measures. Furthermore, the model provided muscular fat segmentation with a Dice coefficient of 56.27%, which can be utilized for additional analyses as needed.
中文: 本研究推出了一种公开可用的端到端CT体成分分析模型,能精确分割躯干多个区域的骨骼肌、皮下及内脏脂肪组织,提供全面的体成分指标,并在内外验证中表现出色,关键组织的Dice系数超过89%。
English: This study introduces a publicly available, end-to-end model for CT body composition analysis that accurately segments skeletal muscle, subcutaneous and visceral adipose tissues across multiple body regions, providing comprehensive metrics and demonstrating high performance in both internal and external validations with dice coefficients over 89% for key tissues.

Authors:Youming Deng, Wenqi Xian, Guandao Yang, Leonidas Guibas, Gordon Wetzstein, Steve Marschner, Paul Debevec
Title: Self-Calibrating Gaussian Splatting for Large Field of View Reconstruction
Abstract:
In this paper, we present a self-calibrating framework that jointly optimizes camera parameters, lens distortion and 3D Gaussian representations, enabling accurate and efficient scene reconstruction. In particular, our technique enables high-quality scene reconstruction from Large field-of-view (FOV) imagery taken with wide-angle lenses, allowing the scene to be modeled from a smaller number of images. Our approach introduces a novel method for modeling complex lens distortions using a hybrid network that combines invertible residual networks with explicit grids. This design effectively regularizes the optimization process, achieving greater accuracy than conventional camera models. Additionally, we propose a cubemap-based resampling strategy to support large FOV images without sacrificing resolution or introducing distortion artifacts. Our method is compatible with the fast rasterization of Gaussian Splatting, adaptable to a wide variety of camera lens distortion, and demonstrates state-of-the-art performance on both synthetic and real-world datasets.
中文: 本文提出了一种自校准框架,通过联合优化相机参数、镜头畸变和3D高斯表示,结合混合网络和立方体贴图重采样策略,实现了从广角图像进行高质量场景重建,并在合成与真实数据集上达到最优性能。
English: This paper introduces a self-calibrating framework that jointly optimizes camera parameters, lens distortion, and 3D Gaussian representations for high-quality scene reconstruction from wide-angle images, achieving state-of-the-art accuracy through a hybrid network and cubemap-based resampling strategy.

Authors:Anjian Li, Sangjae Bae, David Isele, Ryne Beeson, Faizan M. Tariq
Title: Predictive Planner for Autonomous Driving with Consistency Models
Abstract:
Trajectory prediction and planning are essential for autonomous vehicles to navigate safely and efficiently in dynamic environments. Traditional approaches often treat them separately, limiting the ability for interactive planning. While recent diffusion-based generative models have shown promise in multi-agent trajectory generation, their slow sampling is less suitable for high-frequency planning tasks. In this paper, we leverage the consistency model to build a predictive planner that samples from a joint distribution of ego and surrounding agents, conditioned on the ego vehicle's navigational goal. Trained on real-world human driving datasets, our consistency model generates higher-quality trajectories with fewer sampling steps than standard diffusion models, making it more suitable for real-time deployment. To enforce multiple planning constraints simultaneously on the ego trajectory, a novel online guided sampling approach inspired by the Alternating Direction Method of Multipliers (ADMM) is introduced. Evaluated on the Waymo Open Motion Dataset (WOMD), our method enables proactive behavior such as nudging and yielding, and also demonstrates smoother, safer, and more efficient trajectories and satisfaction of multiple constraints under a limited computational budget.
中文摘要:本文提出了一种基于一致性模型的预测规划器,通过整合轨迹预测与规划,在实时条件下为自动驾驶车辆生成高质量交互轨迹,并采用新型引导采样方法确保多重规划约束的同时满足。
English Summary: This paper introduces a consistency model-based predictive planner that generates high-quality, interactive trajectories for autonomous vehicles in real-time by integrating trajectory prediction and planning, while enforcing multiple constraints through a novel guided sampling approach.

Authors:Senkang Hu, Yihang Tao, Zihan Fang, Guowen Xu, Yiqin Deng, Sam Kwong, Yuguang Fang
Title: CP-Guard+: A New Paradigm for Malicious Agent Detection and Defense in Collaborative Perception
Abstract:
Collaborative perception (CP) is a promising method for safe connected and autonomous driving, which enables multiple vehicles to share sensing information to enhance perception performance. However, compared with single-vehicle perception, the openness of a CP system makes it more vulnerable to malicious attacks that can inject malicious information to mislead the perception of an ego vehicle, resulting in severe risks for safe driving. To mitigate such vulnerability, we first propose a new paradigm for malicious agent detection that effectively identifies malicious agents at the feature level without requiring verification of final perception results, significantly reducing computational overhead. Building on this paradigm, we introduce CP-GuardBench, the first comprehensive dataset provided to train and evaluate various malicious agent detection methods for CP systems. Furthermore, we develop a robust defense method called CP-Guard+, which enhances the margin between the representations of benign and malicious features through a carefully designed Dual-Centered Contrastive Loss (DCCLoss). Finally, we conduct extensive experiments on both CP-GuardBench and V2X-Sim, and demonstrate the superiority of CP-Guard+.
Chinese: 协作感知系统易受恶意攻击,但提出的CP-Guard+方法通过新数据集和对比损失在特征层面有效检测恶意代理,实验证明其优越性能。
English: Collaborative perception systems for autonomous driving are vulnerable to malicious attacks, but the proposed CP-Guard+ method effectively detects malicious agents at the feature level using a novel dataset and contrastive loss, demonstrating superior performance in experiments.

Authors:Saurav Sharma, Maria Vannucci, Leonardo Pestana Legori, Mario Scaglia, Giovanni Guglielmo Laracca, Didier Mutter, Sergio Alfieri, Pietro Mascagni, Nicolas Padoy
Title: Early Operative Difficulty Assessment in Laparoscopic Cholecystectomy via Snapshot-Centric Video Analysis
Abstract:
Purpose: Laparoscopic cholecystectomy (LC) operative difficulty (LCOD) is highly variable and influences outcomes. Despite extensive LC studies in surgical workflow analysis, limited efforts explore LCOD using intraoperative video data. Early recognition of LCOD could allow prompt review by expert surgeons, enhance operating room (OR) planning, and improve surgical outcomes. Methods: We propose the clinical task of early LCOD assessment using limited video observations. We design SurgPrOD, a deep learning model to assess LCOD by analyzing features from global and local temporal resolutions (snapshots) of the observed LC video. Also, we propose a novel snapshot-centric attention (SCA) module, acting across snapshots, to enhance LCOD prediction. We introduce the CholeScore dataset, featuring video-level LCOD labels to validate our method. Results: We evaluate SurgPrOD on 3 LCOD assessment scales in the CholeScore dataset. On our new metric assessing early and stable correct predictions, SurgPrOD surpasses baselines by at least 0.22 points. SurgPrOD improves over baselines by at least 9 and 5 percentage points in F1 score and top1-accuracy, respectively, demonstrating its effectiveness in correct predictions. Conclusion: We propose a new task for early LCOD assessment and a novel model, SurgPrOD analyzing surgical video from global and local perspectives. Our results on the CholeScore dataset establishes a new benchmark to study LCOD using intraoperative video data.
中文: 本研究提出SurgPrOD深度学习模型,通过分析腹腔镜胆囊切除术视频的全局和局部特征,实现了对手术难度的早期评估,在CholeScore数据集上表现优于基线方法并建立了新标准。
English: This study introduces SurgPrOD, a deep learning model for early assessment of laparoscopic cholecystectomy operative difficulty using limited video data, which outperforms baselines in accuracy and establishes a new benchmark with the CholeScore dataset.

Authors:Boqun Zhao, Chongjun Ouyang, Xingqi Zhang, Hyundong Shin, Yuanwei Liu
Title: Downlink and Uplink ISAC in Continuous-Aperture Array (CAPA) Systems
Abstract:
A continuous-aperture array (CAPA)-based integrated sensing and communications (ISAC) framework is proposed for both downlink and uplink scenarios. Within this framework, continuous operator-based signal models are employed to describe the sensing and communication processes. The performance of communication and sensing is analyzed using two information-theoretic metrics: the communication rate (CR) and the sensing rate (SR). 1) For downlink ISAC, three continuous beamforming designs are proposed: i) the communications-centric (C-C) design that maximizes the CR, ii) the sensing-centric (S-C) design that maximizes the SR, and iii) the Pareto-optimal design that characterizes the Pareto boundary of the CR-SR region. A low-complexity signal subspace-based approach is proposed to derive the closed-form optimal beamformers for the considered designs. On this basis, closed-form expressions are derived for the achievable CRs and SRs, and the downlink rate region achieved by CAPAs is characterized. 2) For uplink ISAC, the C-C and S-C successive interference cancellation-based methods are proposed to manage inter-functionality interference. Using the subspace approach closed-form expressions for the optimal detectors as well as the achievable CRs and SRs are derived. The uplink SR-CR region is characterized based on the time-sharing technique. Numerical results demonstrate that, for both downlink and uplink, CAPA-based ISAC achieves higher CRs and SRs as well as larger CR-SR regions compared to conventional spatially discrete array-based ISAC.
中文: 该研究提出了一种基于连续孔径阵列的通感一体化框架,采用连续信号模型和信息论指标优化下行链路的波束成形设计与上行链路的检测方法,相比传统离散阵列实现了更高的通信速率、感知速率及更广的速率区域。
English: The proposed continuous-aperture array-based integrated sensing and communications framework employs continuous signal models and information-theoretic metrics to optimize beamforming designs for downlink and detectors for uplink, demonstrating superior performance over conventional arrays through higher communication and sensing rates with expanded rate regions.

Authors:Anna Tegon, Thorir Mar Ingolfsson, Xiaying Wang, Luca Benini, Yawei Li
Title: FEMBA: Efficient and Scalable EEG Analysis with a Bidirectional Mamba Foundation Model
Abstract:
Accurate and efficient electroencephalography (EEG) analysis is essential for detecting seizures and artifacts in long-term monitoring, with applications spanning hospital diagnostics to wearable health devices. Robust EEG analytics have the potential to greatly improve patient care. However, traditional deep learning models, especially Transformer-based architectures, are hindered by their quadratic time and memory complexity, making them less suitable for resource-constrained environments. To address these challenges, we present FEMBA (Foundational EEG Mamba + Bidirectional Architecture), a novel self-supervised framework that establishes new efficiency benchmarks for EEG analysis through bidirectional state-space modeling. Unlike Transformer-based models, which incur quadratic time and memory complexity, FEMBA scales linearly with sequence length, enabling more scalable and efficient processing of extended EEG recordings. Trained on over 21,000 hours of unlabeled EEG and fine-tuned on three downstream tasks, FEMBA achieves competitive performance in comparison with transformer models, with significantly lower computational cost. Specifically, it reaches 81.82% balanced accuracy (0.8921 AUROC) on TUAB and 0.949 AUROC on TUAR, while a tiny 7.8M-parameter variant demonstrates viability for resource-constrained devices. These results pave the way for scalable, general-purpose EEG analytics in both clinical and highlight FEMBA as a promising candidate for wearable applications.
中文: FEMBA提出了一种基于双向状态空间建模的自监督框架,实现了线性扩展的高效脑电图分析,在计算成本显著低于Transformer模型的同时保持了竞争力,并展现出在可穿戴设备中的应用潜力。
English: FEMBA introduces a self-supervised framework using bidirectional state-space modeling to enable linear-scaling, efficient EEG analysis, achieving competitive performance with lower computational costs than Transformers and demonstrating viability for wearable devices.

Authors:Junyu Lu, Kai Ma, Kaichun Wang, Kelaiti Xiao, Roy Ka-Wei Lee, Bo Xu, Liang Yang, Hongfei Lin
Title: Is LLM an Overconfident Judge? Unveiling the Capabilities of LLMs in Detecting Offensive Language with Annotation Disagreement
Abstract:
Large Language Models (LLMs) have become essential for offensive language detection, yet their ability to handle annotation disagreement remains underexplored. Disagreement samples, which arise from subjective interpretations, pose a unique challenge due to their ambiguous nature. Understanding how LLMs process these cases, particularly their confidence levels, can offer insight into their alignment with human annotators. This study systematically evaluates the performance of multiple LLMs in detecting offensive language at varying levels of annotation agreement. We analyze binary classification accuracy, examine the relationship between model confidence and human disagreement, and explore how disagreement samples influence model decision-making during few-shot learning and instruction fine-tuning. Our findings reveal that LLMs struggle with low-agreement samples, often exhibiting overconfidence in these ambiguous cases. However, utilizing disagreement samples in training improves both detection accuracy and model alignment with human judgment. These insights provide a foundation for enhancing LLM-based offensive language detection in real-world moderation tasks.
中文: 研究发现大语言模型在处理低一致性冒犯性语言样本时表现不佳且常过度自信,但在训练中使用这些样本可提升检测准确性及与人类判断的一致性。
English: This study finds that large language models (LLMs) struggle with low-agreement offensive language samples, often showing overconfidence, but incorporating these samples in training improves detection accuracy and alignment with human judgment.

Authors:Amir Saeidi, Yiran Luo, Agneet Chatterjee, Shamanthak Hegde, Bimsara Pathiraja, Yezhou Yang, Chitta Baral
Title: Dual Caption Preference Optimization for Diffusion Models
Abstract:
Recent advancements in human preference optimization, originally developed for Large Language Models (LLMs), have shown significant potential in improving text-to-image diffusion models. These methods aim to learn the distribution of preferred samples while distinguishing them from less preferred ones. However, existing preference datasets often exhibit overlap between these distributions, leading to a conflict distribution. Additionally, we identified that input prompts contain irrelevant information for less preferred images, limiting the denoising network's ability to accurately predict noise in preference optimization methods, known as the irrelevant prompt issue. To address these challenges, we propose Dual Caption Preference Optimization (DCPO), a novel approach that utilizes two distinct captions to mitigate irrelevant prompts. To tackle conflict distribution, we introduce the Pick-Double Caption dataset, a modified version of Pick-a-Pic v2 with separate captions for preferred and less preferred images. We further propose three different strategies for generating distinct captions: captioning, perturbation, and hybrid methods. Our experiments show that DCPO significantly improves image quality and relevance to prompts, outperforming Stable Diffusion (SD) 2.1, SFT_Chosen, Diffusion-DPO, and MaPO across multiple metrics, including Pickscore, HPSv2.1, GenEval, CLIPscore, and ImageReward, fine-tuned on SD 2.1 as the backbone.
Chinese Summary: 提出的双标题偏好优化(DCPO)方法通过使用双标题和专门设计的数据集,解决了偏好优化中的分布冲突和无关提示问题,在多项评估指标上显著优于现有模型。
English Summary: The proposed Dual Caption Preference Optimization (DCPO) method addresses distribution conflicts and irrelevant prompts in preference optimization by using dual captions and a specially designed dataset, significantly outperforming existing models across multiple evaluation metrics.

Authors:Shantian Qin, Ziqing Qiang, Zhihua Fan, Wenming Li, Xuejun An, Xiaochun Ye, Dongrui Fan
Title: StreamDCIM: A Tile-based Streaming Digital CIM Accelerator with Mixed-stationary Cross-forwarding Dataflow for Multimodal Transformer
Abstract:
Multimodal Transformers are emerging artificial intelligence (AI) models designed to process a mixture of signals from diverse modalities. Digital computing-in-memory (CIM) architectures are considered promising for achieving high efficiency while maintaining high accuracy. However, current digital CIM-based accelerators exhibit inflexibility in microarchitecture, dataflow, and pipeline to effectively accelerate multimodal Transformer. In this paper, we propose StreamDCIM, a tile-based streaming digital CIM accelerator for multimodal Transformers. It overcomes the above challenges with three features: First, we present a tile-based reconfigurable CIM macro microarchitecture with normal and hybrid reconfigurable modes to improve intra-macro CIM utilization. Second, we implement a mixed-stationary cross-forwarding dataflow with tile-based execution decoupling to exploit tile-level computation parallelism. Third, we introduce a ping-pong-like fine-grained compute-rewriting pipeline to overlap high-latency on-chip CIM rewriting. Experimental results show that StreamDCIM outperforms non-streaming and layer-based streaming CIM-based solutions by geomean 2.63$\times$ and 1.28$\times$ on typical multimodal Transformer models.
Chinese: StreamDCIM是一种基于图块的流式数字内存计算加速器,专为多模态Transformer设计,通过可重构微架构、混合静态数据流和细粒度流水线技术,在典型模型上实现了超越现有方案的显著性能提升。
English: StreamDCIM is a tile-based streaming digital computing-in-memory accelerator designed for multimodal Transformers, featuring reconfigurable microarchitecture, mixed-stationary dataflow, and fine-grained pipelining to achieve significant performance improvements over existing solutions.

Authors:Weihao Cui, Ji Zhang, Han Zhao, Chao Liu, Wenhao Zhang, Jian Sha, Quan Chen, Bingsheng He, Minyi Guo
Title: XPUTimer: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale
Abstract:
The rapid proliferation of large language models has driven the need for efficient GPU training clusters. However, ensuring high-performance training in these clusters is challenging due to the complexity of software-hardware interactions and the frequent occurrence of training anomalies. Since existing diagnostic tools are narrowly tailored to specific issues, there are gaps in their ability to address anomalies spanning the entire training stack. In response, we introduce XPUTimer, a real-time diagnostic framework designed for distributed LLM training at scale. XPUTimer first integrates a lightweight tracing daemon to monitor key code segments with minimal overhead. Additionally, it features a diagnostic engine that employs novel intra-kernel tracing and holistic aggregated metrics to efficiently identify and resolve anomalies. Deployment of XPUTimer across 6,000 GPUs over eight months demonstrated significant improvements across the training stack, validating its effectiveness in real-world scenarios.
中文: XPUTimer是一个实时诊断框架,通过集成轻量级追踪和诊断引擎,有效识别并解决大规模分布式大语言模型训练中的异常问题,在6000个GPU上验证了其显著性能提升。
English: XPUTimer is a real-time diagnostic framework that integrates lightweight tracing and a diagnostic engine to efficiently identify and resolve anomalies in large-scale distributed LLM training, demonstrating significant improvements across 6,000 GPUs.

Authors:Geliang Ouyang, Jingyao Chen, Zhihe Nie, Yi Gui, Yao Wan, Hongyu Zhang, Dongping Chen
Title: nvAgent: Automated Data Visualization from Natural Language via Collaborative Agent Workflow
Abstract:
Natural Language to Visualization (NL2Vis) seeks to convert natural-language descriptions into visual representations of given tables, empowering users to derive insights from large-scale data. Recent advancements in Large Language Models (LLMs) show promise in automating code generation to transform tabular data into accessible visualizations. However, they often struggle with complex queries that require reasoning across multiple tables. To address this limitation, we propose a collaborative agent workflow, termed nvAgent, for NL2Vis. Specifically, nvAgent comprises three agents: a processor agent for database processing and context filtering, a composer agent for planning visualization generation, and a validator agent for code translation and output verification. Comprehensive evaluations on the new VisEval benchmark demonstrate that nvAgent consistently surpasses state-of-the-art baselines, achieving a 7.88% improvement in single-table and a 9.23% improvement in multi-table scenarios. Qualitative analyses further highlight that nvAgent maintains nearly a 20% performance margin over previous models, underscoring its capacity to produce high-quality visual representations from complex, heterogeneous data sources.
中文: NL2Vis旨在将自然语言转换为表格数据的可视化,而提出的nvAgent框架通过三个专业代理的协作流程,显著超越现有方法,在单表和多表场景下分别提升准确率7.88%和9.23%。
English: NL2Vis aims to translate natural language into visualizations from tabular data, and the proposed nvAgent framework, featuring a collaborative workflow of three specialized agents, significantly outperforms existing methods by improving accuracy by 7.88% for single-table and 9.23% for multi-table scenarios.

Authors:Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu
Title: Goku: Flow Based Video Generative Foundation Models
Abstract:
This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.
中文: 本文介绍了Goku模型系列,它采用整流流Transformer实现联合图像与视频生成,在定性和定量评估中均创下行业新标杆。
English: This paper presents Goku, a cutting-edge family of joint image-and-video generation models using rectified flow Transformers that set new industry benchmarks in both qualitative and quantitative evaluations.

Authors:Leixian Shen, Haotian Li, Yun Wang, Huamin Qu
Title: Reflecting on Design Paradigms of Animated Data Video Tools
Abstract:
Animated data videos have gained significant popularity in recent years. However, authoring data videos remains challenging due to the complexity of creating and coordinating diverse components (e.g., visualization, animation, audio, etc.). Although numerous tools have been developed to streamline the process, there is a lack of comprehensive understanding and reflection of their design paradigms to inform future development. To address this gap, we propose a framework for understanding data video creation tools along two dimensions: what data video components to create and coordinate, including visual, motion, narrative, and audio components, and how to support the creation and coordination. By applying the framework to analyze 46 existing tools, we summarized key design paradigms of creating and coordinating each component based on the varying work distribution for humans and AI in these tools. Finally, we share our detailed reflections, highlight gaps from a holistic view, and discuss future directions to address them.
Chinese: 本文提出一个分析数据视频创作工具的框架,从处理组件和协调方式两个维度出发,通过研究46种工具总结了关键设计模式,并指出了未来发展的方向与改进空间。
English: This paper introduces a framework to analyze data video creation tools by examining what components they handle and how they support their coordination, identifying design paradigms through a review of 46 tools and suggesting future improvements.

Authors:Chao Feng, Yunlong Li, Yuanzhe Gao, Alberto Huertas Celdrán, Jan von der Assen, Gérôme Bovet, Burkhard Stiller
Title: DMPA: Model Poisoning Attacks on Decentralized Federated Learning for Model Differences
Abstract:
Federated learning (FL) has garnered significant attention as a prominent privacy-preserving Machine Learning (ML) paradigm. Decentralized FL (DFL) eschews traditional FL's centralized server architecture, enhancing the system's robustness and scalability. However, these advantages of DFL also create new vulnerabilities for malicious participants to execute adversarial attacks, especially model poisoning attacks. In model poisoning attacks, malicious participants aim to diminish the performance of benign models by creating and disseminating the compromised model. Existing research on model poisoning attacks has predominantly concentrated on undermining global models within the Centralized FL (CFL) paradigm, while there needs to be more research in DFL. To fill the research gap, this paper proposes an innovative model poisoning attack called DMPA. This attack calculates the differential characteristics of multiple malicious client models and obtains the most effective poisoning strategy, thereby orchestrating a collusive attack by multiple participants. The effectiveness of this attack is validated across multiple datasets, with results indicating that the DMPA approach consistently surpasses existing state-of-the-art FL model poisoning attack strategies.
中文: 本文提出了一种名为DMPA的创新模型投毒攻击方法,通过计算多个恶意客户端模型的差异特征来制定最优投毒策略,在多个数据集上的实验表明该攻击方法始终优于现有最先进的联邦学习投毒策略。
English: This paper introduces DMPA, an innovative model poisoning attack in decentralized federated learning that leverages differential characteristics among malicious client models to orchestrate collusive attacks, demonstrating superior effectiveness over existing methods across multiple datasets.

Authors:Xihao Yuan, Siqi Liu, Hanting Chen, Lu Zhou, Jian Li, Jie Hu
Title: Dynamic Frequency-Adaptive Knowledge Distillation for Speech Enhancement
Abstract:
Deep learning-based speech enhancement (SE) models have recently outperformed traditional techniques, yet their deployment on resource-constrained devices remains challenging due to high computational and memory demands. This paper introduces a novel dynamic frequency-adaptive knowledge distillation (DFKD) approach to effectively compress SE models. Our method dynamically assesses the model's output, distinguishing between high and low-frequency components, and adapts the learning objectives to meet the unique requirements of different frequency bands, capitalizing on the SE task's inherent characteristics. To evaluate the DFKD's efficacy, we conducted experiments on three state-of-the-art models: DCCRN, ConTasNet, and DPTNet. The results demonstrate that our method not only significantly enhances the performance of the compressed model (student model) but also surpasses other logit-based knowledge distillation methods specifically for SE tasks.
中文摘要:本文提出一种动态频率自适应知识蒸馏方法,通过针对不同频段调整学习目标来有效压缩语音增强模型,其性能优于其他蒸馏技术。
English Summary: This paper presents a dynamic frequency-adaptive knowledge distillation method that effectively compresses speech enhancement models by adapting learning objectives to different frequency bands, achieving superior performance over other distillation techniques.

Authors:Zhenglin Zhou, Xiaobo Xia, Fan Ma, Hehe Fan, Yi Yang, Tat-Seng Chua
Title: DreamDPO: Aligning Text-to-3D Generation with Human Preferences via Direct Preference Optimization
Abstract:
Text-to-3D generation automates 3D content creation from textual descriptions, which offers transformative potential across various fields. However, existing methods often struggle to align generated content with human preferences, limiting their applicability and flexibility. To address these limitations, in this paper, we propose DreamDPO, an optimization-based framework that integrates human preferences into the 3D generation process, through direct preference optimization. Practically, DreamDPO first constructs pairwise examples, then compare their alignment with human preferences using reward or large multimodal models, and lastly optimizes the 3D representation with a preference-driven loss function. By leveraging pairwise comparison to reflect preferences, DreamDPO reduces reliance on precise pointwise quality evaluations while enabling fine-grained controllability through preference-guided optimization. Experiments demonstrate that DreamDPO achieves competitive results, and provides higher-quality and more controllable 3D content compared to existing methods. The code and models will be open-sourced.
中文: DreamDPO是一种基于优化的框架,通过成对比较和偏好驱动优化将人类偏好融入文本到3D生成过程,相比现有方法能产生更高质量且可控性更强的3D内容。
English: DreamDPO is an optimization-based framework that integrates human preferences into text-to-3D generation through pairwise comparisons and preference-driven optimization, producing higher-quality and more controllable 3D content than existing methods.

Authors:Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, Lin Xiao, Yuandong Tian, Bilge Soran, Raghuraman Krishnamoorthi, Tijmen Blankevoort, Vikas Chandra
Title: ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization
Abstract:
The optimal bit-width for achieving the best trade-off between quantized model size and accuracy has been a subject of ongoing debate. While some advocate for 4-bit quantization, others propose that 1.58-bit offers superior results. However, the lack of a cohesive framework for different bits has left such conclusions relatively tenuous. We present ParetoQ, the first unified framework that facilitates rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. Our findings reveal a notable learning transition between 2 and 3 bits: For 3-bits and above, the fine-tuned models stay close to their original pre-trained distributions, whereas for learning 2-bit networks or below, the representations change drastically. By optimizing training schemes and refining quantization functions, ParetoQ surpasses all previous methods tailored to specific bit widths. Remarkably, our ParetoQ ternary 600M-parameter model even outperforms the previous SoTA ternary 3B-parameter model in accuracy, using only one-fifth of the parameters. Extensive experimentation shows that ternary, 2-bit, and 3-bit quantization maintains comparable performance in the size-accuracy trade-off and generally exceeds 4-bit and binary quantization. Considering hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup.
中文: ParetoQ作为首个统一框架,可在不同量化比特间进行严谨比较,揭示了2至3比特间的关键学习转变,并以更少参数实现了优于以往方法的精度。
English: ParetoQ is a unified framework enabling rigorous comparisons across various quantization bit widths, revealing a critical learning transition between 2 and 3 bits and achieving superior accuracy with fewer parameters than previous methods.

Authors:Maohao Shen, Guangtao Zeng, Zhenting Qi, Zhang-Wei Hong, Zhenfang Chen, Wei Lu, Gregory Wornell, Subhro Das, David Cox, Chuang Gan
Title: Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
Abstract:
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: Can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose the Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibits strong generalization to out-of-domain tasks. Code, data, and models are fully open-sourced.
中文: 本研究提出行动思维链推理和两阶段训练方法,旨在增强单个大语言模型的自主推理能力,最终开发的70亿参数模型Satori在数学推理任务中达到顶尖水平并展现出强大的泛化能力。
English: This research introduces Chain-of-Action-Thought reasoning and a two-stage training method to enhance a single LLM's autonomous reasoning, resulting in the 7B model Satori that achieves top performance in mathematical reasoning and strong generalization.

Authors:Hongxin Li, Jingfan Chen, Jingran Su, Yuntao Chen, Qing Li, Zhaoxiang Zhang
Title: AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs
Abstract:
User interface understanding with vision-language models (VLMs) has received much attention due to its potential for enhancing software automation. However, existing datasets used to build UI-VLMs either only contain large-scale context-free element annotations or contextualized functional descriptions for elements at a small scale. In this work, we propose the \textbf{AutoGUI} pipeline for automatically annotating UI elements with detailed functionality descriptions at scale. Specifically, we leverage large language models (LLMs) to infer element functionality by comparing UI state changes before and after simulated interactions. To improve annotation quality, we propose LLM-aided rejection and verification, eliminating invalid annotations without human labor. We construct a high-quality AutoGUI-704k dataset using the proposed pipeline, featuring diverse and detailed functionality annotations that are hardly provided by previous datasets. Human evaluation shows that we achieve annotation correctness comparable to a trained human annotator. Extensive experiments show that our dataset remarkably enhances VLM's UI grounding capabilities and exhibits significant scaling effects. We also show the interesting potential use of our dataset in UI agent tasks. Please view our project at https://autogui-project.github.io/.
中文:AutoGUI 管道利用大语言模型自动为界面元素添加详细功能标注,构建的高质量数据集显著提升了视觉语言模型对用户界面的理解与定位能力,并展现出强大的扩展潜力。
English: The AutoGUI pipeline automates large-scale UI element annotation with detailed functionality descriptions using LLMs, creating a high-quality dataset that significantly enhances vision-language models' UI understanding and grounding capabilities.

Authors:Raja Marjieh, Veniamin Veselovsky, Thomas L. Griffiths, Ilia Sucholutsky
Title: What is a Number, That a Large Language Model May Know It?
Abstract:
Numbers are a basic part of how humans represent and describe the world around them. As a consequence, learning effective representations of numbers is critical for the success of large language models as they become more integrated into everyday decisions. However, these models face a challenge: depending on context, the same sequence of digit tokens, e.g., 911, can be treated as a number or as a string. What kind of representations arise from this duality, and what are its downstream implications? Using a similarity-based prompting technique from cognitive science, we show that LLMs learn representational spaces that blend string-like and numerical representations. In particular, we show that elicited similarity judgments from these models over integer pairs can be captured by a combination of Levenshtein edit distance and numerical Log-Linear distance, suggesting an entangled representation. In a series of experiments we show how this entanglement is reflected in the latent embeddings, how it can be reduced but not entirely eliminated by context, and how it can propagate into a realistic decision scenario. These results shed light on a representational tension in transformer models that must learn what a number is from text input.
中文:大型语言模型形成了数字的纠缠表征,融合了字符串与数值特性,这种特性虽可通过上下文部分减轻,但仍会持续影响实际决策场景。
English: Large language models develop entangled representations of numbers that blend string-like and numerical properties, which can be partially reduced by context but persist in decision-making scenarios.

Authors:Junghun Lee, Hyunju Kim, Fanchen Bu, Jihoon Ko, Kijung Shin
Title: DiffIM: Differentiable Influence Minimization with Surrogate Modeling and Continuous Relaxation
Abstract:
In social networks, people influence each other through social links, which can be represented as propagation among nodes in graphs. Influence minimization (IMIN) is the problem of manipulating the structures of an input graph (e.g., removing edges) to reduce the propagation among nodes. IMIN can represent time-critical real-world applications, such as rumor blocking, but IMIN is theoretically difficult and computationally expensive. Moreover, the discrete nature of IMIN hinders the usage of powerful machine learning techniques, which requires differentiable computation. In this work, we propose DiffIM, a novel method for IMIN with two differentiable schemes for acceleration: (1) surrogate modeling for efficient influence estimation, which avoids time-consuming simulations (e.g., Monte Carlo), and (2) the continuous relaxation of decisions, which avoids the evaluation of individual discrete decisions (e.g., removing an edge). We further propose a third accelerating scheme, gradient-driven selection, that chooses edges instantly based on gradients without optimization (spec., gradient descent iterations) on each test instance. Through extensive experiments on real-world graphs, we show that each proposed scheme significantly improves speed with little (or even no) IMIN performance degradation. Our method is Pareto-optimal (i.e., no baseline is faster and more effective than it) and typically several orders of magnitude (spec., up to 15,160X) faster than the most effective baseline while being more effective.
Chinese: DiffIM 提出了三种可微加速方案——替代建模、连续松弛和梯度驱动选择,以高效减少社交网络中的影响传播,在保持高性能的同时,相比现有方法速度提升最高达15,160倍。
English: DiffIM introduces three differentiable acceleration schemes—surrogate modeling, continuous relaxation, and gradient-driven selection—to efficiently minimize influence propagation in social networks while maintaining high performance and achieving up to 15,160X speedup over existing methods.

Authors:Chi Zhou, Wang Luo, Haoran Li, Congying Han, Tiande Guo, Zicheng Zhang
Title: Dual Alignment Maximin Optimization for Offline Model-based RL
Abstract:
Offline reinforcement learning agents face significant deployment challenges due to the synthetic-to-real distribution mismatch. While most prior research has focused on improving the fidelity of synthetic sampling and incorporating off-policy mechanisms, the directly integrated paradigm often fails to ensure consistent policy behavior in biased models and underlying environmental dynamics, which inherently arise from discrepancies between behavior and learning policies. In this paper, we first shift the focus from model reliability to policy discrepancies while optimizing for expected returns, and then self-consistently incorporate synthetic data, deriving a novel actor-critic paradigm, Dual Alignment Maximin Optimization (DAMO). It is a unified framework to ensure both model-environment policy consistency and synthetic and offline data compatibility. The inner minimization performs dual conservative value estimation, aligning policies and trajectories to avoid out-of-distribution states and actions, while the outer maximization ensures that policy improvements remain consistent with inner value estimates. Empirical evaluations demonstrate that DAMO effectively ensures model and policy alignments, achieving competitive performance across diverse benchmark tasks.
中文摘要:离线强化学习面临合成与真实数据分布不匹配的挑战,因此提出DAMO框架,通过双重保守价值估计与策略优化实现策略与数据对齐,确保模型性能一致性。
English Summary: Offline reinforcement learning struggles with synthetic-to-real mismatches, so the DAMO framework is introduced to align policies and data through dual conservative value estimation and policy improvement for consistent performance.

Authors:Chi Zhou, Wang Luo, Haoran Li, Congying Han, Tiande Guo, Zicheng Zhang
Title: Dual Alignment Maximin Optimization for Offline Model-based RL
Abstract:
Offline reinforcement learning agents face significant deployment challenges due to the synthetic-to-real distribution mismatch. While most prior research has focused on improving the fidelity of synthetic sampling and incorporating off-policy mechanisms, the directly integrated paradigm often fails to ensure consistent policy behavior in biased models and underlying environmental dynamics, which inherently arise from discrepancies between behavior and learning policies. In this paper, we first shift the focus from model reliability to policy discrepancies while optimizing for expected returns, and then self-consistently incorporate synthetic data, deriving a novel actor-critic paradigm, Dual Alignment Maximin Optimization (DAMO). It is a unified framework to ensure both model-environment policy consistency and synthetic and offline data compatibility. The inner minimization performs dual conservative value estimation, aligning policies and trajectories to avoid out-of-distribution states and actions, while the outer maximization ensures that policy improvements remain consistent with inner value estimates. Empirical evaluations demonstrate that DAMO effectively ensures model and policy alignments, achieving competitive performance across diverse benchmark tasks.
中文摘要:离线强化学习面临合成与真实数据分布不匹配的挑战,因此提出DAMO框架,通过双重保守价值估计与策略优化实现策略与数据对齐,确保模型性能一致性。
English Summary: Offline reinforcement learning struggles with synthetic-to-real mismatches, so the DAMO framework is introduced to align policies and data through dual conservative value estimation and policy improvement for consistent performance.

Authors:Jingqiu Zhou, Lue Fan, Linjiang Huang, Xiaoyu Shi, Si Liu, Zhaoxiang Zhang, Hongsheng Li
Title: FlexDrive: Toward Trajectory Flexibility in Driving Scene Reconstruction and Rendering
Abstract:
Driving scene reconstruction and rendering have advanced significantly using the 3D Gaussian Splatting. However, most prior research has focused on the rendering quality along a pre-recorded vehicle path and struggles to generalize to out-of-path viewpoints, which is caused by the lack of high-quality supervision in those out-of-path views. To address this issue, we introduce an Inverse View Warping technique to create compact and high-quality images as supervision for the reconstruction of the out-of-path views, enabling high-quality rendering results for those views. For accurate and robust inverse view warping, a depth bootstrap strategy is proposed to obtain on-the-fly dense depth maps during the optimization process, overcoming the sparsity and incompleteness of LiDAR depth data. Our method achieves superior in-path and out-of-path reconstruction and rendering performance on the widely used Waymo Open dataset. In addition, a simulator-based benchmark is proposed to obtain the out-of-path ground truth and quantitatively evaluate the performance of out-of-path rendering, where our method outperforms previous methods by a significant margin.
Chinese: 我们的方法引入了逆视图扭曲技术和深度引导策略,为路径外视角生成高质量监督数据,在Waymo开放数据集上实现了路径内外场景的卓越重建与渲染效果。
English: Our method introduces Inverse View Warping with a depth bootstrap strategy to generate high-quality supervision for out-of-path views, achieving superior reconstruction and rendering performance on both in-path and out-of-path scenarios in the Waymo Open dataset.

Authors:Xiao Wang, Jingyun Hua, Weihong Lin, Yuanxing Zhang, Fuzheng Zhang, Jianlong Wu, Di Zhang, Liqiang Nie
Title: HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
Abstract:
Recent Multi-modal Large Language Models (MLLMs) have made great progress in video understanding. However, their performance on videos involving human actions is still limited by the lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to accumulate videos featuring clear human actions from the Internet. Second, videos are annotated in a standardized caption format that uses human attributes to distinguish individuals and chronologically details their actions and interactions. Through this pipeline, we curate two datasets, namely HAICTrain and HAICBench. \textbf{HAICTrain} comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. Meanwhile, \textbf{HAICBench} includes 412 manually annotated video-caption pairs and 2,000 QA pairs, for a comprehensive evaluation of human action understanding. Experimental results demonstrate that training with HAICTrain not only significantly enhances human understanding abilities across 4 benchmarks, but can also improve text-to-video generation results. Both the HAICTrain and HAICBench are released at https://huggingface.co/datasets/KuaishouHAIC/HAIC.
Chinese: 本研究通过两阶段数据标注流程构建了HAICTrain和HAICBench数据集,提供高质量视频-字幕对,显著提升了多模态大语言模型在人类动作理解和文本到视频生成方面的性能。
English: This study introduces a two-stage data annotation pipeline to create HAICTrain and HAICBench datasets, enhancing Multi-modal Large Language Models' performance in human action understanding and text-to-video generation by providing high-quality video-caption pairs.

Authors:Gianluca Cena, Lucia Seno, Stefano Scanzio
Title: Robust Multicast Origin Authentication in MACsec and CANsec for Automotive Scenarios
Abstract:
Having everything interconnected through the Internet, including vehicle onboard systems, is making security a primary concern in the automotive domain as well. Although Ethernet and CAN XL provide link-level security based on symmetric cryptography, they do not support origin authentication for multicast transmissions. Asymmetric cryptography is unsuitable for networked embedded control systems with real-time constraints and limited computational resources. In these cases, solutions derived from the TESLA broadcast authentication protocol may constitute a more suitable option. In this paper, some such strategies are presented and analyzed that allow for multicast origin authentication, also improving robustness to frame losses by means of interleaved keychains. A flexible authentication mechanism that relies on a unified receiver is then proposed, which enables transmitters to select strategies at runtime, to achieve the best compromise among security, reliability, and resource consumption.
中文摘要:本文针对汽车网络中的组播源认证难题,提出基于TESLA协议的灵活策略,通过交错密钥链提升抗丢包能力,并采用统一接收器机制实现安全性与资源消耗的动态优化平衡。
English Summary: The paper addresses the challenge of multicast origin authentication in automotive networks by proposing flexible strategies derived from TESLA, utilizing interleaved keychains for enhanced robustness and a unified receiver mechanism for optimal security and resource balance.

Authors:Toru Lin, Kartik Sachdev, Linxi Fan, Jitendra Malik, Yuke Zhu
Title: Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids
Abstract:
Learning generalizable robot manipulation policies, especially for complex multi-fingered humanoids, remains a significant challenge. Existing approaches primarily rely on extensive data collection and imitation learning, which are expensive, labor-intensive, and difficult to scale. Sim-to-real reinforcement learning (RL) offers a promising alternative, but has mostly succeeded in simpler state-based or single-hand setups. How to effectively extend this to vision-based, contact-rich bimanual manipulation tasks remains an open question. In this paper, we introduce a practical sim-to-real RL recipe that trains a humanoid robot to perform three challenging dexterous manipulation tasks: grasp-and-reach, box lift and bimanual handover. Our method features an automated real-to-sim tuning module, a generalized reward formulation based on contact and object goals, a divide-and-conquer policy distillation framework, and a hybrid object representation strategy with modality-specific augmentation. We demonstrate high success rates on unseen objects and robust, adaptive policy behaviors -- highlighting that vision-based dexterous manipulation via sim-to-real RL is not only viable, but also scalable and broadly applicable to real-world humanoid manipulation tasks.
中文: 本文提出了一种实用的仿真到现实强化学习方法,使人形机器人能够执行复杂的基于视觉的双臂操作任务,在面对未见物体时表现出高成功率和强大的适应能力。
English: This paper presents a practical sim-to-real reinforcement learning method that enables humanoid robots to perform complex vision-based bimanual manipulation tasks with high success rates and robust adaptability to unseen objects.

Authors:Yukang Yang, Declan Campbell, Kaixuan Huang, Mengdi Wang, Jonathan Cohen, Taylor Webb
Title: Emergent Symbolic Mechanisms Support Abstract Reasoning in Large Language Models
Abstract:
Many recent studies have found evidence for emergent reasoning capabilities in large language models (LLMs), but debate persists concerning the robustness of these capabilities, and the extent to which they depend on structured reasoning mechanisms. To shed light on these issues, we study the internal mechanisms that support abstract reasoning in LLMs. We identify an emergent symbolic architecture that implements abstract reasoning via a series of three computations. In early layers, symbol abstraction heads convert input tokens to abstract variables based on the relations between those tokens. In intermediate layers, symbolic induction heads perform sequence induction over these abstract variables. Finally, in later layers, retrieval heads predict the next token by retrieving the value associated with the predicted abstract variable. These results point toward a resolution of the longstanding debate between symbolic and neural network approaches, suggesting that emergent reasoning in neural networks depends on the emergence of symbolic mechanisms.
中文摘要:最新研究表明,大型语言模型通过符号抽象、符号归纳和值检索三个计算阶段,形成了实现抽象推理的涌现符号机制,为神经与符号人工智能方法的融合提供了新证据。
English summary: Recent research reveals that large language models develop emergent symbolic mechanisms for abstract reasoning through three computational stages—symbol abstraction, symbolic induction, and value retrieval—bridging the gap between neural and symbolic AI approaches.

Authors:Mengjie Xu, Yitao Zhu, Haotian Jiang, Jiaming Li, Zhenrong Shen, Sheng Wang, Haolin Huang, Xinyu Wang, Qing Yang, Han Zhang, Qian Wang
Title: MITracker: Multi-View Integration for Visual Object Tracking
Abstract:
Multi-view object tracking (MVOT) offers promising solutions to challenges such as occlusion and target loss, which are common in traditional single-view tracking. However, progress has been limited by the lack of comprehensive multi-view datasets and effective cross-view integration methods. To overcome these limitations, we compiled a Multi-View object Tracking (MVTrack) dataset of 234K high-quality annotated frames featuring 27 distinct objects across various scenes. In conjunction with this dataset, we introduce a novel MVOT method, Multi-View Integration Tracker (MITracker), to efficiently integrate multi-view object features and provide stable tracking outcomes. MITracker can track any object in video frames of arbitrary length from arbitrary viewpoints. The key advancements of our method over traditional single-view approaches come from two aspects: (1) MITracker transforms 2D image features into a 3D feature volume and compresses it into a bird's eye view (BEV) plane, facilitating inter-view information fusion; (2) we propose an attention mechanism that leverages geometric information from fused 3D feature volume to refine the tracking results at each view. MITracker outperforms existing methods on the MVTrack and GMTD datasets, achieving state-of-the-art performance. The code and the new dataset will be available at https://mii-laboratory.github.io/MITracker/.
中文: 作者提出了MVTrack数据集和MITracker方法,通过将2D特征转换为3D特征体并利用注意力机制整合多视角信息,在多目标跟踪任务中实现了最先进的性能表现。
English: The authors introduce the MVTrack dataset and MITracker, a novel multi-view object tracking method that transforms 2D features into 3D volumes and uses an attention mechanism to achieve state-of-the-art performance by effectively integrating cross-view information.

Authors:Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, Xin Liu
Title: Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
Abstract:
Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. The development of large MoE models in the distributed scenario encounters the problem of large communication overhead. The inter-device communication of a MoE layer can occupy 47% time of the entire model execution with popular models and frameworks. Therefore, existing methods suggest the communication in a MoE layer to be pipelined with the computation for overlapping. However, these coarse grained overlapping schemes introduce a notable impairment of computational efficiency and the latency concealing is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by $1.96\times$ and for end-to-end execution, COMET delivers a $1.71\times$ speedup on average. COMET has been adopted in the production environment of clusters with ten-thousand-scale of GPUs, achieving savings of millions of GPU hours.
中文:COMET是一种优化的专家混合系统,通过细粒度通信计算重叠技术有效消除通信瓶颈,在MoE层执行中实现最高1.96倍加速,并在万级GPU集群的生产环境中节省了数百万GPU小时。
English: COMET is an optimized Mixture-of-Experts system that uses fine-grained communication-computation overlapping to significantly reduce communication bottlenecks, achieving up to 1.96× speedup in MoE layer execution and saving millions of GPU hours in large-scale production environments.

Authors:Sungduk Yu, Man Luo, Avinash Madusu, Vasudev Lal, Phillip Howard
Title: Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review
Abstract:
Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts which are submitted for publication. With the recent rapid advancements in large language models (LLMs), a new risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time consuming process of reviewing a paper. However, there is a lack of existing resources for benchmarking the detectability of AI text in the domain of peer review. To address this deficiency, we introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews, covering 8 years of papers submitted to each of two leading AI research conferences (ICLR and NeurIPS). We use this new resource to evaluate the ability of 18 existing AI text detection algorithms to distinguish between peer reviews fully written by humans and different state-of-the-art LLMs. Additionally, we explore a context-aware detection method called Anchor, which leverages manuscript content to detect AI-generated reviews, and analyze the sensitivity of detection models to LLM-assisted editing of human-written text. Our work reveals the difficulty of identifying AI-generated text at the individual peer review level, highlighting the urgent need for new tools and methods to detect this unethical use of generative AI. Our dataset is publicly available at: https://huggingface.co/datasets/IntelLabs/AI-Peer-Review-Detection-Benchmark.
中文摘要:本研究引入了一个全面的数据集,用于评估同行评审中AI文本的检测能力,揭示了识别AI生成内容的困难,并强调迫切需要开发新工具来防范生成式AI在科学评估中的不道德使用。
English Summary: The study introduces a comprehensive dataset to benchmark AI text detection in peer reviews, revealing the difficulty of identifying AI-generated content and underscoring the urgent need for improved detection tools against unethical AI use in scientific evaluation.

Authors:Wei Xiong, Hanning Zhang, Chenlu Ye, Lichang Chen, Nan Jiang, Tong Zhang
Title: Self-rewarding correction for mathematical reasoning
Abstract:
We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs during the inference time-without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-staged algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models' ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.
中文: 本研究提出自奖励推理大模型,通过仅使用自生成数据的双阶段算法框架,使模型能够自主生成推理步骤并进行评估,在不依赖外部反馈的情况下实现了与外部奖励系统相当的性能。
English: This research introduces self-rewarding reasoning LLMs that autonomously generate and evaluate reasoning steps through a two-stage framework using self-generated data, achieving performance comparable to external reward systems without requiring external feedback.

Authors:Cornelius Emde, Alasdair Paren, Preetham Arvind, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Bernard Ghanem, Philip H. S. Torr, Adel Bibi
Title: Shh, don't say that! Domain Certification in LLMs
Abstract:
Large language models (LLMs) are often deployed to perform constrained tasks, with narrow domains. For example, customer support bots can be built on top of LLMs, relying on their broad language understanding and capabilities to enhance performance. However, these LLMs are adversarially susceptible, potentially generating outputs outside the intended domain. To formalize, assess, and mitigate this risk, we introduce domain certification; a guarantee that accurately characterizes the out-of-domain behavior of language models. We then propose a simple yet effective approach, which we call VALID that provides adversarial bounds as a certificate. Finally, we evaluate our method across a diverse set of datasets, demonstrating that it yields meaningful certificates, which bound the probability of out-of-domain samples tightly with minimum penalty to refusal behavior.
中文: 大型语言模型存在生成超域输出的对抗性风险,因此引入领域认证和VALID方法,在保持拒绝行为的同时提供紧密的对抗性边界保证。
English: Large language models face adversarial risks of generating out-of-domain outputs, prompting the introduction of domain certification and the VALID method to provide tight adversarial bounds while preserving refusal behavior.

Authors:Qizhou Wang, Jin Peng Zhou, Zhanke Zhou, Saebyeol Shin, Bo Han, Kilian Q. Weinberger
Title: Rethinking LLM Unlearning Objectives: A Gradient Perspective and Go Beyond
Abstract:
Large language models (LLMs) should undergo rigorous audits to identify potential risks, such as copyright and privacy infringements. Once these risks emerge, timely updates are crucial to remove undesirable responses, ensuring legal and safe model usage. It has spurred recent research into LLM unlearning, focusing on erasing targeted undesirable knowledge without compromising the integrity of other, non-targeted responses. Existing studies have introduced various unlearning objectives to pursue LLM unlearning without necessitating complete retraining. However, each of these objectives has unique properties, and no unified framework is currently available to comprehend them thoroughly. To fill the gap, we propose a toolkit of the gradient effect (G-effect), quantifying the impacts of unlearning objectives on model performance from a gradient perspective. A notable advantage is its broad ability to detail the unlearning impacts from various aspects across instances, updating steps, and LLM layers. Accordingly, the G-effect offers new insights into identifying drawbacks of existing unlearning objectives, further motivating us to explore a series of new solutions for their mitigation and improvements. Finally, we outline promising directions that merit further studies, aiming at contributing to the community to advance this important field.
中文: 该摘要提出了一种梯度效应工具包,用于系统评估和改进大语言模型的遗忘方法,解决了当前缺乏统一框架来衡量不同遗忘目标对模型性能影响的问题,并为推进更安全的AI技术提出了新方向。
English: The abstract introduces a gradient effect toolkit to systematically evaluate and improve large language model unlearning methods, addressing the lack of a unified framework for assessing how different unlearning objectives impact model performance and proposing new solutions for safer AI development.

Authors:Qiancheng Xu, Yongqi Li, Heming Xia, Fan Liu, Min Yang, Wenjie Li
Title: PEToolLLM: Towards Personalized Tool Learning in Large Language Models
Abstract:
Tool learning has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external tools. Existing tool learning studies primarily focus on the general-purpose tool-use capability, which addresses explicit user requirements in instructions. However, they overlook the importance of personalized tool-use capability, leading to an inability to handle implicit user preferences. To address the limitation, we first formulate the task of personalized tool learning, which integrates user's interaction history towards personalized tool usage. To fill the gap of missing benchmarks, we construct PEToolBench, featuring diverse user preferences reflected in interaction history under three distinct personalized settings, and encompassing a wide range of tool-use scenarios. Moreover, we propose a framework PEToolLLaMA to adapt LLMs to the personalized tool learning task, which is trained through supervised fine-tuning and direct preference optimization. Extensive experiments on PEToolBench demonstrate the superiority of PEToolLLaMA over existing LLMs.
中文: 现有工具学习主要关注通用能力而忽视个性化需求,为此我们构建了PEToolBench基准并提出了PEToolLLaMA框架,通过定制化训练实现用户偏好的工具调用。
English: Current tool learning for LLMs focuses on general-purpose capabilities but neglects personalized user preferences, prompting the development of PEToolBench and PEToolLLaMA to address this gap through tailored frameworks and benchmarks.

Authors:Zhengmian Hu, Tong Zheng, Vignesh Viswanathan, Ziyi Chen, Ryan A. Rossi, Yihan Wu, Dinesh Manocha, Heng Huang
Title: Towards Optimal Multi-draft Speculative Decoding
Abstract:
Large Language Models (LLMs) have become an indispensable part of natural language processing tasks. However, autoregressive sampling has become an efficiency bottleneck. Multi-Draft Speculative Decoding (MDSD) is a recent approach where, when generating each token, a small draft model generates multiple drafts, and the target LLM verifies them in parallel, ensuring that the final output conforms to the target model distribution. The two main design choices in MDSD are the draft sampling method and the verification algorithm. For a fixed draft sampling method, the optimal acceptance rate is a solution to an optimal transport problem, but the complexity of this problem makes it difficult to solve for the optimal acceptance rate and measure the gap between existing verification algorithms and the theoretical upper bound. This paper discusses the dual of the optimal transport problem, providing a way to efficiently compute the optimal acceptance rate. For the first time, we measure the theoretical upper bound of MDSD efficiency for vocabulary sizes in the thousands and quantify the gap between existing verification algorithms and this bound. We also compare different draft sampling methods based on their optimal acceptance rates. Our results show that the draft sampling method strongly influences the optimal acceptance rate, with sampling without replacement outperforming sampling with replacement. Additionally, existing verification algorithms do not reach the theoretical upper bound for both without replacement and with replacement sampling. Our findings suggest that carefully designed draft sampling methods can potentially improve the optimal acceptance rate and enable the development of verification algorithms that closely match the theoretical upper bound.
中文摘要: 本文采用对偶方法高效计算多草稿推测解码的最优接受率,发现无放回采样方式效果更优且现有验证算法尚未达到理论效率上限。
English Summary: This paper introduces a dual approach to efficiently compute the optimal acceptance rate for Multi-Draft Speculative Decoding, revealing that draft sampling without replacement achieves higher efficiency and current verification algorithms still fall short of the theoretical upper bound.

Authors:Ali Gholami, Tayyebeh Jahani-Nezhad, Kai Wan, Giuseppe Caire
Title: Optimal Communication-Computation Trade-off in Hierarchical Gradient Coding
Abstract:
In this paper, we study gradient coding in a hierarchical setting, where there are intermediate nodes between the server and the workers. This structure reduces the bandwidth requirements at the server, which is a bottleneck in conventional gradient coding systems. In this paper, the intermediate nodes, referred to as $\textit{relays}$, process the data received from workers and send the results to the server for the final gradient computation. Our main contribution is deriving the optimal communication-computation trade-off by designing a linear coding scheme inspired by coded computing techniques, considering straggling and adversarial nodes among both relays and workers. The processing of the data in the relays makes it possible to achieve both the relay-to-server and the worker-to-relay communication loads simultaneously optimal with regard to the computation load.
中文摘要:本文提出了一种带中间中继的分层梯度编码框架,通过线性编码方案实现了最优通信-计算权衡,同时使中继至服务器和工作者至中继的通信负载均达到最优。
English summary: This paper introduces a hierarchical gradient coding framework with intermediate relays that optimize bandwidth usage and derives an optimal communication-computation trade-off using linear coding to handle stragglers and adversarial nodes.

Authors:Tianyang Xu, Jiyong Rao, Xiaoning Song, Zhenhua Feng, Xiao-Jun Wu
Title: Learning Structure-Supporting Dependencies via Keypoint Interactive Transformer for General Mammal Pose Estimation
Abstract:
General mammal pose estimation is an important and challenging task in computer vision, which is essential for understanding mammal behaviour in real-world applications. However, existing studies are at their preliminary research stage, which focus on addressing the problem for only a few specific mammal species. In principle, from specific to general mammal pose estimation, the biggest issue is how to address the huge appearance and pose variances for different species. We argue that given appearance context, instance-level prior and the structural relation among keypoints can serve as complementary evidence. To this end, we propose a Keypoint Interactive Transformer (KIT) to learn instance-level structure-supporting dependencies for general mammal pose estimation. Specifically, our KITPose consists of two coupled components. The first component is to extract keypoint features and generate body part prompts. The features are supervised by a dedicated generalised heatmap regression loss (GHRL). Instead of introducing external visual/text prompts, we devise keypoints clustering to generate body part biases, aligning them with image context to generate corresponding instance-level prompts. Second, we propose a novel interactive transformer that takes feature slices as input tokens without performing spatial splitting. In addition, to enhance the capability of the KIT model, we design an adaptive weight strategy to address the imbalance issue among different keypoints.
中文: 本文提出KITPose框架,通过关键点交互式变压器学习实例级结构依赖关系并整合身体部位提示,解决了跨物种姿态估计中外观差异的难题,突破了现有方法局限于特定物种的瓶颈。
English: This paper introduces KITPose, a novel framework utilizing a Keypoint Interactive Transformer to address general mammal pose estimation by learning instance-level structural dependencies and incorporating body part prompts, overcoming the limitations of species-specific approaches.

Authors:Zheling Meng, Bo Peng, Xiaochuan Jin, Yueming Lyu, Wei Wang, Jing Dong, Tieniu Tan
Title: Concept Corrector: Erase concepts on the fly for text-to-image diffusion models
Abstract:
Text-to-image diffusion models have demonstrated the underlying risk of generating various unwanted content, such as sexual elements. To address this issue, the task of concept erasure has been introduced, aiming to erase any undesired concepts that the models can generate. Previous methods, whether training-based or training-free, have primarily focused on the input side, i.e., texts. However, they often suffer from incomplete erasure due to limitations in the generalization from limited prompts to diverse image content. In this paper, motivated by the notion that concept erasure on the output side, i.e., generated images, may be more direct and effective, we propose Concept Corrector. It checks target concepts based on visual features provided by final generated images predicted at certain time steps. Further, it incorporates Concept Removal Attention to erase generated concept features. It overcomes the limitations of existing methods, which are either unable to remove the concept features that have been generated in images or rely on the assumption that the related concept words are contained in input prompts. In the whole pipeline, our method changes no model parameters and only requires a given target concept as well as the corresponding replacement content, which is easy to implement. To the best of our knowledge, this is the first erasure method based on intermediate-generated images, achieving the ability to erase concepts on the fly. The experiments on various concepts demonstrate its impressive erasure performance.
中文摘要:本文提出Concept Corrector方法,通过分析生成图像的视觉特征并采用概念移除注意力机制,在不修改模型参数的情况下实时有效消除生成图像中的不良概念。
English Summary: This paper introduces Concept Corrector, a training-free method that erases unwanted concepts from generated images by analyzing visual features and using Concept Removal Attention, achieving effective real-time concept removal without modifying model parameters.

Authors:Xianghong Xu, Tieying Zhang, Xiao He, Haoyang Li, Rong Kang, Shuai Wang, Linhui Xu, Zhimin Liang, Shangyu Luo, Lei Zhang, Jianjun Chen
Title: AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators
Abstract:
Estimating the Number of Distinct Values (NDV) is fundamental for numerous data management tasks, especially within database applications. However, most existing works primarily focus on introducing new statistical or learned estimators, while identifying the most suitable estimator for a given scenario remains largely unexplored. Therefore, we propose AdaNDV, a learned method designed to adaptively select and fuse existing estimators to address this issue. Specifically, (1) we propose to use learned models to distinguish between overestimated and underestimated estimators and then select appropriate estimators from each category. This strategy provides a complementary perspective by integrating overestimations and underestimations for error correction, thereby improving the accuracy of NDV estimation. (2) To further integrate the estimation results, we introduce a novel fusion approach that employs a learned model to predict the weights of the selected estimators and then applies a weighted sum to merge them. By combining these strategies, the proposed AdaNDV fundamentally distinguishes itself from previous works that directly estimate NDV. Moreover, extensive experiments conducted on real-world datasets, with the number of individual columns being several orders of magnitude larger than in previous studies, demonstrate the superior performance of our method.
中文: AdaNDV是一种创新的学习方法,通过智能选择并融合现有估值器,利用高估与低估的互补性进行误差修正,在大规模真实数据集上实现了更优的估值精度。
English: AdaNDV is a novel learned method that adaptively selects and fuses existing distinct value estimators by leveraging overestimation and underestimation for error correction, achieving superior accuracy on large-scale real-world datasets.

Authors:Keane Ong, Rui Mao, Deeksha Varshney, Erik Cambria, Gianmarco Mengaldo
Title: Towards Robust ESG Analysis Against Greenwashing Risks: Aspect-Action Analysis with Cross-Category Generalization
Abstract:
Sustainability reports are key for evaluating companies' environmental, social and governance, ESG performance, but their content is increasingly obscured by greenwashing - sustainability claims that are misleading, exaggerated, and fabricated. Yet, existing NLP approaches for ESG analysis lack robustness against greenwashing risks, often extracting insights that reflect misleading or exaggerated sustainability claims rather than objective ESG performance. To bridge this gap, we introduce A3CG - Aspect-Action Analysis with Cross-Category Generalization, as a novel dataset to improve the robustness of ESG analysis amid the prevalence of greenwashing. By explicitly linking sustainability aspects with their associated actions, A3CG facilitates a more fine-grained and transparent evaluation of sustainability claims, ensuring that insights are grounded in verifiable actions rather than vague or misleading rhetoric. Additionally, A3CG emphasizes cross-category generalization. This ensures robust model performance in aspect-action analysis even when companies change their reports to selectively favor certain sustainability areas. Through experiments on A3CG, we analyze state-of-the-art supervised models and LLMs, uncovering their limitations and outlining key directions for future research.
中文: 可持续发展报告常被"漂绿"行为扭曲内容,而现有ESG分析的NLP方法缺乏稳健性,为此推出A3CG数据集,通过关联可持续性方面与具体行动,提升评估的透明度与跨类别泛化能力。
English: Sustainability reports are often distorted by greenwashing, but current NLP methods for ESG analysis lack robustness, prompting the introduction of A3CG dataset to enhance transparency and cross-category generalization in evaluating corporate sustainability claims.

Authors:Yang Yao, Xuan Tong, Ruofan Wang, Yixu Wang, Lujundong Li, Liang Liu, Yan Teng, Yingchun Wang
Title: A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos
Abstract:
Large Reasoning Models (LRMs) have significantly advanced beyond traditional Large Language Models (LLMs) with their exceptional logical reasoning capabilities, yet these improvements introduce heightened safety risks. When subjected to jailbreak attacks, their ability to generate more targeted and organized content can lead to greater harm. Although some studies claim that reasoning enables safer LRMs against existing LLM attacks, they overlook the inherent flaws within the reasoning process itself. To address this gap, we propose the first jailbreak attack targeting LRMs, exploiting their unique vulnerabilities stemming from the advanced reasoning capabilities. Specifically, we introduce a Chaos Machine, a novel component to transform attack prompts with diverse one-to-one mappings. The chaos mappings iteratively generated by the machine are embedded into the reasoning chain, which strengthens the variability and complexity and also promotes a more robust attack. Based on this, we construct the Mousetrap framework, which makes attacks projected into nonlinear-like low sample spaces with mismatched generalization enhanced. Also, due to the more competing objectives, LRMs gradually maintain the inertia of unpredictable iterative reasoning and fall into our trap. Success rates of the Mousetrap attacking o1-mini, Claude-Sonnet and Gemini-Thinking are as high as 96%, 86% and 98% respectively on our toxic dataset Trotter. On benchmarks such as AdvBench, StrongREJECT, and HarmBench, attacking Claude-Sonnet, well-known for its safety, Mousetrap can astonishingly achieve success rates of 87.5%, 86.58% and 93.13% respectively. Attention: This paper contains inappropriate, offensive and harmful content.
Chinese: 大型推理模型(LRMs)相比传统大语言模型具有更强的逻辑推理能力,但也带来更高的安全风险,通过利用其推理过程中的漏洞,新型的混沌机器和捕鼠器框架能够实现高效的越狱攻击,在多个模型和基准测试中取得惊人成功率。
English: Large Reasoning Models (LRMs) pose greater safety risks than traditional LLMs due to their enhanced logical reasoning, which can be exploited through jailbreak attacks using the novel Chaos Machine and Mousetrap framework to achieve high success rates across multiple models and benchmarks.

Authors:Hao Yi, Qingyang Li, Yulan Hu, Fuzheng Zhang, Di Zhang, Yong Liu
Title: SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin
Abstract:
Recently, enhancing the numerical and logical reasoning capability of Large Language Models (LLMs) has emerged as a research hotspot. Existing methods face several limitations: inference-phase techniques (e.g., Chain of Thoughts) rely on prompt selection and the pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle with step-wise mathematical correctness and depend on stronger models distillation or human annotations; while Reinforcement Learning (RL) approaches incur high GPU memory costs and unstable training. To address these, we propose \textbf{S}elf-training framework integrating \textbf{P}rocess \textbf{P}reference learning using \textbf{D}ynamic value margin (SPPD). SPPD leverages a process-based Markov Decision Process (MDP) and Bellman optimality equation to derive \textbf{dynamic value margin} on step-level preference optimization, which employs tree-based self-sampling on model responses \textbf{without any distillation} from other models. Furthermore, we theoretically prove that SPPD is \textbf{equivalent to on-policy policy gradient methods} under reward constraints. Experiments on 7B-scale models demonstrate superior performance across in-domain and out-domain mathematical benchmarks. We open-source our code at \href{https://anonymous.4open.science/r/SSDPO-D-DCDD}{https://anonymous.4open.science/r/SPPD-DCDD}.
中文: 本文提出SPPD自训练框架,通过动态价值边际的过程偏好学习增强大语言模型的数值推理能力,无需外部模型蒸馏即可在数学基准测试中取得优越性能。
English: This paper introduces SPPD, a self-training framework that enhances LLMs' numerical reasoning through process preference learning with dynamic value margins, achieving superior results without external model distillation.

Authors:Liangqi Lei, Keke Gai, Jing Yu, Liehuang Zhu, Qi Wu
Title: Secure and Efficient Watermarking for Latent Diffusion Models in Model Distribution Scenarios
Abstract:
Latent diffusion models have exhibited considerable potential in generative tasks. Watermarking is considered to be an alternative to safeguard the copyright of generative models and prevent their misuse. However, in the context of model distribution scenarios, the accessibility of models to large scale of model users brings new challenges to the security, efficiency and robustness of existing watermark solutions. To address these issues, we propose a secure and efficient watermarking solution. A new security mechanism is designed to prevent watermark leakage and watermark escape, which considers watermark randomness and watermark-model association as two constraints for mandatory watermark injection. To reduce the time cost of training the security module, watermark injection and the security mechanism are decoupled, ensuring that fine-tuning VAE only accomplishes the security mechanism without the burden of learning watermark patterns. A watermark distribution-based verification strategy is proposed to enhance the robustness against diverse attacks in the model distribution scenarios. Experimental results prove that our watermarking consistently outperforms existing six baselines on effectiveness and robustness against ten image processing attacks and adversarial attacks, while enhancing security in the distribution scenarios.
中文总结:本研究针对潜在扩散模型提出了一种安全高效的水印方案,通过强制水印注入约束防止水印泄露和逃逸,并采用基于分布验证的策略增强模型分发场景下抗攻击的鲁棒性。
English Summary: This study introduces a secure and efficient watermarking solution for latent diffusion models that prevents watermark leakage and escape through mandatory injection constraints while enhancing robustness against attacks via distribution-based verification.

Authors:Zheng Yuan, Hao Chen, Zijin Hong, Qinggang Zhang, Feiran Huang, Qing Li, Xiao Huang
Title: Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation
Abstract:
Generating SQLs from user queries is a long-standing challenge, where the accuracy of initial schema linking significantly impacts subsequent SQL generation performance. However, current schema linking models still struggle with missing relevant schema elements or an excess of redundant ones. A crucial reason for this is that commonly used metrics, recall and precision, fail to capture relevant element missing and thus cannot reflect actual schema linking performance. Motivated by this, we propose enhanced schema linking metrics by introducing a restricted missing indicator. Accordingly, we introduce Knapsack optimization-based Schema Linking Approach (KaSLA), a plug-in schema linking method designed to prevent the missing of relevant schema elements while minimizing the inclusion of redundant ones. KaSLA employs a hierarchical linking strategy that first identifies the optimal table linking and subsequently links columns within the selected table to reduce linking candidate space. In each linking process, it utilizes a knapsack optimization approach to link potentially relevant elements while accounting for a limited tolerance of potentially redundant ones. With this optimization, KaSLA-1.6B achieves superior schema linking results compared to large-scale LLMs, including deepseek-v3 with the state-of-the-art (SOTA) schema linking method. Extensive experiments on Spider and BIRD benchmarks verify that KaSLA can significantly improve the SQL generation performance of SOTA Text2SQL models by substituting their schema linking processes.
中文: 该摘要提出了一种名为KaSLA的新模式链接方法,采用背包优化技术来避免遗漏相关数据库元素并减少冗余,在Spider和BIRD基准测试中验证了其能显著提升Text2SQL模型的SQL生成性能。
English: This abstract introduces a new schema linking method called KaSLA that uses knapsack optimization to prevent missing relevant database elements while minimizing redundant ones, significantly improving SQL generation performance in Text2SQL models as validated on Spider and BIRD benchmarks.

Authors:Ruibo Chen, Yihan Wu, Junfeng Guo, Heng Huang
Title: Improved Unbiased Watermark for Large Language Models
Abstract:
As artificial intelligence surpasses human capabilities in text generation, the necessity to authenticate the origins of AI-generated content has become paramount. Unbiased watermarks offer a powerful solution by embedding statistical signals into language model-generated text without distorting the quality. In this paper, we introduce MCmark, a family of unbiased, Multi-Channel-based watermarks. MCmark works by partitioning the model's vocabulary into segments and promoting token probabilities within a selected segment based on a watermark key. We demonstrate that MCmark not only preserves the original distribution of the language model but also offers significant improvements in detectability and robustness over existing unbiased watermarks. Our experiments with widely-used language models demonstrate an improvement in detectability of over 10% using MCmark, compared to existing state-of-the-art unbiased watermarks. This advancement underscores MCmark's potential in enhancing the practical application of watermarking in AI-generated texts.
中文: MCmark提出了一种无偏的多通道水印方法,在保持语言模型质量的同时显著提升了AI生成文本的可检测性和鲁棒性,相比现有技术检测率提高超过10%。
English: MCmark introduces an unbiased, multi-channel watermarking method that enhances detectability and robustness in AI-generated text while preserving language model quality, showing over 10% improvement in detection compared to existing techniques.

Authors:In-Chang Baek, Sung-Hyun Kim, Sam Earle, Zehua Jiang, Noh Jin-Ha, Julian Togelius, Kyung-Joong Kim
Title: PCGRLLM: Large Language Model-Driven Reward Design for Procedural Content Generation Reinforcement Learning
Abstract:
Reward design plays a pivotal role in the training of game AIs, requiring substantial domain-specific knowledge and human effort. In recent years, several studies have explored reward generation for training game agents and controlling robots using large language models (LLMs). In the content generation literature, there has been early work on generating reward functions for reinforcement learning agent generators. This work introduces PCGRLLM, an extended architecture based on earlier work, which employs a feedback mechanism and several reasoning-based prompt engineering techniques. We evaluate the proposed method on a story-to-reward generation task in a two-dimensional environment using two state-of-the-art LLMs, demonstrating the generalizability of our approach. Our experiments provide insightful evaluations that demonstrate the capabilities of LLMs essential for content generation tasks. The results highlight significant performance improvements of 415% and 40% respectively, depending on the zero-shot capabilities of the language model. Our work demonstrates the potential to reduce human dependency in game AI development, while supporting and enhancing creative processes.
中文摘要:本研究提出PCGRLLM架构,利用大型语言模型结合反馈机制和推理提示技术生成游戏AI奖励,性能提升最高达415%,显著降低了开发中对人工的依赖。
English Summary: The study introduces PCGRLLM, an architecture using LLMs with feedback and reasoning techniques to generate rewards for game AIs, achieving up to 415% performance improvement and reducing human effort in development.

Authors:Yandi Liu, Guowei Liu, Le Liang, Hao Ye, Chongtao Guo, Shi Jin
Title: Deep Reinforcement Learning-Based User Scheduling for Collaborative Perception
Abstract:
Stand-alone perception systems in autonomous driving suffer from limited sensing ranges and occlusions at extended distances, potentially resulting in catastrophic outcomes. To address this issue, collaborative perception is envisioned to improve perceptual accuracy by using vehicle-to-everything (V2X) communication to enable collaboration among connected and autonomous vehicles and roadside units. However, due to limited communication resources, it is impractical for all units to transmit sensing data such as point clouds or high-definition video. As a result, it is essential to optimize the scheduling of communication links to ensure efficient spectrum utilization for the exchange of perceptual data. In this work, we propose a deep reinforcement learning-based V2X user scheduling algorithm for collaborative perception. Given the challenges in acquiring perceptual labels, we reformulate the conventional label-dependent objective into a label-free goal, based on characteristics of 3D object detection. Incorporating both channel state information (CSI) and semantic information, we develop a double deep Q-Network (DDQN)-based user scheduling framework for collaborative perception, named SchedCP. Simulation results verify the effectiveness and robustness of SchedCP compared with traditional V2X scheduling methods. Finally, we present a case study to illustrate how our proposed algorithm adaptively modifies the scheduling decisions by taking both instantaneous CSI and perceptual semantics into account.
Chinese: 自动驾驶中的协同感知通过车联网通信解决感知范围受限和遮挡问题,但受限于通信资源,需优化链路调度;为此提出基于深度强化学习的SchedCP算法,融合信道状态与语义信息,实现高效感知数据交互。
English: Collaborative perception in autonomous driving addresses limited sensing and occlusions by using V2X communication, but requires optimized scheduling of communication links due to resource constraints, leading to the development of a deep reinforcement learning-based algorithm called SchedCP that incorporates both channel state and semantic information for efficient data exchange.

Authors:Junfeng Guo, Yiming Li, Ruibo Chen, Yihan Wu, Chenxi Liu, Yanshuo Chen, Heng Huang
Title: Towards Copyright Protection for Knowledge Bases of Retrieval-augmented Language Models via Reasoning
Abstract:
Large language models (LLMs) are increasingly integrated into real-world personalized applications through retrieval-augmented generation (RAG) mechanisms to supplement their responses with domain-specific knowledge. However, the valuable and often proprietary nature of the knowledge bases used in RAG introduces the risk of unauthorized usage by adversaries. Existing methods that can be generalized as watermarking techniques to protect these knowledge bases typically involve poisoning or backdoor attacks. However, these methods require altering the LLM's results of verification samples, inevitably making these watermarks susceptible to anomaly detection and even introducing new security risks. To address these challenges, we propose \name{} for `harmless' copyright protection of knowledge bases. Instead of manipulating LLM's final output, \name{} implants distinct yet benign verification behaviors in the space of chain-of-thought (CoT) reasoning, maintaining the correctness of the final answer. Our method has three main stages: (1) Generating CoTs: For each verification question, we generate two `innocent' CoTs, including a target CoT for building watermark behaviors; (2) Optimizing Watermark Phrases and Target CoTs: Inspired by our theoretical analysis, we optimize them to minimize retrieval errors under the \emph{black-box} and \emph{text-only} setting of suspicious LLM, ensuring that only watermarked verification queries can retrieve their correspondingly target CoTs contained in the knowledge base; (3) Ownership Verification: We exploit a pairwise Wilcoxon test to verify whether a suspicious LLM is augmented with the protected knowledge base by comparing its responses to watermarked and benign verification queries. Our experiments on diverse benchmarks demonstrate that \name{} effectively protects knowledge bases and its resistance to adaptive attacks.
中文摘要:本文提出的\name{}方法通过在思维链推理中植入验证行为而非修改最终输出来保护检索增强生成系统中的专有知识库,实现了版权保护与回答准确性的双重保障。
English Summary: The proposed \name{} method protects proprietary knowledge bases in retrieval-augmented generation systems by embedding verification behaviors in chain-of-thought reasoning without altering final outputs, ensuring both copyright protection and response accuracy.

Authors:Mayank Vatsa, Aparna Bharati, Surbhi Mittal, Richa Singh
Title: From No to Know: Taxonomy, Challenges, and Opportunities for Negation Understanding in Multimodal Foundation Models
Abstract:
Negation, a linguistic construct conveying absence, denial, or contradiction, poses significant challenges for multilingual multimodal foundation models. These models excel in tasks like machine translation, text-guided generation, image captioning, audio interactions, and video processing but often struggle to accurately interpret negation across diverse languages and cultural contexts. In this perspective paper, we propose a comprehensive taxonomy of negation constructs, illustrating how structural, semantic, and cultural factors influence multimodal foundation models. We present open research questions and highlight key challenges, emphasizing the importance of addressing these issues to achieve robust negation handling. Finally, we advocate for specialized benchmarks, language-specific tokenization, fine-grained attention mechanisms, and advanced multimodal architectures. These strategies can foster more adaptable and semantically precise multimodal foundation models, better equipped to navigate and accurately interpret the complexities of negation in multilingual, multimodal environments.
中文: 否定结构对多语言多模态基础模型构成显著挑战,需要通过专门基准和先进架构等策略来提升其在跨语言文化环境中的精准解读能力。
English: Negation presents major challenges for multilingual multimodal foundation models, requiring improved strategies like specialized benchmarks and advanced architectures to enhance their interpretation across languages and cultural contexts.

Authors:Dongqing Wang, Ehsan Pajouheshgar, Yitao Xu, Tong Zhang, Sabine Süsstrunk
Title: Volumetric Temporal Texture Synthesis for Smoke Stylization using Neural Cellular Automata
Abstract:
Artistic stylization of 3D volumetric smoke data is still a challenge in computer graphics due to the difficulty of ensuring spatiotemporal consistency given a reference style image, and that within reasonable time and computational resources. In this work, we introduce Volumetric Neural Cellular Automata (VNCA), a novel model for efficient volumetric style transfer that synthesizes, in real-time, multi-view consistent stylizing features on the target smoke with temporally coherent transitions between stylized simulation frames. VNCA synthesizes a 3D texture volume with color and density stylization and dynamically aligns this volume with the intricate motion patterns of the smoke simulation under the Eulerian framework. Our approach replaces the explicit fluid advection modeling and the inter-frame smoothing terms with the self-emerging motion of the underlying cellular automaton, thus reducing the training time by over an order of magnitude. Beyond smoke simulations, we demonstrate the versatility of our approach by showcasing its applicability to mesh stylization.
中文: VNCA模型通过用神经细胞自动机替代传统流体平流,实现了实时三维烟雾的时空风格化,训练速度提升十倍以上,并可扩展至网格应用。
English: The VNCA model enables real-time, spatiotemporal stylization of 3D smoke by replacing traditional fluid advection with neural cellular automata, achieving over 10x faster training while extending to mesh applications.

Authors:Shuo Wang, Keke Gai, Jing Yu, Liehuang Zhu, Qi Wu
Title: Vertical Federated Continual Learning via Evolving Prototype Knowledge
Abstract:
Vertical Federated Learning (VFL) has garnered significant attention as a privacy-preserving machine learning framework for sample-aligned feature federation. However, traditional VFL approaches do not address the challenges of class and feature continual learning, resulting in catastrophic forgetting of knowledge from previous tasks. To address the above challenge, we propose a novel vertical federated continual learning method, named Vertical Federated Continual Learning via Evolving Prototype Knowledge (V-LETO), which primarily facilitates the transfer of knowledge from previous tasks through the evolution of prototypes. Specifically, we propose an evolving prototype knowledge method, enabling the global model to retain both previous and current task knowledge. Furthermore, we introduce a model optimization technique that mitigates the forgetting of previous task knowledge by restricting updates to specific parameters of the local model, thereby enhancing overall performance. Extensive experiments conducted in both CIL and FIL settings demonstrate that our method, V-LETO, outperforms the other state-of-the-art methods. For example, our method outperforms the state-of-the-art method by 10.39% and 35.15% for CIL and FIL tasks, respectively. Our code is available at https://anonymous.4open.science/r/V-LETO-0108/README.md.
中文摘要:提出的垂直联邦持续学习方法(V-LETO)通过演化原型和优化模型参数来解决联邦学习中的灾难性遗忘问题,实验证明其性能显著优于现有先进方法。
English Summary: The proposed Vertical Federated Continual Learning method (V-LETO) addresses catastrophic forgetting in federated learning by evolving prototypes and optimizing model parameters, demonstrating superior performance over existing methods in experimental evaluations.

Authors:Changhua Pei, Zexin Wang, Fengrui Liu, Zeyan Li, Yang Liu, Xiao He, Rong Kang, Tieying Zhang, Jianjun Chen, Jianhui Li, Gaogang Xie, Dan Pei
Title: Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis
Abstract:
In the realm of microservices architecture, the occurrence of frequent incidents necessitates the employment of Root Cause Analysis (RCA) for swift issue resolution. It is common that a serious incident can take several domain experts hours to identify the root cause. Consequently, a contemporary trend involves harnessing Large Language Models (LLMs) as automated agents for RCA. Though the recent ReAct framework aligns well with the Site Reliability Engineers (SREs) for its thought-action-observation paradigm, its hallucinations often lead to irrelevant actions and directly affect subsequent results. Additionally, the complex and variable clues of the incident can overwhelm the model one step further. To confront these challenges, we propose Flow-of-Action, a pioneering Standard Operation Procedure (SOP) enhanced LLM-based multi-agent system. By explicitly summarizing the diagnosis steps of SREs, SOP imposes constraints on LLMs at crucial junctures, guiding the RCA process towards the correct trajectory. To facilitate the rational and effective utilization of SOPs, we design an SOP-centric framework called SOP flow. SOP flow contains a series of tools, including one for finding relevant SOPs for incidents, another for automatically generating SOPs for incidents without relevant ones, and a tool for converting SOPs into code. This significantly alleviates the hallucination issues of ReAct in RCA tasks. We also design multiple auxiliary agents to assist the main agent by removing useless noise, narrowing the search space, and informing the main agent whether the RCA procedure can stop. Compared to the ReAct method's 35.50% accuracy, our Flow-of-Action method achieves 64.01%, meeting the accuracy requirements for RCA in real-world systems.
中文摘要:提出的Flow-of-Action框架通过标准操作程序引导基于大语言模型的多智能体系统,显著减少幻觉现象,将微服务架构中根因分析的准确率从35.50%提升至64.01%。
English Summary: The proposed Flow-of-Action framework enhances Root Cause Analysis in microservices by using Standard Operating Procedures to guide multi-agent LLM systems, significantly reducing hallucinations and improving accuracy from 35.50% to 64.01%.

Authors:Yingce Xia, Peiran Jin, Shufang Xie, Liang He, Chuan Cao, Renqian Luo, Guoqing Liu, Yue Wang, Zequn Liu, Yuan-Jyue Chen, Zekun Guo, Yeqi Bai, Pan Deng, Yaosen Min, Ziheng Lu, Hongxia Hao, Han Yang, Jielan Li, Chang Liu, Jia Zhang, Jianwei Zhu, Ran Bi, Kehan Wu, Wei Zhang, Kaiyuan Gao, Qizhi Pei, Qian Wang, Xixian Liu, Yanting Li, Houtian Zhu, Yeqing Lu, Mingqian Ma, Zun Wang, Tian Xie, Krzysztof Maziarz, Marwin Segler, Zhao Yang, Zilong Chen, Yu Shi, Shuxin Zheng, Lijun Wu, Chen Hu, Peggy Dai, Tie-Yan Liu, Haiguang Liu, Tao Qin
Title: Nature Language Model: Deciphering the Language of Nature for Scientific Discovery
Abstract:
Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, RNA and even cells. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce Nature Language Model (NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) top performance across different domains, matching or surpassing state-of-the-art specialist models. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.
中文: 基础模型革新了人工智能,使机器能更好地理解和生成人类语言,并催生了针对科学领域的专门模型;NatureLM作为一个统一的多领域模型,在生成和优化分子、蛋白质等科学实体方面表现卓越,在各类任务中达到顶尖水平。
English: Foundation models have transformed AI by enabling machines to better understand and generate human language, leading to the development of specialized models for scientific domains; NatureLM is a unified, multi-domain model that excels in generating and optimizing scientific entities like molecules and proteins, achieving top performance across various tasks.

Authors:Heyang Zhao, Chenlu Ye, Wei Xiong, Quanquan Gu, Tong Zhang
Title: Logarithmic Regret for Online KL-Regularized Reinforcement Learning
Abstract:
Recent advances in Reinforcement Learning from Human Feedback (RLHF) have shown that KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models (LLMs). Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. While there is a recent line of work on the theoretical analysis of KL-regularized objective in decision making \citep{xiong2024iterative, xie2024exploratory,zhao2024sharp}, these analyses either reduce to the traditional RL setting or rely on strong coverage assumptions. In this paper, we propose an optimism-based KL-regularized online contextual bandit algorithm, and provide a novel analysis of its regret. By carefully leveraging the benign optimization landscape induced by the KL-regularization and the optimistic reward estimation, our algorithm achieves an $\mathcal{O}\big(η\log (N_{\mathcal R} T)\cdot d_{\mathcal R}\big)$ logarithmic regret bound, where $η, N_{\mathcal R},T,d_{\mathcal R}$ denote the KL-regularization parameter, the cardinality of the reward function class, number of rounds, and the complexity of the reward function class. Furthermore, we extend our algorithm and analysis to reinforcement learning by developing a novel decomposition over transition steps and also obtain a similar logarithmic regret bound.
中文: 近期研究表明KL正则化在大型语言模型的强化学习微调中起关键作用,新算法通过乐观奖励估计和利用KL正则化的优化优势,实现了对数遗憾界。
English: Recent research highlights the critical role of KL-regularization in enhancing RL fine-tuning for large language models, with a new algorithm achieving logarithmic regret bounds through optimistic reward estimation and leveraging the optimization benefits of KL-regularization.

Authors:Jincheng Mei, Bo Dai, Alekh Agarwal, Sharan Vaswani, Anant Raj, Csaba Szepesvari, Dale Schuurmans
Title: Small steps no more: Global convergence of stochastic gradient bandits for arbitrary learning rates
Abstract:
We provide a new understanding of the stochastic gradient bandit algorithm by showing that it converges to a globally optimal policy almost surely using \emph{any} constant learning rate. This result demonstrates that the stochastic gradient algorithm continues to balance exploration and exploitation appropriately even in scenarios where standard smoothness and noise control assumptions break down. The proofs are based on novel findings about action sampling rates and the relationship between cumulative progress and noise, and extend the current understanding of how simple stochastic gradient methods behave in bandit settings.
中文: 随机梯度赌博算法在任何恒定学习率下几乎必然收敛至全局最优策略,即使在标准平滑性和噪声控制假设失效时仍能恰当平衡探索与利用。
English: The stochastic gradient bandit algorithm achieves global optimality almost surely with any constant learning rate, effectively balancing exploration and exploitation even without standard smoothness and noise assumptions.

Authors:Xuehang Guo, Xingyao Wang, Yangyi Chen, Sha Li, Chi Han, Manling Li, Heng Ji
Title: SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering
Abstract:
Software engineering (SE) is increasingly collaborative, with developers working together on shared complex codebases. Effective collaboration in shared environments requires participants -- whether humans or AI agents -- to stay on the same page as their environment evolves. When a collaborator's understanding diverges from the current state -- what we term the out-of-sync challenge -- the collaborator's actions may fail, leading to integration issues. In this work, we introduce SyncMind, a framework that systematically defines the out-of-sync problem faced by large language model (LLM) agents in collaborative software engineering (CSE). Based on SyncMind, we create SyncBench, a benchmark featuring 24,332 instances of agent out-of-sync scenarios in real-world CSE derived from 21 popular GitHub repositories with executable verification tests. Experiments on SyncBench uncover critical insights into existing LLM agents' capabilities and limitations. Besides substantial performance gaps among agents (from Llama-3.1 agent <= 3.33% to Claude-3.5-Sonnet >= 28.18%), their consistently low collaboration willingness (<= 4.86%) suggests fundamental limitations of existing LLM in CSE. However, when collaboration occurs, it positively correlates with out-of-sync recovery success. Minimal performance differences in agents' resource-aware out-of-sync recoveries further reveal their significant lack of resource awareness and adaptability, shedding light on future resource-efficient collaborative systems. Code and data are openly available on our project website: https://xhguo7.github.io/SyncMind/.
中文: 本文提出SyncMind框架来解决协作软件工程中的"不同步"挑战,即LLM智能体对代码库的理解与实际状态发生偏离,并通过SyncBench基准测试揭示现有智能体在协作意愿和资源意识方面存在根本性局限,尽管协作发生时其性能与恢复成功率呈正相关。
English: This paper introduces SyncMind, a framework addressing the out-of-sync challenge in collaborative software engineering where LLM agents' understanding diverges from actual codebases, and presents SyncBench, a benchmark revealing critical limitations in existing agents' collaboration willingness and resource awareness despite showing performance correlations when collaboration occurs.

Authors:Ryan Synk, Monte Hoover, John Kirchenbauer, Neel Jain, Alex Stein, Manli Shu, Josue Melendez Sanchez, Ramani Duraiswami, Tom Goldstein
Title: Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs
Abstract:
There is growing demand for performing inference with hundreds of thousands of input tokens on trained transformer models. Inference at this extreme scale demands significant computational resources, hindering the application of transformers at long contexts on commodity (i.e not data center scale) hardware. To address the inference time costs associated with running self-attention based transformer language models on long contexts and enable their adoption on widely available hardware, we propose a tunable mechanism that reduces the cost of the forward pass by attending to only the most relevant tokens at every generation step using a top-k selection mechanism. We showcase the efficiency gains afforded by our method by performing inference on context windows up to 1M tokens using approximately 16GB of GPU RAM. Our experiments reveal that models are capable of handling the sparsity induced by the reduced number of keys and values. By attending to less than 2% of input tokens, we achieve over 95% of model performance on common benchmarks (RULER, AlpacaEval, and Open LLM Leaderboard).
Chinese: 为解决长文本推理中Transformer模型计算资源过高的问题,我们提出可调节的top-k选择机制,通过仅关注关键标记将注意力降至2%以下,在标准测试中保持95%以上性能表现。
English: To address the high computational demands of transformer inference on long contexts, we introduce a tunable top-k selection mechanism that reduces costs by focusing on the most relevant tokens, achieving over 95% performance with less than 2% token attention on standard benchmarks.

Authors:Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, Yue Wu, Ming Yin, Shange Tang, Yangsibo Huang, Chi Jin, Xinyun Chen, Chiyuan Zhang, Mengdi Wang
Title: MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations
Abstract:
Large language models have demonstrated impressive performance on challenging mathematical reasoning tasks, which has triggered the discussion of whether the performance is achieved by true reasoning capability or memorization. To investigate this question, prior work has constructed mathematical benchmarks when questions undergo simple perturbations -- modifications that still preserve the underlying reasoning patterns of the solutions. However, no work has explored hard perturbations, which fundamentally change the nature of the problem so that the original solution steps do not apply. To bridge the gap, we construct MATH-P-Simple and MATH-P-Hard via simple perturbation and hard perturbation, respectively. Each consists of 279 perturbed math problems derived from level-5 (hardest) problems in the MATH dataset (Hendrycksmath et. al., 2021). We observe significant performance drops on MATH-P-Hard across various models, including o1-mini (-16.49%) and gemini-2.0-flash-thinking (-12.9%). We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to modified contexts. This issue is amplified when using original problems for in-context learning. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models.
中文摘要:本研究通过构建MATH-P-Simple和MATH-P-Hard基准测试,发现语言模型在改变问题本质的强扰动下性能显著下降,暴露出模型会盲目套用解题模式而忽视情境适用性的记忆化缺陷,这对开发可靠推理模型提出重要挑战。
English Summary: This study introduces MATH-P-Simple and MATH-P-Hard benchmarks to test whether large language models truly reason or merely memorize, revealing significant performance drops with hard perturbations that alter problem nature and exposing a concerning tendency to misapply learned skills without contextual assessment.

Authors:Chenyu Ni, Sijie Chen, Che-Kai Liu, Liu Liu, Mohsen Imani, Thomas Kampfe, Kai Ni, Michael Niemier, Xiaobo Sharon Hu, Cheng Zhuo, Xunzhao Yin
Title: TAP-CAM: A Tunable Approximate Matching Engine based on Ferroelectric Content Addressable Memory
Abstract:
Pattern search is crucial in numerous analytic applications for retrieving data entries akin to the query. Content Addressable Memories (CAMs), an in-memory computing fabric, directly compare input queries with stored entries through embedded comparison logic, facilitating fast parallel pattern search in memory. While conventional CAM designs offer exact match functionality, they are inadequate for meeting the approximate search needs of emerging data-intensive applications. Some recent CAM designs propose approximate matching functions, but they face limitations such as excessively large cell area or the inability to precisely control the degree of approximation. In this paper, we propose TAP-CAM, a novel ferroelectric field effect transistor (FeFET) based ternary CAM (TCAM) capable of both exact and tunable approximate matching. TAP-CAM employs a compact 2FeFET-2R cell structure as the entry storage unit, and similarities in Hamming distances between input queries and stored entries are measured using an evaluation transistor associated with the matchline of CAM array. The operation, robustness and performance of the proposed design at array level have been discussed and evaluated, respectively. We conduct a case study of K-nearest neighbor (KNN) search to benchmark the proposed TAP-CAM at application level. Results demonstrate that compared to 16T CMOS CAM with exact match functionality, TAP-CAM achieves a 16.95x energy improvement, along with a 3.06% accuracy enhancement. Compared to 2FeFET TCAM with approximate match functionality, TAP-CAM achieves a 6.78x energy improvement.
中文: 本文提出了一种基于铁电场效应晶体管的新型三元内容可寻址存储器TAP-CAM,它能够实现精确且可调节的近似模式匹配,与现有设计相比在节能和精度提升方面表现显著。
English: This paper introduces TAP-CAM, a novel ferroelectric field effect transistor-based ternary content addressable memory that enables both exact and tunable approximate pattern matching, achieving significant energy savings and enhanced accuracy compared to existing designs.

Authors:M Charity, Mayu Wilson, Steven Lee, Dipika Rajesh, Sam Earle, Julian Togelius
Title: Amorphous Fortress Online: Collaboratively Designing Open-Ended Multi-Agent AI and Game Environments
Abstract:
This work introduces Amorphous Fortress Online -- a web-based platform where users can design petri-dish-like environments and games consisting of multi-agent AI characters. Users can play, create, and share artificial life and game environments made up of microscopic but transparent finite-state machine agents that interact with each other. The website features multiple interactive editors and accessible settings to view the multi-agent interactions directly from the browser. This system serves to provide a database of thematically diverse AI and game environments that use the emergent behaviors of simple AI agents.
中文: Amorphous Fortress Online 是一个网络平台,用户可通过基于浏览器的交互编辑器设计、游玩和共享由透明有限状态机AI智能体构成的微观人工生命环境,观察其涌现行为。
English: Amorphous Fortress Online is a web platform enabling users to design, play, and share petri-dish-style environments with transparent finite-state machine AI agents that interact through emergent behaviors.

Authors:Yue Zhao, Fuzhao Xue, Scott Reed, Linxi Fan, Yuke Zhu, Jan Kautz, Zhiding Yu, Philipp Krähenbühl, De-An Huang
Title: QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
Abstract:
We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
中文: QLIP是一种视觉标记化方法,通过动态损失平衡和两阶段训练,将高质量图像重建与卓越的零样本理解能力相结合,并在多模态理解和图像生成任务中验证了其有效性。
English: QLIP is a visual tokenization method that integrates high-quality image reconstruction with superior zero-shot understanding, balancing reconstruction and language-image alignment through dynamic loss management and a two-stage training pipeline.

Authors:Yusheng Dai, Chenxi Wang, Chang Li, Chen Wang, Jun Du, Kewei Li, Ruoyu Wang, Jiefeng Ma, Lei Sun, Jianqing Gao
Title: Latent Swap Joint Diffusion for 2D Long-Form Latent Generation
Abstract:
This paper introduces Swap Forward (SaFa), a modality-agnostic and efficient method to generate seamless and coherence long spectrum and panorama through latent swap joint diffusion across multi-views. We first investigate the spectrum aliasing problem in spectrum-based audio generation caused by existing joint diffusion methods. Through a comparative analysis of the VAE latent representation of Mel-spectra and RGB images, we identify that the failure arises from excessive suppression of high-frequency components during the spectrum denoising process due to the averaging operator. To address this issue, we propose Self-Loop Latent Swap, a frame-level bidirectional swap applied to the overlapping region of adjacent views. Leveraging stepwise differentiated trajectories of adjacent subviews, this swap operator adaptively enhances high-frequency components and avoid spectrum distortion. Furthermore, to improve global cross-view consistency in non-overlapping regions, we introduce Reference-Guided Latent Swap, a unidirectional latent swap operator that provides a centralized reference trajectory to synchronize subview diffusions. By refining swap timing and intervals, we can achieve a cross-view similarity-diversity balance in a forward-only manner. Quantitative and qualitative experiments demonstrate that SaFa significantly outperforms existing joint diffusion methods and even training-based methods in audio generation using both U-Net and DiT models, along with effective longer length adaptation. It also adapts well to panorama generation, achieving comparable performance with 2 $\sim$ 20 $\times$ faster speed and greater model generalizability. More generation demos are available at https://swapforward.github.io/
中文: 本文提出Swap Forward (SaFa)方法,通过潜在交换联合扩散技术,采用自循环和参考引导的潜在交换操作,有效解决频谱混叠问题并增强跨视图一致性,在音频和全景生成中实现高质量输出与高效性能。
English: This paper presents Swap Forward (SaFa), a modality-agnostic method that uses latent swap joint diffusion to generate seamless long spectrograms and panoramas by addressing spectrum aliasing through adaptive high-frequency enhancement and cross-view synchronization.

Authors:Keke Gai, Mohan Wang, Jing Yu, Dongjue Wang, Qi Wu
Title: Adaptive Prototype Knowledge Transfer for Federated Learning with Mixed Modalities and Heterogeneous Tasks
Abstract:
Multimodal Federated Learning (MFL) with mixed modalities enables unimodal and multimodal clients to collaboratively train models while ensuring clients' privacy. As a representative sample of local data, prototypes offer an approach with low resource consumption and no reliance on prior knowledge for MFL with mixed modalities. However, existing prototype-based MFL methods assume unified labels across clients and identical tasks per client, which is impractical in MFL with mixed modalities. In this work, we propose an Adaptive prototype-based Multimodal Federated Learning (AproMFL) framework for mixed modalities to address the aforementioned issues. Our AproMFL transfers knowledge through adaptively-constructed prototypes without unified labels. Clients adaptively select prototype construction methods in line with labels; server converts client prototypes into unified multimodal prototypes and cluster them to form global prototypes. To address model aggregation issues in task heterogeneity, we develop a client relationship graph-based scheme to dynamically adjust aggregation weights. Furthermore, we propose a global prototype knowledge transfer loss and a global model knowledge transfer loss to enable the transfer of global knowledge to local knowledge. Experimental results show that AproMFL outperforms four baselines on three highly heterogeneous datasets ($α=0.1$) and two heterogeneous tasks, with the optimal results in accuracy and recall being 0.42%~6.09% and 1.6%~3.89% higher than those of FedIoT (FedAvg-based MFL), respectively.
Chinese: 提出的自适应原型多模态联邦学习(AproMFL)框架通过自适应构建原型和动态调整聚合权重,实现在混合模态的异构客户端间进行协同训练,在异构数据集和任务上相比现有方法取得了更优的性能表现。
English: The proposed Adaptive Prototype-based Multimodal Federated Learning (AproMFL) framework enables collaborative training across heterogeneous clients with mixed modalities by adaptively constructing prototypes and dynamically adjusting aggregation weights, achieving superior performance over existing methods on heterogeneous datasets and tasks.

Authors:Qitao Tan, Jun Liu, Zheng Zhan, Caiwei Ding, Yanzhi Wang, Xiaolong Ma, Jaewoo Lee, Jin Lu, Geng Yuan
Title: Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning
Abstract:
Large language models (LLMs) excel across various tasks, but standard first-order (FO) fine-tuning demands considerable memory, significantly limiting real-world deployment. Recently, zeroth-order (ZO) optimization stood out as a promising memory-efficient training paradigm, avoiding backward passes and relying solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO method lags far behind FO method in both convergence speed and accuracy. To bridge the gap, we introduce a novel layer-wise divergence analysis that uncovers the distinct update pattern of FO and ZO optimization. Aiming to resemble the learning capacity of FO method from the findings, we propose Divergence-driven Zeroth-Order (DiZO) optimization. DiZO conducts divergence-driven layer adaptation by incorporating projections to ZO updates, generating diverse-magnitude updates precisely scaled to layer-wise individual optimization needs. Our results demonstrate that DiZO significantly reduces the needed iterations for convergence without sacrificing throughput, cutting training GPU hours by up to 48% on various datasets. Moreover, DiZO consistently outperforms the representative ZO baselines in fine-tuning RoBERTa-large, OPT-series, and Llama-series on downstream tasks and, in some cases, even surpasses memory-intensive FO fine-tuning. Our code is released at https://anonymous.4open.science/r/DiZO-E86D.
中文: DiZO优化通过基于层间差异的自适应调整技术,显著加速大语言模型的零阶微调过程,在减少高达48%训练时间的同时,性能达到甚至超越传统一阶优化方法。
English: DiZO optimization introduces a divergence-driven layer adaptation technique that significantly accelerates zeroth-order fine-tuning of large language models, reducing training time by up to 48% while matching or surpassing first-order performance.

Authors:Jixun Yao, Yuguang Yang, Yu Pan, Yuan Feng, Ziqian Ning, Jianhao Ye, Hongbin Zhou, Lei Xie
Title: Fine-grained Preference Optimization Improves Zero-shot Text-to-Speech
Abstract:
Integrating human feedback to align text-to-speech (TTS) system outputs with human preferences has proven to be an effective approach for enhancing the robustness of language model-based TTS systems. Current approaches primarily focus on using preference data annotated at the utterance level. However, frequent issues that affect the listening experience often only arise in specific segments of audio samples, while other segments are well-generated. In this study, we propose a fine-grained preference optimization approach (FPO) to enhance the robustness of TTS systems. FPO focuses on addressing localized issues in generated samples rather than uniformly optimizing the entire utterance. Specifically, we first analyze the types of issues in generated samples, categorize them into two groups, and propose a selective training loss strategy to optimize preferences based on fine-grained labels for each issue type. Experimental results show that FPO enhances the robustness of zero-shot TTS systems by effectively addressing local issues, significantly reducing the bad case ratio, and improving intelligibility. Furthermore, FPO exhibits superior data efficiency compared with baseline systems, achieving similar performance with fewer training samples.
中文: 本研究提出细粒度偏好优化方法,通过针对音频样本中的局部问题来增强语音合成系统的鲁棒性,显著降低错误率并提高可懂度,同时展现出更优的数据效率。
English: This study introduces a fine-grained preference optimization (FPO) method that enhances text-to-speech system robustness by targeting localized audio issues, significantly reducing errors and improving intelligibility with greater data efficiency.

Authors:Leonardo Defilippis, Yatin Dandi, Pierre Mergny, Florent Krzakala, Bruno Loureiro
Title: Optimal Spectral Transitions in High-Dimensional Multi-Index Models
Abstract:
We consider the problem of how many samples from a Gaussian multi-index model are required to weakly reconstruct the relevant index subspace. Despite its increasing popularity as a testbed for investigating the computational complexity of neural networks, results beyond the single-index setting remain elusive. In this work, we introduce spectral algorithms based on the linearization of a message passing scheme tailored to this problem. Our main contribution is to show that the proposed methods achieve the optimal reconstruction threshold. Leveraging a high-dimensional characterization of the algorithms, we show that above the critical threshold the leading eigenvector correlates with the relevant index subspace, a phenomenon reminiscent of the Baik-Ben Arous-Peche (BBP) transition in spiked models arising in random matrix theory. Supported by numerical experiments and a rigorous theoretical framework, our work bridges critical gaps in the computational limits of weak learnability in multi-index model.
Chinese: 本研究基于线性化消息传递提出了谱算法,用于在高斯多索引模型中实现相关子空间的最优重构,通过数值和理论验证达到了临界阈值,在此之上特征向量与子空间产生相关性。
English: This study introduces spectral algorithms based on linearized message passing to optimally reconstruct the relevant subspace in Gaussian multi-index models, achieving the critical threshold where eigenvector correlation emerges as demonstrated by numerical and theoretical evidence.

Authors:Chris Kolb, Tobias Weber, Bernd Bischl, David Rügamer
Title: Deep Weight Factorization: Sparse Learning Through the Lens of Artificial Symmetries
Abstract:
Sparse regularization techniques are well-established in machine learning, yet their application in neural networks remains challenging due to the non-differentiability of penalties like the $L_1$ norm, which is incompatible with stochastic gradient descent. A promising alternative is shallow weight factorization, where weights are decomposed into two factors, allowing for smooth optimization of $L_1$-penalized neural networks by adding differentiable $L_2$ regularization to the factors. In this work, we introduce deep weight factorization, extending previous shallow approaches to more than two factors. We theoretically establish equivalence of our deep factorization with non-convex sparse regularization and analyze its impact on training dynamics and optimization. Due to the limitations posed by standard training practices, we propose a tailored initialization scheme and identify important learning rate requirements necessary for training factorized networks. We demonstrate the effectiveness of our deep weight factorization through experiments on various architectures and datasets, consistently outperforming its shallow counterpart and widely used pruning methods.
Chinese: 深度权重分解通过将权重分解为多个因子,扩展了浅层方法,在神经网络中实现了有效的稀疏正则化,并通过定制初始化与学习率策略超越了现有方法。
English: Deep weight factorization extends shallow approaches by decomposing weights into multiple factors, enabling effective sparse regularization in neural networks while outperforming existing methods through tailored initialization and learning rate strategies.

Authors:Senmao Li, Kai Wang, Joost van de Weijer, Fahad Shahbaz Khan, Chun-Le Guo, Shiqi Yang, Yaxing Wang, Jian Yang, Ming-Ming Cheng
Title: InterLCM: Low-Quality Images as Intermediate States of Latent Consistency Models for Effective Blind Face Restoration
Abstract:
Diffusion priors have been used for blind face restoration (BFR) by fine-tuning diffusion models (DMs) on restoration datasets to recover low-quality images. However, the naive application of DMs presents several key limitations. (i) The diffusion prior has inferior semantic consistency (e.g., ID, structure and color.), increasing the difficulty of optimizing the BFR model; (ii) reliance on hundreds of denoising iterations, preventing the effective cooperation with perceptual losses, which is crucial for faithful restoration. Observing that the latent consistency model (LCM) learns consistency noise-to-data mappings on the ODE-trajectory and therefore shows more semantic consistency in the subject identity, structural information and color preservation, we propose InterLCM to leverage the LCM for its superior semantic consistency and efficiency to counter the above issues. Treating low-quality images as the intermediate state of LCM, InterLCM achieves a balance between fidelity and quality by starting from earlier LCM steps. LCM also allows the integration of perceptual loss during training, leading to improved restoration quality, particularly in real-world scenarios. To mitigate structural and semantic uncertainties, InterLCM incorporates a Visual Module to extract visual features and a Spatial Encoder to capture spatial details, enhancing the fidelity of restored images. Extensive experiments demonstrate that InterLCM outperforms existing approaches in both synthetic and real-world datasets while also achieving faster inference speed.
中文:提出的InterLCM方法利用潜在一致性模型提升盲脸恢复中的语义一致性和效率,通过整合感知损失和视觉模块来增强保真度并实现更快的推理速度。
English: The proposed InterLCM method utilizes the latent consistency model to enhance semantic consistency and efficiency in blind face restoration, integrating perceptual loss and visual modules to improve fidelity and achieve faster inference.

Authors:Zhiliang Wu, Kerui Chen, Kun Li, Hehe Fan, Yi Yang
Title: BVINet: Unlocking Blind Video Inpainting with Zero Annotations
Abstract:
Video inpainting aims to fill in corrupted regions of the video with plausible contents. Existing methods generally assume that the locations of corrupted regions are known, focusing primarily on the "how to inpaint". This reliance necessitates manual annotation of the corrupted regions using binary masks to indicate "whereto inpaint". However, the annotation of these masks is labor-intensive and expensive, limiting the practicality of current methods. In this paper, we expect to relax this assumption by defining a new blind video inpainting setting, enabling the networks to learn the mapping from corrupted video to inpainted result directly, eliminating the need of corrupted region annotations. Specifically, we propose an end-to-end blind video inpainting network (BVINet) to address both "where to inpaint" and "how to inpaint" simultaneously. On the one hand, BVINet can predict the masks of corrupted regions by detecting semantic-discontinuous regions of the frame and utilizing temporal consistency prior of the video. On the other hand, the predicted masks are incorporated into the BVINet, allowing it to capture valid context information from uncorrupted regions to fill in corrupted ones. Besides, we introduce a consistency loss to regularize the training parameters of BVINet. In this way, mask prediction and video completion mutually constrain each other, thereby maximizing the overall performance of the trained model. Furthermore, we customize a dataset consisting of synthetic corrupted videos, real-world corrupted videos, and their corresponding completed videos. This dataset serves as a valuable resource for advancing blind video inpainting research. Extensive experimental results demonstrate the effectiveness and superiority of our method.
中文: 本文提出了一种盲视频修复网络(BVINet),无需人工标注即可自动检测并填充视频中的损坏区域,通过端到端方式同时解决“何处修复”与“如何修复”的问题。
English: This paper introduces a blind video inpainting network (BVINet) that automatically detects and fills corrupted regions without requiring manual mask annotations, addressing both "where" and "how" to inpaint through an end-to-end approach.

Authors:Chiyuan He, Zihuan Qiu, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li
Title: DesCLIP: Robust Continual Learning via General Attribute Descriptions for VLM-Based Visual Recognition
Abstract:
Continual learning of vision-language models (VLMs) focuses on leveraging cross-modal pretrained knowledge to incrementally adapt to expanding downstream tasks and datasets, while tackling the challenge of knowledge forgetting. Existing research often focuses on connecting visual features with specific class text in downstream tasks, overlooking the latent relationships between general and specialized knowledge. Our findings reveal that forcing models to optimize inappropriate visual-text matches exacerbates forgetting of VLM's recognition ability. To tackle this issue, we propose DesCLIP, which leverages general attribute (GA) descriptions to guide the understanding of specific class objects, enabling VLMs to establish robust vision-GA-class trilateral associations rather than relying solely on vision-class connections. Specifically, we introduce a language assistant to generate concrete GA description candidates via proper request prompts. Then, an anchor-based embedding filter is designed to obtain highly relevant GA description embeddings, which are leveraged as the paired text embeddings for visual-textual instance matching, thereby tuning the visual encoder. Correspondingly, the class text embeddings are gradually calibrated to align with these shared GA description embeddings. Extensive experiments demonstrate the advancements and efficacy of our proposed method, with comprehensive empirical evaluations highlighting its superior performance in VLM-based recognition compared to existing continual learning methods.
中文: DesCLIP通过建立视觉、通用属性和类别对象之间的稳固三方关联,利用生成的属性描述来校准视觉-文本匹配和类别嵌入,有效解决了视觉语言模型持续学习中的知识遗忘问题。
English: DesCLIP addresses knowledge forgetting in continual learning of vision-language models by establishing robust trilateral associations between vision, general attributes, and class objects, using generated attribute descriptions to calibrate visual-textual matching and class embeddings.

Authors:Zhiyu Tan, Junyan Wang, Hao Yang, Luozheng Qin, Hesen Chen, Qiang Zhou, Hao Li
Title: Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos
Abstract:
Text-to-video generation has demonstrated promising progress with the advent of diffusion models, yet existing approaches are limited by dataset quality and computational resources. To address these limitations, this paper presents a comprehensive approach that advances both data curation and model design. We introduce CFC-VIDS-1M, a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline. The pipeline first evaluates video quality across multiple dimensions, followed by a fine-grained stage that leverages vision-language models to enhance text-video alignment and semantic richness. Building upon the curated dataset's emphasis on visual quality and temporal coherence, we develop RACCOON, a transformer-based architecture with decoupled spatial-temporal attention mechanisms. The model is trained through a progressive four-stage strategy designed to efficiently handle the complexities of video generation. Extensive experiments demonstrate that our integrated approach of high-quality data curation and efficient training strategy generates visually appealing and temporally coherent videos while maintaining computational efficiency. We will release our dataset, code, and models.
中文摘要:本文提出了一种综合方法,通过粗细结合的筛选流程构建高质量CFC-VIDS-1M数据集,并结合具有解耦注意力机制的RACCOON变换器模型,在保持计算效率的同时生成了视觉吸引力强且时序连贯的视频。
English Summary: This paper introduces a comprehensive approach combining the high-quality CFC-VIDS-1M dataset curated through a coarse-to-fine pipeline and the RACCOON transformer model with decoupled attention, achieving computationally efficient generation of visually appealing and temporally coherent videos.

Authors:Edo Kadosh, Nir Goren, Or Patashnik, Daniel Garibi, Daniel Cohen-Or
Title: Tight Inversion: Image-Conditioned Inversion for Real Image Editing
Abstract:
Text-to-image diffusion models offer powerful image editing capabilities. To edit real images, many methods rely on the inversion of the image into Gaussian noise. A common approach to invert an image is to gradually add noise to the image, where the noise is determined by reversing the sampling equation. This process has an inherent tradeoff between reconstruction and editability, limiting the editing of challenging images such as highly-detailed ones. Recognizing the reliance of text-to-image models inversion on a text condition, this work explores the importance of the condition choice. We show that a condition that precisely aligns with the input image significantly improves the inversion quality. Based on our findings, we introduce Tight Inversion, an inversion method that utilizes the most possible precise condition -- the input image itself. This tight condition narrows the distribution of the model's output and enhances both reconstruction and editability. We demonstrate the effectiveness of our approach when combined with existing inversion methods through extensive experiments, evaluating the reconstruction accuracy as well as the integration with various editing methods.
中文: 文本到图像扩散模型通过将图像反转为噪声实现编辑,但存在重建与可编辑性之间的权衡;本研究提出紧密反转方法,以输入图像本身为精确条件,有效提升这两方面的性能。
English: Text-to-image diffusion models enable image editing by inverting images into noise, but face a tradeoff between reconstruction and editability, which this work addresses by introducing Tight Inversion that uses the input image itself as a precise condition to enhance both aspects.

Authors:Jimmy Chiun, Shizhe Zhang, Yizhuo Wang, Yuhong Cao, Guillaume Sartoretti
Title: MARVEL: Multi-Agent Reinforcement Learning for constrained field-of-View multi-robot Exploration in Large-scale environments
Abstract:
In multi-robot exploration, a team of mobile robot is tasked with efficiently mapping an unknown environments. While most exploration planners assume omnidirectional sensors like LiDAR, this is impractical for small robots such as drones, where lightweight, directional sensors like cameras may be the only option due to payload constraints. These sensors have a constrained field-of-view (FoV), which adds complexity to the exploration problem, requiring not only optimal robot positioning but also sensor orientation during movement. In this work, we propose MARVEL, a neural framework that leverages graph attention networks, together with novel frontiers and orientation features fusion technique, to develop a collaborative, decentralized policy using multi-agent reinforcement learning (MARL) for robots with constrained FoV. To handle the large action space of viewpoints planning, we further introduce a novel information-driven action pruning strategy. MARVEL improves multi-robot coordination and decision-making in challenging large-scale indoor environments, while adapting to various team sizes and sensor configurations (i.e., FoV and sensor range) without additional training. Our extensive evaluation shows that MARVEL's learned policies exhibit effective coordinated behaviors, outperforming state-of-the-art exploration planners across multiple metrics. We experimentally demonstrate MARVEL's generalizability in large-scale environments, of up to 90m by 90m, and validate its practical applicability through successful deployment on a team of real drone hardware.
中文: MARVEL是一种基于图注意力网络与多智能体强化学习的神经框架,能够为视野受限的机器人实现去中心化的协同探索,无需重新训练即可在大规模环境中超越现有方法性能。
English: MARVEL is a neural framework using graph attention networks and multi-agent reinforcement learning to enable decentralized, coordinated exploration for robots with limited field-of-view sensors, outperforming existing methods in large-scale environments without requiring retraining.

Authors:Dongwei Xu, Yutao Zhu, Yao Lu, Youpeng Feng, Yun Lin, Qi Xuan
Title: MCLRL: A Multi-Domain Contrastive Learning with Reinforcement Learning Framework for Few-Shot Modulation Recognition
Abstract:
With the rapid advancements in wireless communication technology, automatic modulation recognition (AMR) plays a critical role in ensuring communication security and reliability. However, numerous challenges, including higher performance demands, difficulty in data acquisition under specific scenarios, limited sample size, and low-quality labeled data, hinder its development. Few-shot learning (FSL) offers an effective solution by enabling models to achieve satisfactory performance with only a limited number of labeled samples. While most FSL techniques are applied in the field of computer vision, they are not directly applicable to wireless signal processing. This study does not propose a new FSL-specific signal model but introduces a framework called MCLRL. This framework combines multi-domain contrastive learning with reinforcement learning. Multi-domain representations of signals enhance feature richness, while integrating contrastive learning and reinforcement learning architectures enables the extraction of deep features for classification. In downstream tasks, the model achieves excellent performance using only a few samples and minimal training cycles. Experimental results show that the MCLRL framework effectively extracts key features from signals, performs well in FSL tasks, and maintains flexibility in signal model selection.
中文摘要:MCLRL框架融合多领域对比学习与强化学习,通过提取丰富信号特征实现高效的小样本自动调制识别,仅需少量样本即可获得优越性能。
English Summary: The MCLRL framework integrates multi-domain contrastive learning with reinforcement learning to enable effective few-shot automatic modulation recognition by extracting rich signal features, achieving strong performance with minimal training samples.

Authors:Ruokai Yin, Yuhang Li, Priyadarshini Panda
Title: PacQ: A SIMT Microarchitecture for Efficient Dataflow in Hyper-asymmetric GEMMs
Abstract:
Weight-only quantization has been widely explored in large language models (LLMs) to reduce memory storage and data loading overhead. During deployment on single-instruction-multiple-threads (SIMT) architectures, weights are stored in low-precision integer (INT) format, while activations remain in full-precision floating-point (FP) format to preserve inference accuracy. Although memory footprint and data loading requirements for weight matrices are reduced, computation performance gains remain limited due to the need to convert weights back to FP format through unpacking and dequantization before GEMM operations. In this work, we investigate methods to accelerate GEMM operations involving packed low-precision INT weights and high-precision FP activations, defining this as the hyper-asymmetric GEMM problem. Our approach co-optimizes tile-level packing and dataflow strategies for INT weight matrices. We further design a specialized FP-INT multiplier unit tailored to our packing and dataflow strategies, enabling parallel processing of multiple INT weights. Finally, we integrate the packing, dataflow, and multiplier unit into PacQ, a SIMT microarchitecture designed to efficiently accelerate hyper-asymmetric GEMMs. We show that PacQ can achieve up to 1.99x speedup and 81.4% reduction in EDP compared to weight-only quantized LLM workloads running on conventional SIMT baselines.
Chinese: 仅权重量化虽能降低大语言模型的内存占用,但受限于浮点转换需求而计算增益有限;为此开发的PacQ微架构通过优化数据流和专用乘法单元,将超非对称矩阵乘法的速度提升至1.99倍并降低81.4%的能耗延迟积。
English: Weight-only quantization reduces memory usage in LLMs but limits computational gains due to FP conversion needs, leading to the development of PacQ, a specialized microarchitecture that accelerates hyper-asymmetric GEMM operations for up to 1.99x speedup and 81.4% EDP reduction.

Authors:Ching-Chun Chang, Isao Echizen
Title: Steganography Beyond Space-Time with Chain of Multimodal AI
Abstract:
Steganography is the art and science of covert writing, with a broad range of applications interwoven within the realm of cybersecurity. As artificial intelligence continues to evolve, its ability to synthesise realistic content emerges as a threat in the hands of cybercriminals who seek to manipulate and misrepresent the truth. Such synthetic content introduces a non-trivial risk of overwriting the subtle changes made for the purpose of steganography. When the signals in both the spatial and temporal domains are vulnerable to unforeseen overwriting, it calls for reflection on what, if any, remains invariant. This study proposes a paradigm in steganography for audiovisual media, where messages are concealed beyond both spatial and temporal domains. A chain of multimodal artificial intelligence is developed to deconstruct audiovisual content into a cover text, embed a message within the linguistic domain, and then reconstruct the audiovisual content through synchronising both auditory and visual modalities with the resultant stego text. The message is encoded by biasing the word sampling process of a language generation model and decoded by analysing the probability distribution of word choices. The accuracy of message transmission is evaluated under both zero-bit and multi-bit capacity settings. Fidelity is assessed through both biometric and semantic similarities, capturing the identities of the recorded face and voice, as well as the core ideas conveyed through the media. Secrecy is examined through statistical comparisons between cover and stego texts. Robustness is tested across various scenarios, including audiovisual resampling, face-swapping, voice-cloning and their combinations.
中文: 本研究提出了一种新颖的视听媒体隐写范式,通过多模态人工智能将信息嵌入语言特征并重建同步媒体,将消息隐藏于时空域之外,经全面测试在保真度、隐蔽性和准确性方面均展现出对抗多种篡改操作的鲁棒性。
English: This study introduces a novel steganography paradigm for audiovisual media that conceals messages beyond spatial and temporal domains by using multimodal AI to embed information in linguistic features and reconstruct synchronized media, demonstrating robustness against various manipulations through comprehensive testing of fidelity, secrecy, and accuracy.

Authors:Rikuto Kotoge, Ziwei Yang, Zheng Chen, Yushun Dong, Yasuko Matsubara, Jimeng Sun, Yasushi Sakurai
Title: ExPath: Targeted Pathway Inference for Biological Knowledge Bases via Graph Learning and Explanation
Abstract:
Retrieving targeted pathways in biological knowledge bases, particularly when incorporating wet-lab experimental data, remains a challenging task and often requires downstream analyses and specialized expertise. In this paper, we frame this challenge as a solvable graph learning and explaining task and propose a novel subgraph inference framework, ExPAth, that explicitly integrates experimental data to classify various graphs (bio-networks) in biological databases. The links (representing pathways) that contribute more to classification can be considered as targeted pathways. Our framework can seamlessly integrate biological foundation models to encode the experimental molecular data. We propose ML-oriented biological evaluations and a new metric. The experiments involving 301 bio-networks evaluations demonstrate that pathways inferred by ExPath are biologically meaningful, achieving up to 4.5x higher Fidelity+ (necessity) and 14x lower Fidelity- (sufficiency) than explainer baselines, while preserving signaling chains up to 4x longer.
Chinese: ExPAth是一种新颖的子图推断框架,通过整合实验数据对生物网络进行分类并识别目标通路,在301个生物网络评估中比基线方法实现了显著更高的保真度和更长的信号链保留。
English: ExPAth is a novel subgraph inference framework that integrates experimental data to classify biological networks and identify targeted pathways, achieving significantly higher fidelity and longer signaling chains than baseline methods in evaluations across 301 bio-networks.

Authors:David Noever, Forrest McKee
Title: AirTag, You're It: Reverse Logistics and Last Mile Dynamics
Abstract:
This study addresses challenges in reverse logistics, a frequently overlooked but essential component of last-mile delivery, particularly in disaster relief scenarios where infrastructure disruptions demand adaptive solutions. While hub-and-spoke logistics networks excel at long-distance scalability, they often fail to optimize closely spaced spokes reliant on distant hubs, introducing inefficiencies in transit times and resource allocation. Using 20 Apple AirTags embedded in packages, this research provides empirical insights into logistical flows, capturing granular spatial and temporal data through Bluetooth LE (BLE) 5 trackers integrated with the Apple Find My network. These trackers demonstrated their value in monitoring dynamic cargo movements, enabling real-time adjustments in mobile hub placement and route optimization, particularly in disaster relief contexts like Hurricane Helene. A novel application of discrete event simulation (DES) further explored the saddle point in hub-spoke configurations, where excessive hub reliance clashes with diminishing spoke interaction demand. By coupling simulation results with empirical AirTag tracking, the study highlights the potential of BLE technology to refine reverse logistics, reduce delays, and improve operational flexibility in both routine and crisis-driven delivery networks.
中文摘要:本研究通过蓝牙低能耗追踪技术和离散事件仿真,展示了如何在灾难救援等场景中优化逆向物流,实现实时路线调整并改进枢纽辐射式网络配置,从而提高运营效率。
English Summary: This research demonstrates how Bluetooth LE tracking technology and discrete event simulation can optimize reverse logistics by improving real-time route adjustments and hub-spoke configurations, particularly in disaster relief scenarios like Hurricane Helene.

Authors:Qianhui Zhao, Li Zhang, Fang Liu, Xiaoli Lian, Qiaoyuanhe Meng, Ziqian Jiao, Zetong Zhou, Borui Zhang, Runlin Guo, Jia Li
Title: CodeSwift: Accelerating LLM Inference for Efficient Code Generation
Abstract:
Code generation is a latency-sensitive task that demands high timeliness, but the autoregressive decoding mechanism of Large Language Models (LLMs) leads to poor inference efficiency. Existing LLM inference acceleration methods mainly focus on standalone functions using only built-in components. Moreover, they treat code like natural language sequences, ignoring its unique syntax and semantic characteristics. As a result, the effectiveness of these approaches in code generation tasks remains limited and fails to align with real-world programming scenarios. To alleviate this issue, we propose CodeSwift, a simple yet highly efficient inference acceleration approach specifically designed for code generation, without comprising the quality of the output. CodeSwift constructs a multi-source datastore, providing access to both general and project-specific knowledge, facilitating the retrieval of high-quality draft sequences. Moreover, CodeSwift reduces retrieval cost by controlling retrieval timing, and enhances efficiency through parallel retrieval and a context- and LLM preference-aware cache. Experimental results show that CodeSwift can reach up to 2.53x and 2.54x speedup compared to autoregressive decoding in repository-level and standalone code generation tasks, respectively, outperforming state-of-the-art inference acceleration approaches by up to 88%.
中文: 代码生成任务对效率要求高,但现有方法多注重正确性而忽视推理速度,因此提出FastCoder这一专用加速方案,通过优化检索和缓存机制在不降低质量的前提下显著提升生成速度。
English: Code generation requires high efficiency, but current methods often neglect inference speed in favor of correctness, prompting the development of FastCoder, a specialized acceleration approach that enhances speed without sacrificing quality through optimized retrieval and caching mechanisms.

Authors:Qianhui Zhao, Li Zhang, Fang Liu, Xiaoli Lian, Qiaoyuanhe Meng, Ziqian Jiao, Zetong Zhou, Jia Li, Lin Shi
Title: FastCoder: Accelerating Repository-level Code Generation via Efficient Retrieval and Verification
Abstract:
Code generation is a latency-sensitive task that demands high timeliness. However, with the growing interest and inherent difficulty in repository-level code generation, most existing code generation studies focus on improving the correctness of generated code while overlooking the inference efficiency, which is substantially affected by the overhead during LLM generation. Although there has been work on accelerating LLM inference, these approaches are not tailored to the specific characteristics of code generation; instead, they treat code the same as natural language sequences and ignore its unique syntax and semantic characteristics, which are also crucial for improving efficiency. Consequently, these approaches exhibit limited effectiveness in code generation tasks, particularly for repository-level scenarios with considerable complexity and difficulty. To alleviate this issue, following draft-verification paradigm, we propose FastCoder, a simple yet highly efficient inference acceleration approach specifically designed for code generation, without compromising the quality of the output. FastCoder constructs a multi-source datastore, providing access to both general and project-specific knowledge, facilitating the retrieval of high-quality draft sequences. Moreover, FastCoder reduces the retrieval cost by controlling retrieval timing, and enhances efficiency through parallel retrieval and a context- and LLM preference-aware cache. Experimental results show that FastCoder can reach up to 2.53x and 2.54x speedup compared to autoregressive decoding in repository-level and standalone code generation tasks, respectively, outperforming state-of-the-art inference acceleration approaches by up to 88%. FastCoder can also be integrated with existing correctness-focused code generation approaches to accelerate the LLM generation process, and reach a speedup exceeding 2.6x.
中文: 代码生成任务对效率要求高,但现有方法多注重正确性而忽视推理速度,因此提出FastCoder这一专用加速方案,通过优化检索和缓存机制在不降低质量的前提下显著提升生成速度。
English: Code generation requires high efficiency, but current methods often neglect inference speed in favor of correctness, prompting the development of FastCoder, a specialized acceleration approach that enhances speed without sacrificing quality through optimized retrieval and caching mechanisms.

Authors:Yilin Geng, Haonan Li, Honglin Mu, Xudong Han, Timothy Baldwin, Omri Abend, Eduard Hovy, Lea Frermann
Title: Control Illusion: The Failure of Instruction Hierarchies in Large Language Models
Abstract:
Large language models (LLMs) are increasingly deployed with hierarchical instruction schemes, where certain instructions (e.g., system-level directives) are expected to take precedence over others (e.g., user messages). Yet, we lack a systematic understanding of how effectively these hierarchical control mechanisms work. We introduce a systematic evaluation framework based on constraint prioritization to assess how well LLMs enforce instruction hierarchies. Our experiments across six state-of-the-art LLMs reveal that models struggle with consistent instruction prioritization, even for simple formatting conflicts. We find that the widely-adopted system/user prompt separation fails to establish a reliable instruction hierarchy, and models exhibit strong inherent biases toward certain constraint types regardless of their priority designation. We find that LLMs more reliably obey constraints framed through natural social hierarchies (e.g., authority, expertise, consensus) than system/user roles, which suggests that pretraining-derived social structures act as latent control priors, with potentially stronger influence than post-training guardrails.
中文摘要:大型语言模型在指令优先级执行上存在不一致性,其更倾向于遵循预训练时习得的社会层级结构,而非通过提示词设置的系统/用户角色区分。
English Summary: Large language models struggle with consistent instruction prioritization, showing stronger bias toward social hierarchies learned during pretraining than toward system/user role distinctions implemented through prompting.

Authors:Dahyun Jung, Jaehyung Seo, Jaewook Lee, Chanjun Park, Heuiseok Lim
Title: CoME: An Unlearning-based Approach to Conflict-free Model Editing
Abstract:
Large language models (LLMs) often retain outdated or incorrect information from pre-training, which undermines their reliability. While model editing methods have been developed to address such errors without full re-training, they frequently suffer from knowledge conflicts, where outdated information interferes with new knowledge. In this work, we propose Conflict-free Model Editing (CoME), a novel framework that enhances the accuracy of knowledge updates in LLMs by selectively removing outdated knowledge. CoME leverages unlearning to mitigate knowledge interference, allowing new information to be integrated without compromising relevant linguistic features. Through experiments on GPT-J and LLaMA-3 using Counterfact and ZsRE datasets, we demonstrate that CoME improves both editing accuracy and model reliability when applied to existing editing methods. Our results highlight that the targeted removal of outdated knowledge is crucial for enhancing model editing effectiveness and maintaining the model's generative performance.
中文:提出的无冲突模型编辑(CoME)框架通过选择性遗忘机制消除过时知识,在提升大语言模型编辑精度和可靠性的同时,保持其语言生成能力不受影响。
English: The proposed Conflict-free Model Editing (CoME) framework improves LLM knowledge updates by selectively removing outdated information through unlearning, enhancing editing accuracy and model reliability without compromising linguistic performance.

Authors:Hanqi Yan, Xiangxiang Cui, Lu Yin, Paul Pu Liang, Yulan He, Yifei Wang
Title: Multi-Faceted Multimodal Monosemanticity
Abstract:
Humans experience the world through multiple modalities, such as, vision, language, and speech, making it natural to explore the commonality and distinctions among them. In this work, we take a data-driven approach to address this question by analyzing interpretable, monosemantic features extracted from deep multimodal models. Specifically, we investigate CLIP, a prominent visual-language representation model trained on massive image-text pairs. Building on prior research in single-modal interpretability, we develop a set of multi-modal interpretability tools and measures designed to disentangle and analyze features learned from CLIP. Specifically, we introduce the Modality Dominance Score (MDS) to attribute each CLIP feature to a specific modality. We then map CLIP features into a more interpretable space, enabling us to categorize them into three distinct classes: vision features (single-modal), language features (single-modal), and visual-language features (cross-modal). Interestingly, this data-driven categorization closely aligns with human intuitive understandings of different modalities. We further show that this modality decomposition can benefit multiple downstream tasks, including reducing bias in gender detection, generating cross-modal adversarial examples, and enabling modal-specific feature control in text-to-image generation. These results indicate that large-scale multimodal models, when equipped with task-agnostic interpretability tools, can offer valuable insights into the relationships between different data modalities.
中文摘要:本研究开发了可解释性工具来分析CLIP中的多模态特征,将其分为视觉、语言和跨模态三类,这些分类有助于减少性别检测偏见和提升跨模态生成等下游任务性能。
English Summary: This study introduces interpretability tools to analyze multimodal features in CLIP, categorizing them into vision, language, and cross-modal types, which enhances tasks like bias reduction and cross-modal generation.

Authors:Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Matei Zaharia, James Zou, Ion Stoica
Title: Optimizing Model Selection for Compound AI Systems
Abstract:
Compound AI systems that combine multiple LLM calls, such as self-refine and multi-agent-debate, achieve strong performance on many AI tasks. We address a core question in optimizing compound systems: for each LLM call or module in the system, how should one decide which LLM to use? We show that these LLM choices have a large effect on quality, but the search space is exponential. We propose LLMSelector, an efficient framework for model selection in compound systems, which leverages two key empirical insights: (i) end-to-end performance is often monotonic in how well each module performs, with all other modules held fixed, and (ii) per-module performance can be estimated accurately by an LLM. Building upon these insights, LLMSelector iteratively selects one module and allocates to it the model with the highest module-wise performance, as estimated by an LLM, until no further gain is possible. LLMSelector is applicable to any compound system with a bounded number of modules, and its number of API calls scales linearly with the number of modules, achieving high-quality model allocation both empirically and theoretically. Experiments with popular compound systems such as multi-agent debate and self-refine using LLMs such as GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 show that LLMSelector confers 5%-70% accuracy gains compared to using the same LLM for all modules.
中文: 复合AI系统虽性能优异,但面临模块间LLM选择的指数级复杂度;LLMSelector框架通过基于LLM性能评估的迭代模块化模型分配,高效解决了该问题,实现了5%-70%的准确率提升。
English: Compound AI systems achieve high performance but face exponential complexity in selecting optimal LLMs for each module, which is efficiently addressed by LLMSelector through iterative module-wise model allocation based on LLM-estimated performance, yielding 5%-70% accuracy improvements.

Authors:Hongji Yang, Wencheng Han, Yucheng Zhou, Jianbing Shen
Title: DC-ControlNet: Decoupling Inter- and Intra-Element Conditions in Image Generation with Diffusion Models
Abstract:
In this paper, we introduce DC (Decouple)-ControlNet, a highly flexible and precisely controllable framework for multi-condition image generation. The core idea behind DC-ControlNet is to decouple control conditions, transforming global control into a hierarchical system that integrates distinct elements, contents, and layouts. This enables users to mix these individual conditions with greater flexibility, leading to more efficient and accurate image generation control. Previous ControlNet-based models rely solely on global conditions, which affect the entire image and lack the ability of element- or region-specific control. This limitation reduces flexibility and can cause condition misunderstandings in multi-conditional image generation. To address these challenges, we propose both intra-element and Inter-element Controllers in DC-ControlNet. The Intra-Element Controller handles different types of control signals within individual elements, accurately describing the content and layout characteristics of the object. For interactions between elements, we introduce the Inter-Element Controller, which accurately handles multi-element interactions and occlusion based on user-defined relationships. Extensive evaluations show that DC-ControlNet significantly outperforms existing ControlNet models and Layout-to-Image generative models in terms of control flexibility and precision in multi-condition control. Our project website is available at: https://um-lab.github.io/DC-ControlNet/
中文: DC-ControlNet提出了一种解耦的多条件图像生成框架,通过将控制条件分解为层级化的元素内与元素间控制器,显著提升了控制的灵活性和精确度。
English: DC-ControlNet introduces a decoupled framework for multi-condition image generation, enhancing flexibility and precision by separating control conditions into hierarchical intra-element and inter-element controllers.

Authors:Liang Chen, Li Shen, Yang Deng, Xiaoyan Zhao, Bin Liang, Kam-Fai Wong
Title: PEARL: Towards Permutation-Resilient LLMs
Abstract:
The in-context learning (ICL) capability of large language models (LLMs) enables them to perform challenging tasks using provided demonstrations. However, ICL is highly sensitive to the ordering of demonstrations, leading to instability in predictions. This paper shows that this vulnerability can be exploited to design a natural attack - difficult for model providers to detect - that achieves nearly 80% success rate on LLaMA-3 by simply permuting the demonstrations. Existing mitigation methods primarily rely on post-processing and fail to enhance the model's inherent robustness to input permutations, raising concerns about safety and reliability of LLMs. To address this issue, we propose Permutation-resilient learning (PEARL), a novel framework based on distributionally robust optimization (DRO), which optimizes model performance against the worst-case input permutation. Specifically, PEARL consists of a permutation-proposal network (P-Net) and the LLM. The P-Net generates the most challenging permutations by treating it as an optimal transport problem, which is solved using an entropy-constrained Sinkhorn algorithm. Through minimax optimization, the P-Net and the LLM iteratively optimize against each other, progressively improving the LLM's robustness. Experiments on synthetic pre-training and real-world instruction tuning tasks demonstrate that PEARL effectively mitigates permutation attacks and enhances performance. Notably, despite being trained on fewer shots and shorter contexts, PEARL achieves performance gains of up to 40% when scaled to many-shot and long-context scenarios, highlighting its efficiency and generalization capabilities.
中文: 本文揭示大语言模型的上下文学习易受演示顺序攻击,提出PEARL鲁棒学习框架,通过极小极大优化增强模型抗干扰能力,实验证明其能有效提升性能并具备良好泛化性。
English: This paper reveals that large language models' in-context learning is vulnerable to permutation attacks and proposes PEARL, a robust learning framework that enhances model resilience through minimax optimization, achieving significant performance improvements.

Authors:Zhiwei Liu, Kailai Yang, Eduard Hovy, Sophia Ananiadou
Title: Rumor Detection by Multi-task Suffix Learning based on Time-series Dual Sentiments
Abstract:
The widespread dissemination of rumors on social media has a significant impact on people's lives, potentially leading to public panic and fear. Rumors often evoke specific sentiments, resonating with readers and prompting sharing. To effectively detect and track rumors, it is essential to observe the fine-grained sentiments of both source and response message pairs as the rumor evolves over time. However, current rumor detection methods fail to account for this aspect. In this paper, we propose MSuf, the first multi-task suffix learning framework for rumor detection and tracking using time series dual (coupled) sentiments. MSuf includes three modules: (1) an LLM to extract sentiment intensity features and sort them chronologically; (2) a module that fuses the sorted sentiment features with their source text word embeddings to obtain an aligned embedding; (3) two hard prompts are combined with the aligned vector to perform rumor detection and sentiment analysis using one frozen LLM. MSuf effectively enhances the performance of LLMs for rumor detection with only minimal parameter fine-tuning. Evaluating MSuf on four rumor detection benchmarks, we find significant improvements compared to other emotion-based methods.
中文:MSuf是一种创新的多任务框架,通过分析源消息与回复消息间的时间序列情感对来改进谣言检测,仅需微调极少量参数即可显著提升大语言模型在基准测试中的性能。
English: MSuf is a novel multi-task framework that improves rumor detection by analyzing time-series sentiment pairs between source and response messages, requiring minimal fine-tuning to significantly boost LLM performance on benchmarks.

Authors:Lingfeng Zhang, Yuecheng Liu, Zhanguang Zhang, Matin Aghaei, Yaochen Hu, Hongjian Gu, Mohammad Ali Alomrani, David Gamaliel Arcos Bravo, Raika Karimi, Atia Hamidizadeh, Haoping Xu, Guowei Huang, Zhanpeng Zhang, Tongtong Cao, Weichao Qiu, Xingyue Quan, Jianye Hao, Yuzheng Zhuang, Yingxue Zhang
Title: Mem2Ego: Empowering Vision-Language Models with Global-to-Ego Memory for Long-Horizon Embodied Navigation
Abstract:
Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have made them powerful tools in embodied navigation, enabling agents to leverage commonsense and spatial reasoning for efficient exploration in unfamiliar environments. Existing LLM-based approaches convert global memory, such as semantic or topological maps, into language descriptions to guide navigation. While this improves efficiency and reduces redundant exploration, the loss of geometric information in language-based representations hinders spatial reasoning, especially in intricate environments. To address this, VLM-based approaches directly process ego-centric visual inputs to select optimal directions for exploration. However, relying solely on a first-person perspective makes navigation a partially observed decision-making problem, leading to suboptimal decisions in complex environments. In this paper, we present a novel vision-language model (VLM)-based navigation framework that addresses these challenges by adaptively retrieving task-relevant cues from a global memory module and integrating them with the agent's egocentric observations. By dynamically aligning global contextual information with local perception, our approach enhances spatial reasoning and decision-making in long-horizon tasks. Experimental results demonstrate that the proposed method surpasses previous state-of-the-art approaches in object navigation tasks, providing a more effective and scalable solution for embodied navigation.
中文: 本文提出了一种新型视觉语言模型导航框架,通过自适应整合全局记忆线索与局部感知信息来增强空间推理能力,在物体导航任务中超越了现有最优方法。
English: This paper introduces a novel VLM-based navigation framework that enhances spatial reasoning by adaptively integrating global memory cues with local egocentric observations, outperforming previous methods in object navigation tasks.

Authors:Chak Tou Leong, Qingyu Yin, Jian Wang, Wenjie Li
Title: Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region
Abstract:
The safety alignment of large language models (LLMs) remains vulnerable, as their initial behavior can be easily jailbroken by even relatively simple attacks. Since infilling a fixed template between the input instruction and initial model output is a common practice for existing LLMs, we hypothesize that this template is a key factor behind their vulnerabilities: LLMs' safety-related decision-making overly relies on the aggregated information from the template region, which largely influences these models' safety behavior. We refer to this issue as template-anchored safety alignment. In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across various aligned LLMs. Our mechanistic analyses demonstrate how it leads to models' susceptibility when encountering inference-time jailbreak attacks. Furthermore, we show that detaching safety mechanisms from the template region is promising in mitigating vulnerabilities to jailbreak attacks. We encourage future research to develop more robust safety alignment techniques that reduce reliance on the template region.
中文摘要:大语言模型的安全对齐因过度依赖模板区域而存在漏洞,易受越狱攻击影响,可通过将安全机制与模板区域分离来缓解此问题。
English Summary: Large language models' safety alignment is vulnerable due to over-reliance on template regions, making them susceptible to jailbreak attacks, which can be mitigated by detaching safety mechanisms from these templates.

Authors:Renxi Wang, Honglin Mu, Liqun Ma, Lizhi Lin, Yunlong Feng, Timothy Baldwin, Xudong Han, Haonan Li
Title: SCALAR: Scientific Citation-based Live Assessment of Long-context Academic Reasoning
Abstract:
Evaluating large language models' (LLMs) long-context understanding capabilities remains challenging. We present SCALAR (Scientific Citation-based Live Assessment of Long-context Academic Reasoning), a novel benchmark that leverages academic papers and their citation networks. SCALAR features automatic generation of high-quality ground truth labels without human annotation, controllable difficulty levels, and a dynamic updating mechanism that prevents data contamination. Using ICLR 2025 papers, we evaluate 8 state-of-the-art LLMs, revealing key insights about their capabilities and limitations in processing long scientific documents across different context lengths and reasoning types. Our benchmark provides a reliable and sustainable way to track progress in long-context understanding as LLM capabilities evolve.
中文摘要:SCALAR是一个基于学术引文网络的新型基准,通过可控难度和动态更新机制自动评估大语言模型的长文本理解能力,揭示了其在处理科学文献方面的关键表现与局限。
English Summary: SCALAR is a novel benchmark using academic citation networks to automatically evaluate LLMs' long-context understanding through controllable difficulty levels and dynamic updates, revealing key insights about their capabilities with scientific documents.

Authors:Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, Muhao Chen
Title: ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails
Abstract:
Ensuring the safety of large language models (LLMs) is critical as they are deployed in real-world applications. Existing guardrails rely on rule-based filtering or single-pass classification, limiting their ability to handle nuanced safety violations. To address this, we propose ThinkGuard, a critique-augmented guardrail model that distills knowledge from high-capacity LLMs by generating structured critiques alongside safety labels. Fine-tuned on critique-augmented data, the captured deliberative thinking ability drastically enhances the guardrail's cautiousness and interpretability. Evaluated on multiple safety benchmarks, ThinkGuard achieves the highest average F1 and AUPRC, outperforming all baselines. Compared to LLaMA Guard 3, ThinkGuard improves accuracy by 16.1% and macro F1 by 27.0%. Moreover, it surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning while maintaining computational efficiency.
Chinese: ThinkGuard是一种新型护栏模型,通过整合大语言模型的结构化批判来增强安全性,在多项安全基准测试中表现卓越,其准确率和F1分数较现有方法均有显著提升。
English: ThinkGuard is a novel guardrail model that enhances LLM safety by incorporating structured critiques from high-capacity models, achieving superior performance on safety benchmarks with significant accuracy and F1 score improvements over existing methods.

Authors:Jian Wang, Yinpei Dai, Yichi Zhang, Ziqiao Ma, Wenjie Li, Joyce Chai
Title: Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors
Abstract:
Intelligent tutoring agents powered by large language models (LLMs) have been increasingly explored to deliver personalized knowledge in areas such as language learning and science education. However, their capabilities in guiding users to solve complex real-world tasks remain underexplored. To address this limitation, in this work, we focus on coding tutoring, a challenging problem that requires tutors to proactively guide students towards completing predefined coding tasks. We propose a novel agent workflow, Trace-and-Verify (TRAVER), which combines knowledge tracing to estimate a student's knowledge state and turn-by-turn verification to ensure effective guidance toward task completion. We introduce DICT, an automatic evaluation protocol that assesses tutor agents using controlled student simulation and code generation tests. Extensive experiments reveal the challenges of coding tutoring and demonstrate that TRAVER achieves a significantly higher success rate. Although we use code tutoring as an example in this paper, our approach can be extended beyond coding, providing valuable insights into advancing tutoring agents for human task learning.
Chinese: 本文提出TRAVER智能辅导框架,通过知识追踪与逐轮验证相结合的方法显著提升编程教学的任务完成率,其应用潜力可扩展至编程以外的广泛人类任务学习领域。
English: This paper introduces the TRAVER agent workflow, which enhances coding tutoring by combining knowledge tracing and turn-by-turn verification to significantly improve task completion success rates, with potential applications extending beyond coding to broader human task learning.

Authors:Junjun Pan, Yixin Liu, Xin Zheng, Yizhen Zheng, Alan Wee-Chung Liew, Fuyi Li, Shirui Pan
Title: A Label-Free Heterophily-Guided Approach for Unsupervised Graph Fraud Detection
Abstract:
Graph fraud detection (GFD) has rapidly advanced in protecting online services by identifying malicious fraudsters. Recent supervised GFD research highlights that heterophilic connections between fraudsters and users can greatly impact detection performance, since fraudsters tend to camouflage themselves by building more connections to benign users. Despite the promising performance of supervised GFD methods, the reliance on labels limits their applications to unsupervised scenarios; Additionally, accurately capturing complex and diverse heterophily patterns without labels poses a further challenge. To fill the gap, we propose a Heterophily-guided Unsupervised Graph fraud dEtection approach (HUGE) for unsupervised GFD, which contains two essential components: a heterophily estimation module and an alignment-based fraud detection module. In the heterophily estimation module, we design a novel label-free heterophily metric called HALO, which captures the critical graph properties for GFD, enabling its outstanding ability to estimate heterophily from node attributes. In the alignment-based fraud detection module, we develop a joint MLP-GNN architecture with ranking loss and asymmetric alignment loss. The ranking loss aligns the predicted fraud score with the relative order of HALO, providing an extra robustness guarantee by comparing heterophily among non-adjacent nodes. Moreover, the asymmetric alignment loss effectively utilizes structural information while alleviating the feature-smooth effects of GNNs. Extensive experiments on 6 datasets demonstrate that HUGE significantly outperforms competitors, showcasing its effectiveness and robustness.
Chinese Summary: 提出的HUGE方法通过无标签异质性估计模块和基于对齐的欺诈检测模块,解决了无监督图欺诈检测的难题,有效捕捉复杂异质性模式并在多个数据集上展现出卓越性能。
English Summary: The proposed HUGE method addresses unsupervised graph fraud detection by introducing a label-free heterophily estimation module and an alignment-based detection module, which together effectively capture complex heterophily patterns and demonstrate superior performance across multiple datasets.

Authors:Dantong Niu, Yuvan Sharma, Haoru Xue, Giscard Biamby, Junyi Zhang, Ziteng Ji, Trevor Darrell, Roei Herzig
Title: Pre-training Auto-regressive Robotic Models with 4D Representations
Abstract:
Foundation models pre-trained on massive unlabeled datasets have revolutionized natural language and computer vision, exhibiting remarkable generalization capabilities, thus highlighting the importance of pre-training. Yet, efforts in robotics have struggled to achieve similar success, limited by either the need for costly robotic annotations or the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pre-trained robotic model. Specifically, we focus on utilizing 3D point tracking representations from videos derived by lifting 2D representations into 3D space via monocular depth estimation across time. These 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, enabling efficient transfer learning from human video data to low-level robotic control. Our experiments show that ARM4R can transfer efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.
中文摘要:基础模型在自然语言处理和计算机视觉领域通过大规模无标注数据预训练实现了革命性突破,但机器人技术因标注成本高和物理世界建模不足而发展滞后;ARM4R通过从人类视频数据学习4D表征,构建自回归机器人模型,有效实现了从视频到机器人控制的迁移学习。
English Summary: Foundation models have transformed AI fields like NLP and vision with strong generalization, but robotics lags due to annotation costs and poor physical world modeling; ARM4R addresses this by using auto-regressive 4D representations from human videos to enable efficient transfer learning for robotic control.

Authors:Zenan Zhai, Hao Li, Xudong Han, Zhenxuan Zhang, Yixuan Zhang, Timothy Baldwin, Haonan Li
Title: RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises
Abstract:
Recent advances in large language models (LLMs) have shown that they can answer questions requiring complex reasoning. However, their ability to identify and respond to text containing logical fallacies or deliberately misleading premises remains less studied. To address this gap, we introduce RuozhiBench, a bilingual dataset comprising 677 carefully curated questions that contain various forms of deceptive reasoning, meticulously crafted through extensive human effort and expert review. In a comprehensive evaluation of 17 LLMs from 5 Series over RuozhiBench using both open-ended and two-choice formats, we conduct extensive analyses on evaluation protocols and result patterns. Despite their high scores on conventional benchmarks, these models showed limited ability to detect and reason correctly about logical fallacies, with even the best-performing model, Claude-3-haiku, achieving only 62% accuracy compared to the human of more than 90%.
Chinese: 大型语言模型在复杂推理方面表现出色,但在识别逻辑谬误方面能力有限,新推出的RuozhiBench显示,即使表现最佳的Claude-3-haiku模型准确率也仅为62%,远低于人类的90%以上。
English: Recent advances in large language models demonstrate their proficiency in complex reasoning, yet they struggle significantly with detecting logical fallacies, as shown by the newly introduced RuozhiBench where even top models like Claude-3-haiku achieved only 62% accuracy compared to humans' over 90%.

Authors:Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, Jiangmiao Pang
Title: HOMIE: Humanoid Loco-Manipulation with Isomorphic Exoskeleton Cockpit
Abstract:
Generalizable humanoid loco-manipulation poses significant challenges, requiring coordinated whole-body control and precise, contact-rich object manipulation. To address this, this paper introduces HOMIE, a semi-autonomous teleoperation system that combines a reinforcement learning policy for body control mapped to a pedal, an isomorphic exoskeleton arm for arm control, and motion-sensing gloves for hand control, forming a unified cockpit to freely operate humanoids and establish a data flywheel. The policy incorporates novel designs, including an upper-body pose curriculum, a height-tracking reward, and symmetry utilization. These features enable the system to perform walking and squatting to specific heights while seamlessly adapting to arbitrary upper-body poses. The exoskeleton, by eliminating the reliance on inverse dynamics, delivers faster and more precise arm control. The gloves utilize Hall sensors instead of servos, allowing even compact devices to achieve 15 or more degrees of freedom and freely adapt to any model of dexterous hands. Compared to previous teleoperation systems, HOMIE stands out for its exceptional efficiency, completing tasks in half the time; its expanded working range, allowing users to freely reach high and low areas as well as interact with any objects; and its affordability, with a price of just $500. The system is fully open-source, demos and code can be found in our https://homietele.github.io/.
中文: HOMIE是一个创新的半自主遥操作系统,它结合了踏板映射强化学习策略控制身体、同构外骨骼控制手臂和运动感应手套控制双手,以高效、经济的方式扩展了仿人机器人的操作范围,并完全开源。
English: HOMIE is an innovative semi-autonomous teleoperation system that integrates a pedal-mapped reinforcement learning policy for body movement, an isomorphic exoskeleton for precise arm control, and motion-sensing gloves for hand manipulation, enabling efficient and affordable humanoid operation with expanded capabilities and full open-source availability.

Authors:Penghui Zhang, Hua Zhang, Yuqi Dai, Cheng Zeng, Jingyu Wang, Jianxin Liao
Title: NTP-INT: Network Traffic Prediction-Driven In-band Network Telemetry for High-load Switches
Abstract:
In-band network telemetry (INT) is essential to network management due to its real-time visibility. However, because of the rapid increase in network devices and services, it has become crucial to have targeted access to detailed network information in a dynamic network environment. This paper proposes an intelligent network telemetry system called NTP-INT to obtain more fine-grained network information on high-load switches. Specifically, NTP-INT consists of three modules: network traffic prediction module, network pruning module, and probe path planning module. Firstly, the network traffic prediction module adopts a Multi-Temporal Graph Neural Network (MTGNN) to predict future network traffic and identify high-load switches. Then, we design the network pruning algorithm to generate a subnetwork covering all high-load switches to reduce the complexity of probe path planning. Finally, the probe path planning module uses an attention-mechanism-based deep reinforcement learning (DEL) model to plan efficient probe paths in the network slice. The experimental results demonstrate that NTP-INT can acquire more precise network information on high-load switches while decreasing the control overhead by 50\%.
中文摘要:NTP-INT智能网络遥测系统通过流量预测、网络剪枝和探测路径规划三大模块,能在高负载交换机上获取更精确的网络信息,同时将控制开销降低50%。
English Summary: NTP-INT is an intelligent network telemetry system that uses traffic prediction, network pruning, and probe path planning to obtain precise data from high-load switches while reducing control overhead by 50%.

Authors:Luca A. Lanzendörfer, Florian Grötschla, Michael Ungersböck, Roger Wattenhofer
Title: High-Fidelity Music Vocoder using Neural Audio Codecs
Abstract:
While neural vocoders have made significant progress in high-fidelity speech synthesis, their application on polyphonic music has remained underexplored. In this work, we propose DisCoder, a neural vocoder that leverages a generative adversarial encoder-decoder architecture informed by a neural audio codec to reconstruct high-fidelity 44.1 kHz audio from mel spectrograms. Our approach first transforms the mel spectrogram into a lower-dimensional representation aligned with the Descript Audio Codec (DAC) latent space before reconstructing it to an audio signal using a fine-tuned DAC decoder. DisCoder achieves state-of-the-art performance in music synthesis on several objective metrics and in a MUSHRA listening study. Our approach also shows competitive performance in speech synthesis, highlighting its potential as a universal vocoder.
中文: DisCoder提出了一种基于GAN的编码器-解码器神经声码器,利用神经音频编解码器从梅尔频谱重建44.1 kHz高保真音频,在音乐合成上达到领先水平,在语音合成中也展现出竞争力。
English: DisCoder introduces a neural vocoder using a GAN-based encoder-decoder with a neural audio codec to reconstruct high-fidelity 44.1 kHz audio from mel spectrograms, achieving state-of-the-art results in music synthesis and competitive performance in speech.

Authors:Peizhuo Li, Hongyi Li, Ge Sun, Jin Cheng, Xinrong Yang, Guillaume Bellegarda, Milad Shafiee, Yuhong Cao, Auke Ijspeert, Guillaume Sartoretti
Title: SATA: Safe and Adaptive Torque-Based Locomotion Policies Inspired by Animal Learning
Abstract:
Despite recent advances in learning-based controllers for legged robots, deployments in human-centric environments remain limited by safety concerns. Most of these approaches use position-based control, where policies output target joint angles that must be processed by a low-level controller (e.g., PD or impedance controllers) to compute joint torques. Although impressive results have been achieved in controlled real-world scenarios, these methods often struggle with compliance and adaptability when encountering environments or disturbances unseen during training, potentially resulting in extreme or unsafe behaviors. Inspired by how animals achieve smooth and adaptive movements by controlling muscle extension and contraction, torque-based policies offer a promising alternative by enabling precise and direct control of the actuators in torque space. In principle, this approach facilitates more effective interactions with the environment, resulting in safer and more adaptable behaviors. However, challenges such as a highly nonlinear state space and inefficient exploration during training have hindered their broader adoption. To address these limitations, we propose SATA, a bio-inspired framework that mimics key biomechanical principles and adaptive learning mechanisms observed in animal locomotion. Our approach effectively addresses the inherent challenges of learning torque-based policies by significantly improving early-stage exploration, leading to high-performance final policies. Remarkably, our method achieves zero-shot sim-to-real transfer. Our experimental results indicate that SATA demonstrates remarkable compliance and safety, even in challenging environments such as soft/slippery terrain or narrow passages, and under significant external disturbances, highlighting its potential for practical deployments in human-centric and safety-critical scenarios.
中文:SATA框架提出了一种仿生扭矩控制方法,通过提升训练效率和实现零样本仿真到现实迁移,显著增强了腿式机器人的安全性与适应性,在复杂环境中展现出卓越性能。
English: The SATA framework introduces a bio-inspired, torque-based control approach for legged robots that enhances safety and adaptability by improving training efficiency and enabling zero-shot sim-to-real transfer, demonstrating robust performance in challenging environments.

Authors:Benedikt Oppeneiger, Manuel Schaller, Karl Worthmann
Title: Spatial decay of perturbations in hyperbolic equations with optimal boundary control
Abstract:
Recently, domain-uniform stabilizability and detectability has been the central assumption %in order robustness results on the to ensure robustness in the sense of exponential decay of spatially localized perturbations in optimally controlled evolution equations. In the present paper we analyze a chain of transport equations with boundary and point controls with regard to this property. Both for Dirichlet and Neumann boundary and coupling conditions, we show a necessary and sufficient criterion on control domains which allow for the domain-uniform stabilization of this equation. We illustrate the results by means of a numerical example.
中文摘要:本文分析了具有边界和点控制的传输方程的领域一致可稳性,在不同边界条件下建立了控制域的必要和充分判据,并通过数值算例进行了验证。
English Summary: This paper analyzes domain-uniform stabilizability for transport equations with boundary and point controls, establishing necessary and sufficient criteria for control domains under various boundary conditions, supported by numerical examples.

Authors:Runze Liu, Chenjia Bai, Jiafei Lyu, Shengjie Sun, Yali Du, Xiu Li
Title: VLP: Vision-Language Preference Learning for Embodied Manipulation
Abstract:
Reward engineering is one of the key challenges in Reinforcement Learning (RL). Preference-based RL effectively addresses this issue by learning from human feedback. However, it is both time-consuming and expensive to collect human preference labels. In this paper, we propose a novel \textbf{V}ision-\textbf{L}anguage \textbf{P}reference learning framework, named \textbf{VLP}, which learns a vision-language preference model to provide preference feedback for embodied manipulation tasks. To achieve this, we define three types of language-conditioned preferences and construct a vision-language preference dataset, which contains versatile implicit preference orders without human annotations. The preference model learns to extract language-related features, and then serves as a preference annotator in various downstream tasks. The policy can be learned according to the annotated preferences via reward learning or direct policy optimization. Extensive empirical results on simulated embodied manipulation tasks demonstrate that our method provides accurate preferences and generalizes to unseen tasks and unseen language instructions, outperforming the baselines by a large margin.
中文摘要:VLP框架提出了一种视觉语言偏好学习模型,能够自主为具身操作任务提供偏好反馈,无需人工标注且在处理未知任务和指令时显著优于基线方法。
English Summary: The VLP framework introduces a vision-language preference model that autonomously generates preference feedback for embodied manipulation tasks, eliminating the need for costly human annotations and demonstrating superior performance in unseen scenarios.

Authors:Chunan Yu, Yidong Han, Chaotao Ding, Ying Zang, Lanyun Zhu, Xinhao Chen, Zejian Li, Renjun Xu, Tianrun Chen
Title: Syllables to Scenes: Literary-Guided Free-Viewpoint 3D Scene Synthesis from Japanese Haiku
Abstract:
In the era of the metaverse, where immersive technologies redefine human experiences, translating abstract literary concepts into navigable 3D environments presents a fundamental challenge in preserving semantic and emotional fidelity. This research introduces HaikuVerse, a novel framework for transforming poetic abstraction into spatial representation, with Japanese Haiku serving as an ideal test case due to its sophisticated encapsulation of profound emotions and imagery within minimal text. While existing text-to-3D methods struggle with nuanced interpretations, we present a literary-guided approach that synergizes traditional poetry analysis with advanced generative technologies. Our framework centers on two key innovations: (1) Hierarchical Literary-Criticism Theory Grounded Parsing (H-LCTGP), which captures both explicit imagery and implicit emotional resonance through structured semantic decomposition, and (2) Progressive Dimensional Synthesis (PDS), a multi-stage pipeline that systematically transforms poetic elements into coherent 3D scenes through sequential diffusion processes, geometric optimization, and real-time enhancement. Extensive experiments demonstrate that HaikuVerse significantly outperforms conventional text-to-3D approaches in both literary fidelity and visual quality, establishing a new paradigm for preserving cultural heritage in immersive digital spaces. Project website at: https://syllables-to-scenes.github.io/
中文: 本研究提出HaikuVerse创新框架,通过结合传统诗歌分析与先进生成技术,将诗歌抽象转化为三维场景,在保持文学意蕴和视觉质量方面显著优于传统文本转3D方法。
English: This research introduces HaikuVerse, a novel framework that transforms poetic abstraction into 3D environments by synergizing literary analysis with generative technologies, significantly outperforming conventional methods in preserving semantic and emotional fidelity.

Authors:Andong Chen, Yuchen Song, Wenxin Zhu, Kehai Chen, Muyun Yang, Tiejun Zhao, Min zhang
Title: Evaluating o1-Like LLMs: Unlocking Reasoning for Translation through Comprehensive Analysis
Abstract:
The o1-Like LLMs are transforming AI by simulating human cognitive processes, but their performance in multilingual machine translation (MMT) remains underexplored. This study examines: (1) how o1-Like LLMs perform in MMT tasks and (2) what factors influence their translation quality. We evaluate multiple o1-Like LLMs and compare them with traditional models like ChatGPT and GPT-4o. Results show that o1-Like LLMs establish new multilingual translation benchmarks, with DeepSeek-R1 surpassing GPT-4o in contextless tasks. They demonstrate strengths in historical and cultural translation but exhibit a tendency for rambling issues in Chinese-centric outputs. Further analysis reveals three key insights: (1) High inference costs and slower processing speeds make complex translation tasks more resource-intensive. (2) Translation quality improves with model size, enhancing commonsense reasoning and cultural translation. (3) The temperature parameter significantly impacts output quality-lower temperatures yield more stable and accurate translations, while higher temperatures reduce coherence and precision.
中文:研究发现o1系列大模型在多语言机器翻译中创下新标杆,DeepSeek-R1在无上下文任务中超越GPT-4o,但也面临高计算成本、中文输出冗长等问题,而翻译质量随模型增大和温度参数降低而提升。
English: This study finds that o1-Like LLMs set new benchmarks in multilingual machine translation, with DeepSeek-R1 outperforming GPT-4o in context-free tasks, though they face challenges like high computational costs and rambling in Chinese outputs, while translation quality improves with larger models and lower temperature settings.

Authors:Hui Huang, Jiaheng Liu, Yancheng He, Shilong Li, Bing Xu, Conghui Zhu, Muyun Yang, Tiejun Zhao
Title: MuSC: Improving Complex Instruction Following with Multi-granularity Self-Contrastive Training
Abstract:
Complex instruction-following with elaborate constraints is imperative for Large Language Models (LLMs). While existing methods have constructed data for complex instruction alignment, they all rely on a more advanced model, especially GPT-4, limiting their application. In this paper, we propose a Multi-granularity Self-Contrastive Training (MuSC) framework, to improve the complex instruction alignment without relying on a stronger model. Our method is conducted on both coarse and fine granularity. On coarse-granularity, we construct constraint-aware preference data based on instruction decomposition and recombination. On fine-granularity, we perform token-aware preference optimization with dynamic token-level supervision. Our method is evaluated on open-sourced models, and experiment results show our method achieves significant improvement on both complex and general instruction-following benchmarks, surpassing previous self-alignment methods.
中文摘要:本文提出多粒度自对比训练框架,通过约束感知的粗粒度优化和动态令牌级的细粒度监督,在不依赖更强模型的情况下显著提升了大型语言模型的复杂指令对齐能力。
English Summary: This paper introduces the Multi-granularity Self-Contrastive Training (MuSC) framework to enhance complex instruction alignment in LLMs without relying on superior models, achieving significant improvements through constraint-aware and token-level optimization techniques.

Authors:Hieu Nguyen, Zihao He, Shoumik Atul Gandre, Ujjwal Pasupulety, Sharanya Kumari Shivakumar, Kristina Lerman
Title: Smoothing Out Hallucinations: Mitigating LLM Hallucination with Smoothed Knowledge Distillation
Abstract:
Large language models (LLMs) often suffer from hallucination, generating factually incorrect or ungrounded content, which limits their reliability in high-stakes applications. A key factor contributing to hallucination is the use of hard labels during training, which enforce deterministic supervision, encourage overconfidence, and disregard the uncertainty inherent in natural language. To address this, we propose mitigating hallucination through knowledge distillation (KD), where a teacher model provides smoothed soft labels to a student model, reducing overconfidence and improving factual grounding. We apply KD during supervised finetuning on instructional data, evaluating its effectiveness across LLMs from different families. Experimental results on summarization benchmarks demonstrate that KD reduces hallucination compared to standard finetuning while preserving performance on general NLP tasks. These findings highlight KD as a promising approach for mitigating hallucination in LLMs and improving model reliability.
Chinese: 通过教师模型提供的软标签进行知识蒸馏,可在监督微调中减少大语言模型的幻觉现象,在保持通用自然语言处理任务性能的同时增强事实依据。
English: Knowledge distillation with soft labels from a teacher model reduces hallucination in large language models during supervised finetuning, improving factual grounding while maintaining general NLP task performance.

Authors:Kaikai Zhao, Zhaoxiang Liu, Xuejiao Lei, Jiaojiao Zhao, Zhenhong Long, Zipeng Wang, Ning Wang, Meijuan An, Qingliang Meng, Peijun Yang, Minjie Hua, Chaoyang Ma, Wen Liu, Kai Wang, Shiguo Lian
Title: Quantifying the Capability Boundary of DeepSeek Models: An Application-Driven Performance Analysis
Abstract:
DeepSeek-R1, known for its low training cost and exceptional reasoning capabilities, has achieved state-of-the-art performance on various benchmarks. However, detailed evaluations for DeepSeek Series models from the perspective of real-world applications are lacking, making it challenging for users to select the most suitable DeepSeek models for their specific needs. To address this gap, we presents the first comprehensive evaluation of the DeepSeek and its related models (including DeepSeek-V3, DeepSeek-R1, DeepSeek-R1-Distill-Qwen series, DeepSeek-R1-Distill-Llama series, their corresponding 4-bit quantized models, and the reasoning model QwQ-32B) using our enhanced A-Eval benchmark, A-Eval-2.0. Our systematic analysis reveals several key insights: (1) Given identical model architectures and training data, larger parameter models demonstrate superior performance, aligning with the scaling law. However, smaller models may achieve enhanced capabilities when employing optimized training strategies and higher-quality data; (2) Reasoning-enhanced model show significant performance gains in logical reasoning tasks but may underperform in text understanding and generation tasks; (3) As the data difficulty increases, distillation or reasoning enhancements yield higher performance gains for the models. Interestingly, reasoning enhancements can even have a negative impact on simpler problems; (4) Quantization impacts different capabilities unevenly, with significant drop on logical reasoning and minimal impact on text generation. Based on these results and findings, we design an model selection handbook enabling users to select the most cost-effective models without efforts.
中文: DeepSeek-R1虽具备低成本训练和卓越推理能力,但缺乏实际应用评估,为此我们通过系统研究揭示了不同模型变体的性能特点,并制定了便捷的选型手册。
English: DeepSeek-R1 offers cost-effective training and top-tier reasoning but lacks real-world application guidance, prompting a comprehensive evaluation that reveals key performance insights across model variants and introduces a practical selection handbook.

Authors:Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Ning Wang, Zhenhong Long, Peijun Yang, Jiaojiao Zhao, Minjie Hua, Chaoyang Ma, Kai Wang, Shiguo Lian
Title: Safety Evaluation of DeepSeek Models in Chinese Contexts
Abstract:
Recently, the DeepSeek series of models, leveraging their exceptional reasoning capabilities and open-source strategy, is reshaping the global AI landscape. Despite these advantages, they exhibit significant safety deficiencies. Research conducted by Robust Intelligence, a subsidiary of Cisco, in collaboration with the University of Pennsylvania, revealed that DeepSeek-R1 has a 100\% attack success rate when processing harmful prompts. Additionally, multiple safety companies and research institutions have confirmed critical safety vulnerabilities in this model. As models demonstrating robust performance in Chinese and English, DeepSeek models require equally crucial safety assessments in both language contexts. However, current research has predominantly focused on safety evaluations in English environments, leaving a gap in comprehensive assessments of their safety performance in Chinese contexts. In response to this gap, this study introduces CHiSafetyBench, a Chinese-specific safety evaluation benchmark. This benchmark systematically evaluates the safety of DeepSeek-R1 and DeepSeek-V3 in Chinese contexts, revealing their performance across safety categories. The experimental results quantify the deficiencies of these two models in Chinese contexts, providing key insights for subsequent improvements. It should be noted that, despite our efforts to establish a comprehensive, objective, and authoritative evaluation benchmark, the selection of test samples, characteristics of data distribution, and the setting of evaluation criteria may inevitably introduce certain biases into the evaluation results. We will continuously optimize the evaluation benchmark and periodically update this report to provide more comprehensive and accurate assessment outcomes. Please refer to the latest version of the paper for the most recent evaluation results and conclusions.
中文: DeepSeek模型存在严重安全隐患,为此研究团队开发了CHiSafetyBench中文安全评估基准,系统评估发现模型在中文环境下存在明显安全缺陷,需后续改进。
English: The DeepSeek models exhibit critical safety vulnerabilities, prompting the creation of CHiSafetyBench to systematically evaluate their safety performance in Chinese contexts and reveal significant deficiencies requiring improvement.

Authors:Xiangyu Lu, Wang Xu, Haoyu Wang, Hongyun Zhou, Haiyan Zhao, Conghui Zhu, Tiejun Zhao, Muyun Yang
Title: DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities
Abstract:
Real-time speech conversation is essential for natural and efficient human-machine interactions, requiring duplex and streaming capabilities. Traditional Transformer-based conversational chatbots operate in a turn-based manner and exhibit quadratic computational complexity that grows as the input size increases. In this paper, we propose DuplexMamba, a Mamba-based end-to-end multimodal duplex model for speech-to-text conversation. DuplexMamba enables simultaneous input processing and output generation, dynamically adjusting to support real-time streaming. Specifically, we develop a Mamba-based speech encoder and adapt it with a Mamba-based language model. Furthermore, we introduce a novel duplex decoding strategy that enables DuplexMamba to process input and generate output simultaneously. Experimental results demonstrate that DuplexMamba successfully implements duplex and streaming capabilities while achieving performance comparable to several recently developed Transformer-based models in automatic speech recognition (ASR) tasks and voice assistant benchmark evaluations. Our code and model are released.
中文: DuplexMamba是一种基于Mamba的多模态模型,能够实现实时语音转文本对话,同时处理输入和生成输出,在自动语音识别任务中达到与基于Transformer的模型相当的性能。
English: DuplexMamba is a Mamba-based multimodal model that enables real-time speech-to-text conversation with simultaneous input processing and output generation, achieving performance comparable to Transformer-based models in ASR tasks.

Authors:Zhengyan Sheng, Zhihao Du, Shiliang Zhang, Zhijie Yan, Yexin Yang, Zhenhua Ling
Title: SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer
Abstract:
This paper presents a dual-stream text-to-speech (TTS) model, SyncSpeech, capable of receiving streaming text input from upstream models while simultaneously generating streaming speech, facilitating seamless interaction with large language models. SyncSpeech has the following advantages: Low latency, as it begins generating streaming speech upon receiving the second text token; High efficiency, as it decodes all speech tokens corresponding to the each arrived text token in one step. To achieve this, we propose a temporal masked transformer as the backbone of SyncSpeech, combined with token-level duration prediction to predict speech tokens and the duration for the next step. Additionally, we design a two-stage training strategy to improve training efficiency and the quality of generated speech. We evaluated the SyncSpeech on both English and Mandarin datasets. Compared to the recent dual-stream TTS models, SyncSpeech significantly reduces the first packet delay of speech tokens and accelerates the real-time factor. Moreover, with the same data scale, SyncSpeech achieves performance comparable to that of traditional autoregressive-based TTS models in terms of both speech quality and robustness. Speech samples are available at https://SyncSpeech.github.io/}{https://SyncSpeech.github.io/.
中文: SyncSpeech是一种双流文本转语音模型,通过逐步处理文本标记实现低延迟的流式语音生成,在显著减少延迟的同时,其语音质量和鲁棒性可与传统自回归TTS模型相媲美。
English: SyncSpeech is a dual-stream text-to-speech model that enables low-latency streaming speech generation by processing text tokens incrementally, achieving performance comparable to traditional TTS models while significantly reducing delays.

Authors:Xu Shen, Yixin Liu, Yili Wang, Rui Miao, Yiwei Dai, Shirui Pan, Yi Chang, Xin Wang
Title: Raising the Bar in Graph OOD Generalization: Invariant Learning Beyond Explicit Environment Modeling
Abstract:
Out-of-distribution (OOD) generalization has emerged as a critical challenge in graph learning, as real-world graph data often exhibit diverse and shifting environments that traditional models fail to generalize across. A promising solution to address this issue is graph invariant learning (GIL), which aims to learn invariant representations by disentangling label-correlated invariant subgraphs from environment-specific subgraphs. However, existing GIL methods face two major challenges: (1) the difficulty of capturing and modeling diverse environments in graph data, and (2) the semantic cliff, where invariant subgraphs from different classes are difficult to distinguish, leading to poor class separability and increased misclassifications. To tackle these challenges, we propose a novel method termed Multi-Prototype Hyperspherical Invariant Learning (MPHIL), which introduces two key innovations: (1) hyperspherical invariant representation extraction, enabling robust and highly discriminative hyperspherical invariant feature extraction, and (2) multi-prototype hyperspherical classification, which employs class prototypes as intermediate variables to eliminate the need for explicit environment modeling in GIL and mitigate the semantic cliff issue. Derived from the theoretical framework of GIL, we introduce two novel objective functions: the invariant prototype matching loss to ensure samples are matched to the correct class prototypes, and the prototype separation loss to increase the distinction between prototypes of different classes in the hyperspherical space. Extensive experiments on 11 OOD generalization benchmark datasets demonstrate that MPHIL achieves state-of-the-art performance, significantly outperforming existing methods across graph data from various domains and with different distribution shifts.
中文摘要:图不变学习面临环境多样性建模和语义悬崖两大挑战,所提出的MPHIL方法通过超球面不变表示提取和多原型分类技术,在分布外泛化任务中实现了最先进的性能表现。
English Summary: Graph invariant learning faces challenges in modeling diverse environments and overcoming the semantic cliff, which the proposed MPHIL method addresses through hyperspherical invariant representation extraction and multi-prototype classification to achieve state-of-the-art OOD generalization performance.

Authors:Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S. Corrado, Dale R. Webster, Shravya Shetty, Shruthi Prabhakara, Yun Liu, Daniel Golden, Ellery Wulczyn, David F. Steiner
Title: PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation
Abstract:
The interpretation of histopathology cases underlies many important diagnostic and treatment decisions in medicine. Notably, this process typically requires pathologists to integrate and summarize findings across multiple slides per case. Existing vision-language capabilities in computational pathology have so far been largely limited to small regions of interest, larger regions at low magnification, or single whole-slide images (WSIs). This limits interpretation of findings that span multiple high-magnification regions across multiple WSIs. By making use of Gemini 1.5 Flash, a large multimodal model (LMM) with a 1-million token context window, we demonstrate the ability to generate bottom-line diagnoses from up to 40,000 768x768 pixel image patches from multiple WSIs at 10X magnification. This is the equivalent of up to 11 hours of video at 1 fps. Expert pathologist evaluations demonstrate that the generated report text is clinically accurate and equivalent to or preferred over the original reporting for 68% (95% CI: [60%, 76%]) of multi-slide examples with up to 5 slides. While performance decreased for examples with 6 or more slides, this study demonstrates the promise of leveraging the long-context capabilities of modern LMMs for the uniquely challenging task of medical report generation where each case can contain thousands of image patches.
中文: 本研究证明,Gemini 1.5 Flash模型能够通过分析多个全切片图像中多达4万个高倍率图像区块来生成临床精准的诊断报告,在68%的多切片案例中,病理学家认为AI生成的报告优于或等同于原始报告。
English: This study demonstrates that the Gemini 1.5 Flash model can generate clinically accurate diagnostic reports by analyzing up to 40,000 high-magnification image patches across multiple whole-slide images, with pathologists preferring or equating these AI-generated reports to original ones in 68% of multi-slide cases.

Authors:Rishabh Tiwari, Haocheng Xi, Aditya Tomar, Coleman Hooper, Sehoon Kim, Maxwell Horton, Mahyar Najibi, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Title: QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
Abstract:
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings, creating a growing need for fast and efficient long-context inference. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency, as the full KV cache must be loaded for each decoding step. While speculative decoding is a widely accepted technique to accelerate autoregressive decoding, existing methods often struggle to achieve significant speedups due to inefficient KV cache optimization strategies and result in low acceptance rates. To address these challenges, we propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration. QuantSpec maintains high acceptance rates ($>$90%) and reliably provides consistent end-to-end speedups upto $\sim2.5\times$, outperforming other self-speculative decoding methods that use sparse KV cache for long-context LLM inference. QuantSpec also reduces the memory requirements by $\sim 1.3\times$ compared to these alternatives.
中文摘要:QuantSpec提出了一种自推测解码框架,通过分层4位量化的KV缓存和权重来加速长上下文大语言模型推理,在保持超过90%接受率的同时实现高达2.5倍的端到端加速,相比稀疏KV缓存方法还能降低约1.3倍内存需求。
English Summary: QuantSpec introduces a self-speculative decoding framework using hierarchical 4-bit quantization of KV cache and weights to accelerate long-context LLM inference, achieving up to 2.5× speedup with over 90% acceptance rate while reducing memory usage by 1.3× compared to sparse KV cache methods.

Authors:Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li
Title: Large Language Diffusion Models
Abstract:
Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs. Project page and codes: https://ml-gsai.github.io/LLaDA-demo/.
Chinese: LLaDA作为一种基于扩散的模型,通过在语言任务中展现竞争力并解决如逆向诅咒等问题,挑战了自回归模型的地位,为大型语言模型提供了可行的替代方案。
English: LLaDA, a diffusion-based model, challenges autoregressive models by demonstrating competitive performance in language tasks and addressing issues like the reversal curse, offering a viable alternative for large language models.

Authors:Naoki Chihara, Yasuko Matsubara, Ren Fujiwara, Yasushi Sakurai
Title: Modeling Time-evolving Causality over Data Streams
Abstract:
Given an extensive, semi-infinite collection of multivariate coevolving data sequences (e.g., sensor/web activity streams) whose observations influence each other, how can we discover the time-changing cause-and-effect relationships in co-evolving data streams? How efficiently can we reveal dynamical patterns that allow us to forecast future values? In this paper, we present a novel streaming method, ModePlait, which is designed for modeling such causal relationships (i.e., time-evolving causality) in multivariate co-evolving data streams and forecasting their future values. The solution relies on characteristics of the causal relationships that evolve over time in accordance with the dynamic changes of exogenous variables. ModePlait has the following properties: (a) Effective: it discovers the time-evolving causality in multivariate co-evolving data streams by detecting the transitions of distinct dynamical patterns adaptively. (b) Accurate: it enables both the discovery of time-evolving causality and the forecasting of future values in a streaming fashion. (c) Scalable: our algorithm does not depend on data stream length and thus is applicable to very large sequences. Extensive experiments on both synthetic and real-world datasets demonstrate that our proposed model outperforms state-of-the-art methods in terms of discovering the time-evolving causality as well as forecasting.
中文: 本文提出ModePlait这一新型流式方法,能有效发现多元协同演化数据流中的时变因果关系并准确预测未来值,实验证明其在因果关系发现和预测方面均优于现有先进方法。
English: This paper introduces ModePlait, a novel streaming method that effectively discovers time-evolving causal relationships in multivariate co-evolving data streams and accurately forecasts future values, demonstrating superior performance over state-of-the-art methods in experiments.

Authors:Xiaoshen Han, Minghuan Liu, Yilun Chen, Junqiu Yu, Xiaoyang Lyu, Yang Tian, Bolun Wang, Weinan Zhang, Jiangmiao Pang
Title: Re$^3$Sim: Generating High-Fidelity Simulation Data via 3D-Photorealistic Real-to-Sim for Robotic Manipulation
Abstract:
Real-world data collection for robotics is costly and resource-intensive, requiring skilled operators and expensive hardware. Simulations offer a scalable alternative but often fail to achieve sim-to-real generalization due to geometric and visual gaps. To address these challenges, we propose a 3D-photorealistic real-to-sim system, namely, RE$^3$SIM, addressing geometric and visual sim-to-real gaps. RE$^3$SIM employs advanced 3D reconstruction and neural rendering techniques to faithfully recreate real-world scenarios, enabling real-time rendering of simulated cross-view cameras within a physics-based simulator. By utilizing privileged information to collect expert demonstrations efficiently in simulation, and train robot policies with imitation learning, we validate the effectiveness of the real-to-sim-to-real pipeline across various manipulation task scenarios. Notably, with only simulated data, we can achieve zero-shot sim-to-real transfer with an average success rate exceeding 58%. To push the limit of real-to-sim, we further generate a large-scale simulation dataset, demonstrating how a robust policy can be built from simulation data that generalizes across various objects. Codes and demos are available at: http://xshenhan.github.io/Re3Sim/.
Chinese: RE³SIM系统通过三维重建和神经渲染技术创建逼真模拟环境,有效缩小仿真与现实的差距,利用模仿学习训练机器人策略,在零样本条件下实现超过58%的成功率迁移至现实任务。
English: The RE³SIM system overcomes the sim-to-real gap in robotics by using 3D reconstruction and neural rendering to create photorealistic simulations, enabling effective policy training with imitation learning and achieving over 58% zero-shot transfer success to real-world tasks.

Authors:Qingyuan Wu, Jianheng Liu, Jianye Hao, Jun Wang, Kun Shao
Title: Advancing Autonomous VLM Agents via Variational Subgoal-Conditioned Reinforcement Learning
Abstract:
State-of-the-art (SOTA) reinforcement learning (RL) methods have enabled vision-language model (VLM) agents to learn from interaction with online environments without human supervision. However, these methods often struggle with learning inefficiencies when applied to complex, real-world decision-making tasks with sparse rewards and long-horizon dependencies. We propose a novel framework, Variational Subgoal-Conditioned Reinforcement Learning (VSC-RL), advancing the VLM agents in resolving challenging decision-making tasks. Fundamentally distinct from existing methods, VSC-RL reformulates the decision-making problem as a variational subgoal-conditioned RL problem with the newly derived optimization objective, Subgoal Evidence Lower BOund (SGC-ELBO), which comprises two key components: (a) maximizing the subgoal-conditioned return, and (b) minimizing the divergence from a reference goal-conditioned policy. We theoretically and empirically demonstrate that the VSC-RL can efficiently improve the learning efficiency without compromising performance guarantees. Across a diverse set of challenging benchmarks, including mobile device and web control tasks, VSC-RL consistently outperforms existing SOTA methods, achieving superior learning efficiency and performance.
Chinese Summary: 提出的VSC-RL框架通过将强化学习重新定义为具有新型优化目标的变分子目标条件问题,有效提升了视觉语言模型代理在复杂任务中的决策效率。
English Summary: The proposed VSC-RL framework enhances vision-language model agents' decision-making efficiency in complex tasks by reformulating reinforcement learning as a variational subgoal-conditioned problem with a novel optimization objective.

Authors:Jasmine Chiat Ling Ong, Yilin Ning, Mingxuan Liu, Yian Ma, Zhao Liang, Kuldev Singh, Robert T Chang, Silke Vogel, John CW Lim, Iris Siu Kwan Tan, Oscar Freyer, Stephen Gilbert, Danielle S Bitterman, Xiaoxuan Liu, Alastair K Denniston, Nan Liu
Title: Regulatory Science Innovation for Generative AI and Large Language Models in Health and Medicine: A Global Call for Action
Abstract:
The integration of generative AI (GenAI) and large language models (LLMs) in healthcare presents both unprecedented opportunities and challenges, necessitating innovative regulatory approaches. GenAI and LLMs offer broad applications, from automating clinical workflows to personalizing diagnostics. However, the non-deterministic outputs, broad functionalities and complex integration of GenAI and LLMs challenge existing medical device regulatory frameworks, including the total product life cycle (TPLC) approach. Here we discuss the constraints of the TPLC approach to GenAI and LLM-based medical device regulation, and advocate for global collaboration in regulatory science research. This serves as the foundation for developing innovative approaches including adaptive policies and regulatory sandboxes, to test and refine governance in real-world settings. International harmonization, as seen with the International Medical Device Regulators Forum, is essential to manage implications of LLM on global health, including risks of widening health inequities driven by inherent model biases. By engaging multidisciplinary expertise, prioritizing iterative, data-driven approaches, and focusing on the needs of diverse populations, global regulatory science research enables the responsible and equitable advancement of LLM innovations in healthcare.
Chinese: 生成式AI和大语言模型在医疗领域的整合带来了巨大机遇,但也对现有监管框架构成挑战,需要通过全球协作和适应性政策来推动负责任且公平的创新。
English: The integration of generative AI and large language models in healthcare offers significant opportunities but challenges current regulatory frameworks, requiring global collaboration and adaptive policies to ensure responsible and equitable innovation.

Authors:Shuzheng Si, Haozhe Zhao, Gang Chen, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Kaikai An, Kangyang Luo, Chen Qian, Fanchao Qi, Baobao Chang, Maosong Sun
Title: Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering
Abstract:
Training LLMs on data containing unfamiliar knowledge during the instruction tuning stage can encourage hallucinations. To address this challenge, we introduce NOVA, a novel framework designed to identify high-quality data that aligns well with the LLM's learned knowledge to reduce hallucinations. NOVA includes Internal Consistency Probing (ICP) and Semantic Equivalence Identification (SEI) to measure how familiar the LLM is with instruction data. Specifically, ICP evaluates the LLM's understanding of the given instruction by calculating the tailored consistency among multiple self-generated responses. SEI further assesses the familiarity of the LLM with the target response by comparing it to the generated responses, using the proposed semantic clustering and well-designed voting strategy. Finally, to ensure the quality of selected samples, we introduce an expert-aligned reward model, considering characteristics beyond just familiarity. By considering data quality and avoiding unfamiliar data, we can utilize the selected data to effectively align LLMs to follow instructions and hallucinate less.
中文摘要:在指令微调阶段对LLMs使用包含陌生知识的数据会引发幻觉,因此NOVA框架通过筛选与模型已学知识一致的高质量数据来有效减少幻觉现象。
English Summary: Training LLMs on unfamiliar data during instruction tuning can cause hallucinations, so the NOVA framework identifies high-quality data aligned with the model's knowledge to reduce this issue.

Authors:William F. Shen, Xinchi Qiu, Meghdad Kurmanji, Alex Iacob, Lorenzo Sani, Yihong Chen, Nicola Cancedda, Nicholas D. Lane
Title: LUNAR: LLM Unlearning via Neural Activation Redirection
Abstract:
Large Language Models (LLMs) benefit from training on ever larger amounts of textual data, but as a result, they increasingly incur the risk of leaking private information. The ability to selectively remove knowledge from LLMs is, therefore, a highly desirable capability. In this paper, we propose LUNAR, a novel unlearning methodology grounded in the Linear Representation Hypothesis. LUNAR operates by redirecting the representations of unlearned data to regions that trigger the model's inherent ability to express its inability to answer. LUNAR achieves state-of-the-art unlearning performance while significantly enhancing the controllability of the unlearned model during inference. Specifically, LUNAR achieves between 2.9x to 11.7x improvements on combined "unlearning efficacy" and "model utility" score ("Deviation Score") on the PISTOL dataset across various base models. We also demonstrate, through quantitative analysis and qualitative examples, LUNAR's superior controllability in generating coherent and contextually aware responses, mitigating undesired side effects of existing methods. Moreover, we demonstrate that LUNAR is robust against white-box adversarial attacks and versatile in handling real-world scenarios, such as processing sequential unlearning requests.
中文:LUNAR是一种新颖的遗忘方法,通过将未学习数据的表征重定向到表达无法回答的激活区域,实现了最先进的遗忘性能、卓越的可控性以及更高的效率和鲁棒性。
English: LUNAR is a novel unlearning method that redirects representations of unlearned data to activation regions expressing inability to answer, achieving state-of-the-art performance, superior controllability, and enhanced efficiency and robustness.

Authors:William F. Shen, Xinchi Qiu, Meghdad Kurmanji, Alex Iacob, Lorenzo Sani, Yihong Chen, Nicola Cancedda, Nicholas D. Lane
Title: LLM Unlearning via Neural Activation Redirection
Abstract:
The ability to selectively remove knowledge from LLMs is highly desirable. However, existing methods often struggle with balancing unlearning efficacy and retain model utility, and lack controllability at inference time to emulate base model behavior as if it had never seen the unlearned data. In this paper, we propose LUNAR, a novel unlearning method grounded in the Linear Representation Hypothesis and operates by redirecting the representations of unlearned data to activation regions that expresses its inability to answer. We show that contrastive features are not a prerequisite for effective activation redirection, and LUNAR achieves state-of-the-art unlearning performance and superior controllability. Specifically, LUNAR achieves between 2.9x and 11.7x improvement in the combined unlearning efficacy and model utility score (Deviation Score) across various base models and generates coherent, contextually appropriate responses post-unlearning. Moreover, LUNAR effectively reduces parameter updates to a single down-projection matrix, a novel design that significantly enhances efficiency by 20x and robustness. Finally, we demonstrate that LUNAR is robust to white-box adversarial attacks and versatile in real-world scenarios, including handling sequential unlearning requests.
中文:LUNAR是一种新颖的遗忘方法,通过将未学习数据的表征重定向到表达无法回答的激活区域,实现了最先进的遗忘性能、卓越的可控性以及更高的效率和鲁棒性。
English: LUNAR is a novel unlearning method that redirects representations of unlearned data to activation regions expressing inability to answer, achieving state-of-the-art performance, superior controllability, and enhanced efficiency and robustness.

Authors:David Noever, Forrest McKee
Title: Forbidden Science: Dual-Use AI Challenge Benchmark and Scientific Refusal Tests
Abstract:
The development of robust safety benchmarks for large language models requires open, reproducible datasets that can measure both appropriate refusal of harmful content and potential over-restriction of legitimate scientific discourse. We present an open-source dataset and testing framework for evaluating LLM safety mechanisms across mainly controlled substance queries, analyzing four major models' responses to systematically varied prompts. Our results reveal distinct safety profiles: Claude-3.5-sonnet demonstrated the most conservative approach with 73% refusals and 27% allowances, while Mistral attempted to answer 100% of queries. GPT-3.5-turbo showed moderate restriction with 10% refusals and 90% allowances, and Grok-2 registered 20% refusals and 80% allowances. Testing prompt variation strategies revealed decreasing response consistency, from 85% with single prompts to 65% with five variations. This publicly available benchmark enables systematic evaluation of the critical balance between necessary safety restrictions and potential over-censorship of legitimate scientific inquiry, while providing a foundation for measuring progress in AI safety implementation. Chain-of-thought analysis reveals potential vulnerabilities in safety mechanisms, highlighting the complexity of implementing robust safeguards without unduly restricting desirable and valid scientific discourse.
Chinese: 本研究提出开源数据集与测试框架评估大语言模型安全机制,发现各模型拒绝率差异显著——Claude-3.5-sonnet达73%而Mistral为零,揭示了安全防护与科学自由之间的平衡难题。
English: This study introduces an open-source dataset and framework to evaluate LLM safety mechanisms, revealing varying refusal rates across models—from Claude-3.5-sonnet's 73% to Mistral's 0%—and highlighting the challenge of balancing safety with scientific freedom.

Authors:Bojia Zi, Penghui Ruan, Marco Chen, Xianbiao Qi, Shaozhe Hao, Shihao Zhao, Youze Huang, Bin Liang, Rong Xiao, Kam-Fai Wong
Title: Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists
Abstract:
Recent advancements in video generation have spurred the development of video editing techniques, which can be divided into inversion-based and end-to-end methods. However, current video editing methods still suffer from several challenges. Inversion-based methods, though training-free and flexible, are time-consuming during inference, struggle with fine-grained editing instructions, and produce artifacts and jitter. On the other hand, end-to-end methods, which rely on edited video pairs for training, offer faster inference speeds but often produce poor editing results due to a lack of high-quality training video pairs. In this paper, to close the gap in end-to-end methods, we introduce Señorita-2M, a high-quality video editing dataset. Señorita-2M consists of approximately 2 millions of video editing pairs. It is built by crafting four high-quality, specialized video editing models, each crafted and trained by our team to achieve state-of-the-art editing results. We also propose a filtering pipeline to eliminate poorly edited video pairs. Furthermore, we explore common video editing architectures to identify the most effective structure based on current pre-trained generative model. Extensive experiments show that our dataset can help to yield remarkably high-quality video editing results. More details are available at https://senorita-2m-dataset.github.io.
中文: 本文提出了Señorita-2M高质量视频编辑数据集,通过提供200万精炼视频对并探索最优编辑架构,有效解决了端到端方法因训练数据不足导致的编辑效果不佳问题,显著提升了视频编辑质量。
English: This paper introduces Señorita-2M, a high-quality video editing dataset designed to overcome the limitations of existing end-to-end methods by providing 2 million refined video pairs and identifying optimal editing architectures, resulting in superior video editing outcomes.

Authors:Georgios Papoudakis, Thomas Coste, Zhihao Wu, Jianye Hao, Jun Wang, Kun Shao
Title: AppVLM: A Lightweight Vision Language Model for Online App Control
Abstract:
The utilisation of foundation models as smartphone assistants, termed app agents, is a critical research challenge. These agents aim to execute human instructions on smartphones by interpreting textual instructions and performing actions via the device's interface. While promising, current approaches face significant limitations. Methods that use large proprietary models, such as GPT-4o, are computationally expensive, while those that use smaller fine-tuned models often lack adaptability to out-of-distribution tasks. In this work, we introduce AppVLM, a lightweight Vision-Language Model (VLM). First, we fine-tune it offline on the AndroidControl dataset. Then, we refine its policy by collecting data from the AndroidWorld environment and performing further training iterations. Our results indicate that AppVLM achieves the highest action prediction accuracy in offline evaluation on the AndroidControl dataset, compared to all evaluated baselines, and matches GPT-4o in online task completion success rate in the AndroidWorld environment, while being up to ten times faster. This makes AppVLM a practical and efficient solution for real-world deployment.
中文: AppVLM是一种轻量级视觉语言模型,在离线评估中达到最高动作预测准确率,在线任务成功率与GPT-4o相当但速度快十倍,为智能手机应用代理提供了实用高效的解决方案。
English: AppVLM is a lightweight vision-language model that achieves top action prediction accuracy offline and matches GPT-4o's task success rate online while being ten times faster, offering a practical solution for smartphone app agents.

Authors:Chaoqun Liu, Wenxuan Zhang, Jiahao Ying, Mahani Aljunied, Anh Tuan Luu, Lidong Bing
Title: SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia
Abstract:
This study introduces two novel benchmarks, SeaExam and SeaBench, designed to evaluate the capabilities of Large Language Models (LLMs) in Southeast Asian (SEA) application scenarios. Unlike existing multilingual datasets primarily derived from English translations, these benchmarks are constructed based on real-world scenarios from SEA regions. SeaExam draws from regional educational exams to form a comprehensive dataset that encompasses subjects such as local history and literature. In contrast, SeaBench is crafted around multi-turn, open-ended tasks that reflect daily interactions within SEA communities. Our evaluations demonstrate that SeaExam and SeaBench more effectively discern LLM performance on SEA language tasks compared to their translated benchmarks. This highlights the importance of using real-world queries to assess the multilingual capabilities of LLMs.
中文摘要:本研究提出了基于东南亚真实场景构建的SeaExam和SeaBench两大基准,相比翻译数据集能更有效评估大语言模型在当地区域性任务中的实际表现。
English Summary: This research introduces SeaExam and SeaBench, two benchmarks developed from Southeast Asian real-world contexts to more accurately evaluate Large Language Models' performance on regional languages and tasks than translation-based datasets.

Authors:Zeman Li, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni
Title: PiKE: Adaptive Data Mixing for Large-Scale Multi-Task Learning Under Low Gradient Conflicts
Abstract:
Modern foundation models are trained on diverse datasets to enhance generalization across tasks and domains A central challenge in this process is determining how to effectively mix and sample data from multiple sources This naturally leads to a multitask learning (MTL) perspective While prior work in MTL has emphasized mitigating gradient conflicts we observe that largescale pretraining scenariossuch as multilingual or multidomain trainingoften exhibit little to no gradient conflict Motivated by this observation we propose PiKE (Positive gradient interaction-based K-task weights Estimator) an adaptive data mixing algorithm that dynamically adjusts sampling weights during training PiKE exploits nonconflicting gradient interactions to minimize a neartight upper bound on the average loss decrease at each step while incurring negligible computational overhead We provide theoretical convergence guarantees and show that PiKE outperforms static and nonadaptive mixing baselines Furthermore we extend PiKE to promote balanced learning across tasks Extensive experiments on largescale language model pretraining confirm that PiKE achieves faster convergence and improved downstream performance compared to existing approaches
中文摘要:PiKE是一种自适应数据混合算法,通过利用正梯度交互动态调整采样权重,在大规模预训练中实现更快的收敛和更优的下游性能,优于静态和非自适应基线方法。
English Summary: PiKE is an adaptive data mixing algorithm that dynamically adjusts sampling weights by leveraging positive gradient interactions to enhance convergence and performance in large-scale pretraining, outperforming static and nonadaptive methods.

Authors:Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, Hao Zhang
Title: Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile
Abstract:
Despite the promise of synthesizing high-fidelity videos, Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps. For example, the popular Open-Sora-Plan model consumes more than 9 minutes for generating a single video of 29 frames. This paper addresses the inefficiency issue from two aspects: 1) Prune the 3D full attention based on the redundancy within video data; We identify a prevalent tile-style repetitive pattern in the 3D attention maps for video data, and advocate a new family of sparse 3D attention that holds a linear complexity w.r.t. the number of video frames. 2) Shorten the sampling process by adopting existing multi-step consistency distillation; We split the entire sampling trajectory into several segments and perform consistency distillation within each one to activate few-step generation capacities. We further devise a three-stage training pipeline to conjoin the low-complexity attention and few-step generation capacities. Notably, with 0.1% pretraining data, we turn the Open-Sora-Plan-1.2 model into an efficient one that is 7.4x -7.8x faster for 29 and 93 frames 720p video generation with a marginal performance trade-off in VBench. In addition, we demonstrate that our approach is amenable to distributed inference, achieving an additional 3.91x speedup when running on 4 GPUs with sequence parallelism.
中文: 本文提出一种高效视频生成方法,通过引入稀疏3D注意力降低计算复杂度,并采用多步一致性蒸馏加速采样过程,在保持视频质量的同时实现了最高7.8倍的加速效果。
English: This paper proposes an efficient video generation method by introducing sparse 3D attention to reduce computational complexity and employing multi-step consistency distillation to accelerate sampling, achieving up to 7.8x speedup with minimal quality loss.

Authors:Ivan Baburin, Matthew Cook, Florian Grötschla, Andreas Plesner, Roger Wattenhofer
Title: Universality Frontier for Asynchronous Cellular Automata
Abstract:
In this work, we investigate the computational aspects of asynchronous cellular automata (ACAs), a modification of cellular automata in which cells update independently, following an asynchronous schedule. We introduce flip automata networks (FAN), a simple modification of automata networks that remain robust under any asynchronous update schedule. We show that asynchronous automata can efficiently simulate their synchronous counterparts with a linear memory overhead, which improves upon the previously established quadratic bound. Additionally, we address the universality gap for (a)synchronous cellular automata -- the boundary separating universal and non-universal automata, which is still not fully understood. We tighten this boundary by proving that all one-way asynchronous automata lack universal computational power. Conversely, we establish the existence of a universal 6-state first-neighbor automaton in one dimension and a 3-state von Neumann automaton in two dimensions, which represent the smallest known universal constructions to date.
中文: 本研究证明异步元胞自动机能够以线性内存开销模拟同步版本,通过证明单向异步自动机不具备通用性来收紧通用性边界,并提出了目前已知最小的通用自动机:一维6状态和二维3状态结构。
English: This study demonstrates that asynchronous cellular automata can simulate synchronous versions with linear memory overhead, tightens the universality boundary by proving one-way asynchronous automata are non-universal, and presents the smallest known universal automata with 6 states in 1D and 3 states in 2D.

Authors:Nan Qi, Haoxuan Liu, Theodoros A. Tsiftsis, Alexandros-Apostolos A. Boulogeorgos, Fuhui Zhou, Shi Jin, Qihui Wu
Title: Coalition Formation for Heterogeneous Federated Learning Enabled Channel Estimation in RIS-assisted Cell-free MIMO
Abstract:
Downlink channel estimation remains a significant bottleneck in reconfigurable intelligent surface-assisted cell-free multiple-input multiple-output communication systems. Conventional approaches primarily rely on centralized deep learning methods to estimate the high-dimensional and complex cascaded channels. These methods require data aggregation from all users for centralized model training, leading to excessive communication overhead and significant data privacy concerns. Additionally, the large size of local learning models imposes heavy computational demands on end users, necessitating strong computational capabilities that most commercial devices lack. To address the aforementioned challenges, a coalition-formation-guided heterogeneous federated learning (FL) framework is proposed. This framework leverages coalition formation to guide the formation of heterogeneous FL user groups for efficient channel estimation. Specifically, by utilizing a distributed deep reinforcement learning (DRL) approach, each FL user intelligently and independently decides whether to join or leave a coalition, aiming at improving channel estimation accuracy, while reducing local model size and computational costs for end users. Moreover, to accelerate the DRL-FL convergence process and reduce computational burdens on end users, a transfer learning method is introduced. This method incorporates both received reference signal power and distance similarity metrics, by considering that nodes with similar distances to the base station and comparable received signal power have a strong likelihood of experiencing similar channel fading. Massive experiments performed that reveal that, compared with the benchmarks, the proposed framework significantly reduces the computational overhead of end users by 16%, improves data privacy, and improves channel estimation accuracy by 20%.
中文: 针对智能超表面辅助的无蜂窝MIMO系统中的下行信道估计难题,提出了联盟形成引导的异构联邦学习框架,在提升20%估计精度的同时降低16%计算开销,并有效增强了数据隐私保护。
English: A coalition-formation-guided heterogeneous federated learning framework is proposed to address downlink channel estimation challenges in RIS-assisted cell-free MIMO systems, reducing computational overhead by 16% and improving estimation accuracy by 20% while enhancing data privacy.

Authors:Muhammad Imran, Jonathan R. Krebs, Vishal Balaji Sivaraman, Teng Zhang, Amarjeet Kumar, Walker R. Ueland, Michael J. Fassler, Jinlong Huang, Xiao Sun, Lisheng Wang, Pengcheng Shi, Maximilian Rokuss, Michael Baumgartner, Yannick Kirchhof, Klaus H. Maier-Hein, Fabian Isensee, Shuolin Liu, Bing Han, Bong Thanh Nguyen, Dong-jin Shin, Park Ji-Woo, Mathew Choi, Kwang-Hyun Uhm, Sung-Jea Ko, Chanwoong Lee, Jaehee Chun, Jin Sung Kim, Minghui Zhang, Hanxiao Zhang, Xin You, Yun Gu, Zhaohong Pan, Xuan Liu, Xiaokun Liang, Markus Tiefenthaler, Enrique Almar-Munoz, Matthias Schwab, Mikhail Kotyushev, Rostislav Epifanov, Marek Wodzinski, Henning Muller, Abdul Qayyum, Moona Mazher, Steven A. Niederer, Zhiwei Wang, Kaixiang Yang, Jintao Ren, Stine Sofia Korreman, Yuchong Gao, Hongye Zeng, Haoyu Zheng, Rui Zheng, Jinghua Yue, Fugen Zhou, Bo Liu, Alexander Cosman, Muxuan Liang, Chang Zhao, Gilbert R. Upchurch, Jun Ma, Yuyin Zhou, Michol A. Cooper, Wei Shao
Title: Multi-Class Segmentation of Aortic Branches and Zones in Computed Tomography Angiography: The AortaSeg24 Challenge
Abstract:
Multi-class segmentation of the aorta in computed tomography angiography (CTA) scans is essential for diagnosing and planning complex endovascular treatments for patients with aortic dissections. However, existing methods reduce aortic segmentation to a binary problem, limiting their ability to measure diameters across different branches and zones. Furthermore, no open-source dataset is currently available to support the development of multi-class aortic segmentation methods. To address this gap, we organized the AortaSeg24 MICCAI Challenge, introducing the first dataset of 100 CTA volumes annotated for 23 clinically relevant aortic branches and zones. This dataset was designed to facilitate both model development and validation. The challenge attracted 121 teams worldwide, with participants leveraging state-of-the-art frameworks such as nnU-Net and exploring novel techniques, including cascaded models, data augmentation strategies, and custom loss functions. We evaluated the submitted algorithms using the Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD), highlighting the approaches adopted by the top five performing teams. This paper presents the challenge design, dataset details, evaluation metrics, and an in-depth analysis of the top-performing algorithms. The annotated dataset, evaluation code, and implementations of the leading methods are publicly available to support further research. All resources can be accessed at https://aortaseg24.grand-challenge.org.
中文: AortaSeg24 MICCAI挑战赛发布了首个包含100个CTA扫描和23个主动脉分支标注的开源数据集,推动了多类别分割研究,121支参赛团队的最佳方案已公开共享。
English: The AortaSeg24 MICCAI Challenge introduced the first open-source dataset of 100 CTA volumes with 23 annotated aortic branches and zones to advance multi-class segmentation, attracting 121 teams whose top-performing methods are now publicly available.

Authors:Runqing Jiang, Ye Zhang, Longguang Wang, Pengpeng Yu, Yulan Guo
Title: AIQViT: Architecture-Informed Post-Training Quantization for Vision Transformers
Abstract:
Post-training quantization (PTQ) has emerged as a promising solution for reducing the storage and computational cost of vision transformers (ViTs). Recent advances primarily target at crafting quantizers to deal with peculiar activations characterized by ViTs. However, most existing methods underestimate the information loss incurred by weight quantization, resulting in significant performance deterioration, particularly in low-bit cases. Furthermore, a common practice in quantizing post-Softmax activations of ViTs is to employ logarithmic transformations, which unfortunately prioritize less informative values around zero. This approach introduces additional redundancies, ultimately leading to suboptimal quantization efficacy. To handle these, this paper proposes an innovative PTQ method tailored for ViTs, termed AIQViT (Architecture-Informed Post-training Quantization for ViTs). First, we design an architecture-informed low rank compensation mechanism, wherein learnable low-rank weights are introduced to compensate for the degradation caused by weight quantization. Second, we design a dynamic focusing quantizer to accommodate the unbalanced distribution of post-Softmax activations, which dynamically selects the most valuable interval for higher quantization resolution. Extensive experiments on five vision tasks, including image classification, object detection, instance segmentation, point cloud classification, and point cloud part segmentation, demonstrate the superiority of AIQViT over state-of-the-art PTQ methods.
中文: 本文提出AIQViT方法,通过架构感知的低秩补偿机制缓解权重量化损失,并采用动态聚焦量化器优化激活值分布,在五项视觉任务中均超越现有PTQ方法。
English: This paper introduces AIQViT, a novel post-training quantization method for vision transformers that addresses weight quantization degradation through low-rank compensation and optimizes activation quantization with a dynamic focusing quantizer, achieving superior performance across multiple vision tasks.

Authors:Jaehan Im, Filippos Fotiadis, Daniel Delahaye, Ufuk Topcu, David Fridovich-Keil
Title: Noncooperative Equilibrium Selection via a Trading-based Auction
Abstract:
Noncooperative multi-agent systems often face coordination challenges due to conflicting preferences among agents. In particular, agents acting in their own self-interest can settle on different equilibria, leading to suboptimal outcomes or even safety concerns. We propose an algorithm named trading auction for consensus (TACo), a decentralized approach that enables noncooperative agents to reach consensus without communicating directly or disclosing private valuations. TACo facilitates coordination through a structured trading-based auction, where agents iteratively select choices of interest and provably reach an agreement within an a priori bounded number of steps. A series of numerical experiments validate that the termination guarantees of TACo hold in practice, and show that TACo achieves a median performance that minimizes the total cost across all agents, while allocating resources significantly more fairly than baseline approaches.
中文: 提出的TACo算法通过去中心化的交易拍卖机制,使非合作智能体无需直接通信即可达成共识,在有限步骤内实现公平资源分配并最小化总成本。
English: The proposed TACo algorithm enables noncooperative agents to reach consensus through a decentralized trading auction without direct communication, achieving fair resource allocation and minimized total cost within bounded steps.

Authors:Yazid Janati, Badr Moufad, Mehdi Abou El Qassime, Alain Durmus, Eric Moulines, Jimmy Olsson
Title: A Mixture-Based Framework for Guiding Diffusion Models
Abstract:
Denoising diffusion models have driven significant progress in the field of Bayesian inverse problems. Recent approaches use pre-trained diffusion models as priors to solve a wide range of such problems, only leveraging inference-time compute and thereby eliminating the need to retrain task-specific models on the same dataset. To approximate the posterior of a Bayesian inverse problem, a diffusion model samples from a sequence of intermediate posterior distributions, each with an intractable likelihood function. This work proposes a novel mixture approximation of these intermediate distributions. Since direct gradient-based sampling of these mixtures is infeasible due to intractable terms, we propose a practical method based on Gibbs sampling. We validate our approach through extensive experiments on image inverse problems, utilizing both pixel- and latent-space diffusion priors, as well as on source separation with an audio diffusion model. The code is available at https://www.github.com/badr-moufad/mgdm
中文: 本文针对贝叶斯逆问题中的中间后验分布提出了一种新颖的混合近似方法,通过吉布斯采样解决似然函数难处理的问题,并在图像和音频应用中验证了该方法的有效性。
English: This paper introduces a novel mixture approximation method for intermediate posterior distributions in Bayesian inverse problems using denoising diffusion models, employing Gibbs sampling to overcome intractable likelihood terms and demonstrating effectiveness across image and audio applications.

Authors:Xinyao Liao, Xianfang Zeng, Liao Wang, Gang Yu, Guosheng Lin, Chi Zhang
Title: MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent
Abstract:
We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation. The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields, providing flexible and precise motion guidance. Specifically, the agent extracts the object movement and camera motion described in the text and converts them into object trajectories and camera extrinsics, respectively. An analytical optical flow composition module integrates these motion representations in 3D space and projects them into a unified optical flow. An optical flow adapter takes the flow to control the base image-to-video diffusion model for generating fine-grained controlled videos. The significant improvement in the Video-Text Camera Motion metrics on VBench indicates that our method achieves precise control over camera motion. We construct a subset of VBench to evaluate the alignment of motion information in the text and the generated video, outperforming other advanced models on motion generation accuracy.
中文:MotionAgent通过运动场代理将文本中的运动信息转化为显式运动场,结合物体轨迹和相机参数生成统一光流,从而在图像到视频生成中实现精细运动控制,显著提升了视频与文本运动对齐的准确性。
English: MotionAgent introduces a motion field agent that converts text-based motion cues into explicit motion fields, enabling precise control in image-to-video generation by integrating object trajectories and camera extrinsics into optical flow for enhanced video-text alignment.

Authors:Yuancheng Wang, Jiachen Zheng, Junan Zhang, Xueyao Zhang, Huan Liao, Zhizheng Wu
Title: Metis: A Foundation Speech Generation Model with Masked Generative Pre-training
Abstract:
We introduce Metis, a foundation model for unified speech generation. Unlike previous task-specific or multi-task models, Metis follows a pre-training and fine-tuning paradigm. It is pre-trained on large-scale unlabeled speech data using masked generative modeling and then fine-tuned to adapt to diverse speech generation tasks. Specifically, 1) Metis utilizes two discrete speech representations: SSL tokens derived from speech self-supervised learning (SSL) features, and acoustic tokens directly quantized from waveforms. 2) Metis performs masked generative pre-training on SSL tokens, utilizing 300K hours of diverse speech data, without any additional condition. 3) Through fine-tuning with task-specific conditions, Metis achieves efficient adaptation to various speech generation tasks while supporting multimodal input, even when using limited data and trainable parameters. Experiments demonstrate that Metis can serve as a foundation model for unified speech generation: Metis outperforms state-of-the-art task-specific or multi-task systems across five speech generation tasks, including zero-shot text-to-speech, voice conversion, target speaker extraction, speech enhancement, and lip-to-speech, even with fewer than 20M trainable parameters or 300 times less training data. Audio samples are are available at https://metis-demo.github.io/.
Chinese Summary: Metis是一种统一语音生成的基础模型,通过大规模无标签语音数据的预训练和任务特定微调,在五项语音生成任务中超越现有最优系统,且仅需极少参数和训练数据。
English Summary: Metis is a foundation model for unified speech generation that uses pre-training on large-scale unlabeled speech data and fine-tuning for diverse tasks, outperforming state-of-the-art systems across five speech generation tasks with minimal parameters and data.

Authors:Rui Chen, Yifan Sun, Changliu Liu
Title: Dexterous Safe Control for Humanoids in Cluttered Environments via Projected Safe Set Algorithm
Abstract:
It is critical to ensure safety for humanoid robots in real-world applications without compromising performance. In this paper, we consider the problem of dexterous safety, featuring limb-level geometry constraints for avoiding both external and self-collisions in cluttered environments. Compared to safety with simplified bounding geometries in sprase environments, dexterous safety produces numerous constraints which often lead to infeasible constraint sets when solving for safe robot control. To address this issue, we propose Projected Safe Set Algorithm (p-SSA), an extension of classical safe control algorithms to multi-constraint cases. p-SSA relaxes conflicting constraints in a principled manner, minimizing safety violations to guarantee feasible robot control. We verify our approach in simulation and on a real Unitree G1 humanoid robot performing complex collision avoidance tasks. Results show that p-SSA enables the humanoid to operate robustly in challenging situations with minimal safety violations and directly generalizes to various tasks with zero parameter tuning.
中文摘要:本文提出投影安全集算法(p-SSA),使人形机器人能够在复杂环境中通过协调多约束条件实现灵巧安全操作,在保证控制可行性的同时将安全违规降至最低。
English Summary: This paper introduces the Projected Safe Set Algorithm (p-SSA), which enables humanoid robots to maintain dexterous safety by managing multiple collision constraints in cluttered environments while ensuring feasible control with minimal safety violations.

Authors:Xiaomeng Yang, Zhiyu Tan, Hao Li
Title: IPO: Iterative Preference Optimization for Text-to-Video Generation
Abstract:
Video foundation models have achieved significant advancement with the help of network upgrade as well as model scale-up. However, they are still hard to meet requirements of applications due to unsatisfied generation quality. To solve this problem, we propose to align video foundation models with human preferences from the perspective of post-training in this paper. Consequently, we introduce an Iterative Preference Optimization strategy to enhance generated video quality by incorporating human feedback. Specifically, IPO exploits a critic model to justify video generations for pairwise ranking as in Direct Preference Optimization or point-wise scoring as in Kahneman-Tversky Optimization. Given this, IPO optimizes video foundation models with guidance of signals from preference feedback, which helps improve generated video quality in subject consistency, motion smoothness and aesthetic quality, etc. In addition, IPO incorporates the critic model with the multi-modality large language model, which enables it to automatically assign preference labels without need of retraining or relabeling. In this way, IPO can efficiently perform multi-round preference optimization in an iterative manner, without the need of tediously manual labeling. Comprehensive experiments demonstrate that the proposed IPO can effectively improve the video generation quality of a pretrained model and help a model with only 2B parameters surpass the one with 5B parameters. Besides, IPO achieves new state-of-the-art performance on VBench benchmark.
中文: 本文提出迭代偏好优化(IPO)策略,通过自动反馈和多轮优化使视频基础模型与人类偏好对齐,从而提升生成视频质量,并以更少参数量实现更优性能。
English: This paper introduces an Iterative Preference Optimization (IPO) strategy to enhance video generation quality by aligning video foundation models with human preferences through automated feedback and multi-round optimization, achieving superior performance with fewer parameters.

Authors:Peng Lu, Ivan Kobyzev, Mehdi Rezagholizadeh, Boxing Chen, Philippe Langlais
Title: ReGLA: Refining Gated Linear Attention
Abstract:
Recent advancements in Large Language Models (LLMs) have set themselves apart with their exceptional performance in complex language modelling tasks. However, these models are also known for their significant computational and storage requirements, primarily due to the quadratic computation complexity of softmax attention. To mitigate this issue, linear attention has been designed to reduce the quadratic space-time complexity that is inherent in standard transformers. In this work, we embarked on a comprehensive exploration of three key components that substantially impact the performance of the Gated Linear Attention module: feature maps, normalization, and the gating mechanism. We developed a feature mapping function to address some crucial issues that previous suggestions overlooked. Then we offered further rationale for the integration of normalization layers to stabilize the training process. Moreover, we explored the saturation phenomenon of the gating mechanism and augmented it with a refining module. We conducted extensive experiments and showed our architecture outperforms previous Gated Linear Attention mechanisms in extensive tasks including training from scratch and post-linearization with continual pre-training.
Chinese: 本研究针对大型语言模型的计算挑战,通过改进门控线性注意力模块中的特征映射、归一化和门控机制,在多项任务中实现了优于以往方法的性能表现。
English: Recent advancements in Large Language Models have led to high performance but face computational challenges, which this study addresses by improving the Gated Linear Attention module through enhanced feature maps, normalization, and gating mechanisms, achieving superior results in various tasks.

Authors:Christos Dalamagkas, Panagiotis Radoglou-Grammatikis, Pavlos Bouzinis, Ioannis Papadopoulos, Thomas Lagkas, Vasileios Argyriou, Sotirios Goudos, Dimitrios Margounakis, Eleftherios Fountoukidis, Panagiotis Sarigiannidis
Title: Federated Detection of Open Charge Point Protocol 1.6 Cyberattacks
Abstract:
The ongoing electrification of the transportation sector requires the deployment of multiple Electric Vehicle (EV) charging stations across multiple locations. However, the EV charging stations introduce significant cyber-physical and privacy risks, given the presence of vulnerable communication protocols, like the Open Charge Point Protocol (OCPP). Meanwhile, the Federated Learning (FL) paradigm showcases a novel approach for improved intrusion detection results that utilize multiple sources of Internet of Things data, while respecting the confidentiality of private information. This paper proposes the adoption of the FL architecture for the monitoring of the EV charging infrastructure and the detection of cyberattacks against the OCPP 1.6 protocol. The evaluation results showcase high detection performance of the proposed FL-based solution.
中文: 本文提出采用联邦学习架构来监测电动汽车充电基础设施并检测针对OCPP 1.6协议的网络攻击,在保护数据隐私的同时展现出优异的检测性能。
English: This paper proposes a federated learning architecture to monitor EV charging infrastructure and detect cyberattacks on the OCPP 1.6 protocol, demonstrating high detection performance while protecting data privacy.

Authors:Minh Nhat Vu, Alexander Wachter, Gerald Ebmer, Marc-Philip Ecker, Tobias Glück, Anh Nguyen, Wolfgang Kemmetmueller, Andreas Kugi
Title: Towards Autonomous Wood-Log Grasping with a Forestry Crane: Simulator and Benchmarking
Abstract:
Forestry machines operated in forest production environments face challenges when performing manipulation tasks, especially regarding the complicated dynamics of underactuated crane systems and the heavy weight of logs to be grasped. This study investigates the feasibility of using reinforcement learning for forestry crane manipulators in grasping and lifting heavy wood logs autonomously. We first build a simulator using Mujoco physics engine to create realistic scenarios, including modeling a forestry crane with 8 degrees of freedom from CAD data and wood logs of different sizes. We further implement a velocity controller for autonomous log grasping with deep reinforcement learning using a curriculum strategy. Utilizing our new simulator, the proposed control strategy exhibits a success rate of 96% when grasping logs of different diameters and under random initial configurations of the forestry crane. In addition, reward functions and reinforcement learning baselines are implemented to provide an open-source benchmark for the community in large-scale manipulation tasks. A video with several demonstrations can be seen at https://www.acin.tuwien.ac.at/en/d18a/
中文: 本研究证明,通过定制模拟器和课程策略,强化学习能让林业起重机机械臂自主抓取重型原木,成功率高达96%。
English: This study demonstrates that reinforcement learning enables forestry crane manipulators to autonomously grasp heavy logs with 96% success rate using a custom simulator and curriculum strategy.

Authors:Xinze Wang, Chen Chen, Yinfei Yang, Hong-You Chen, Bowen Zhang, Aditya Pal, Xiangxin Zhu, Xianzhi Du
Title: CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling
Abstract:
Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. While integrating MoE into multimodal models like CLIP improves performance, training these models is notoriously challenging and expensive. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture. Through extensive experimentation with various settings and auxiliary losses, we demonstrate that CLIP-UP significantly reduces training complexity and cost. Remarkably, our sparse CLIP B/16 model, trained with CLIP-UP, outperforms its dense counterpart by 7.2% and 6.6% on COCO and Flickr30k text-to-image Recall@1 benchmarks respectively. It even surpasses the larger CLIP L/14 model on this task while using only 30% of the inference FLOPs. We further demonstrate the generalizability of our training recipe across different scales, establishing sparse upcycling as a practical and scalable approach for building efficient, high-performance CLIP models.
Chinese: CLIP-UP 通过将预训练的密集 CLIP 模型高效转换为稀疏专家混合架构,显著降低了训练成本,在基准测试中实现更优性能的同时减少了推理计算量。
English: CLIP-UP efficiently transforms pre-trained dense CLIP models into sparse Mixture-of-Experts architectures, significantly reducing training costs while achieving superior performance on benchmarks and using fewer inference resources.

Authors:Walter Zimmer, Ross Greer, Xingcheng Zhou, Rui Song, Marc Pavel, Daniel Lehmberg, Ahmed Ghita, Akshay Gopalkrishnan, Mohan Trivedi, Alois Knoll
Title: Enhancing Highway Safety: Accident Detection on the A9 Test Stretch Using Roadside Sensors
Abstract:
Road traffic injuries are the leading cause of death for people aged 5-29, resulting in about 1.19 million deaths each year. To reduce these fatalities, it is essential to address human errors like speeding, drunk driving, and distractions. Additionally, faster accident detection and quicker medical response can help save lives. We propose an accident detection framework that combines a rule-based approach with a learning-based one. We introduce a dataset of real-world highway accidents featuring high-speed crash sequences. It includes 294,924 labeled 2D boxes, 93,012 labeled 3D boxes, and track IDs across 48,144 frames captured at 10 Hz using four roadside cameras and LiDAR sensors. The dataset covers ten object classes and is released in the OpenLABEL format. Our experiments and analysis demonstrate the reliability of our method.
中文: 道路交通伤害是5至29岁人群的主要死因,为减少伤亡,本研究提出了一种结合规则与学习方法的混合事故检测框架,并在真实高速公路数据集上验证了其可靠性。
English: Road traffic injuries are the leading cause of death for young people aged 5-29, and to reduce fatalities, this study proposes a hybrid accident detection framework combining rule-based and learning-based approaches, validated on a comprehensive real-world highway dataset.

Authors:Haonan An, Guang Hua, Zhengru Fang, Guowen Xu, Susanto Rahardja, Yuguang Fang
Title: Decoder Gradient Shield: Provable and High-Fidelity Prevention of Gradient-Based Box-Free Watermark Removal
Abstract:
The intellectual property of deep image-to-image models can be protected by the so-called box-free watermarking. It uses an encoder and a decoder, respectively, to embed into and extract from the model's output images invisible copyright marks. Prior works have improved watermark robustness, focusing on the design of better watermark encoders. In this paper, we reveal an overlooked vulnerability of the unprotected watermark decoder which is jointly trained with the encoder and can be exploited to train a watermark removal network. To defend against such an attack, we propose the decoder gradient shield (DGS) as a protection layer in the decoder API to prevent gradient-based watermark removal with a closed-form solution. The fundamental idea is inspired by the classical adversarial attack, but is utilized for the first time as a defensive mechanism in the box-free model watermarking. We then demonstrate that DGS can reorient and rescale the gradient directions of watermarked queries and stop the watermark remover's training loss from converging to the level without DGS, while retaining decoder output image quality. Experimental results verify the effectiveness of proposed method. Code of paper will be made available upon acceptance.
Chinese: 本文揭示了无盒水印系统中未受保护的解码器存在漏洞,攻击者可利用其去除水印,并提出解码器梯度屏蔽(DGS)作为防御机制,在保持图像质量的同时有效阻止此类攻击。
English: This paper identifies a vulnerability in the unprotected decoder of box-free watermarking systems, where attackers can exploit it to remove watermarks, and proposes a decoder gradient shield (DGS) as a defensive mechanism to prevent such attacks while maintaining image quality.

Authors:Lei Yang, Renren Jin, Ling Shi, Jianxiang Peng, Yue Chen, Deyi Xiong
Title: ProBench: Benchmarking Large Language Models in Competitive Programming
Abstract:
With reasoning language models such as OpenAI-o3 and DeepSeek-R1 emerging, large language models (LLMs) have entered a new phase of development. However, existing benchmarks for coding evaluation are gradually inadequate to assess the capability of advanced LLMs in code reasoning. To bridge the gap for high-level code reasoning assessment, we propose ProBench to benchmark LLMs in competitive programming, drawing inspiration from the International Collegiate Programming Contest. ProBench collects a comprehensive set of competitive programming problems from Codeforces, Luogu, and Nowcoder platforms during the period from July to December 2024, obtaining real test results through online submissions to ensure the fairness and accuracy of the evaluation. We establish a unified problem attribute system, including difficulty grading and algorithm tagging. With carefully collected and annotated data in ProBench, we systematically assess 9 latest LLMs in competitive programming across multiple dimensions, including thought chain analysis, error type diagnosis, and reasoning depth evaluation. Experimental results show that QwQ-32B-Preview achieves the best score of 20.93 followed by DeepSeek-V3 with a score of 16.38, suggesting that models trained with specialized reasoning tasks significantly outperform general-purpose models (even larger than reasoning-oriented models) in programming. Further analysis also reveals key areas for programming capability enhancement, e.g., algorithm adaptability and reasoning sufficiency, providing important insights for the future development of reasoning models.
中文摘要:ProBench作为新型基准测试工具,针对高级大语言模型在编程竞赛中的代码推理能力进行评估,结果表明经过专项推理训练的模型(如QwQ-32B-Preview)显著优于通用模型,并揭示了算法适应性与推理充分性等关键改进方向。
English Summary: ProBench is a new benchmark designed to evaluate advanced large language models' code reasoning capabilities in competitive programming, revealing that specialized reasoning-trained models like QwQ-32B-Preview outperform general-purpose models and identifying key areas for future improvement.

Authors:Jiawen Li, Jiali Hu, Qiehe Sun, Renao Yan, Minxi Ouyang, Tian Guan, Anjia Han, Chao He, Yonghong He
Title: Can We Simplify Slide-level Fine-tuning of Pathology Foundation Models?
Abstract:
The emergence of foundation models in computational pathology has transformed histopathological image analysis, with whole slide imaging (WSI) diagnosis being a core application. Traditionally, weakly supervised fine-tuning via multiple instance learning (MIL) has been the primary method for adapting foundation models to WSIs. However, in this work we present a key experimental finding: a simple nonlinear mapping strategy combining mean pooling and a multilayer perceptron, called SiMLP, can effectively adapt patch-level foundation models to slide-level tasks without complex MIL-based learning. Through extensive experiments across diverse downstream tasks, we demonstrate the superior performance of SiMLP with state-of-the-art methods. For instance, on a large-scale pan-cancer classification task, SiMLP surpasses popular MIL-based methods by 3.52%. Furthermore, SiMLP shows strong learning ability in few-shot classification and remaining highly competitive with slide-level foundation models pretrained on tens of thousands of slides. Finally, SiMLP exhibits remarkable robustness and transferability in lung cancer subtyping. Overall, our findings challenge the conventional MIL-based fine-tuning paradigm, demonstrating that a task-agnostic representation strategy alone can effectively adapt foundation models to WSI analysis. These insights offer a unique and meaningful perspective for future research in digital pathology, paving the way for more efficient and broadly applicable methodologies.
中文: 本研究提出SiMLP这一简单非线性映射方法,无需复杂多示例学习即可将补丁级基础模型有效适配于全切片图像任务,在多种应用中展现出优于传统方法的性能、鲁棒性和可迁移性。
English: The study introduces SiMLP, a straightforward nonlinear mapping method that effectively adapts patch-level foundation models to whole slide image tasks, outperforming traditional multiple instance learning approaches and demonstrating superior performance, robustness, and transferability across various applications.

Authors:Shaobo Wang, Yicun Yang, Zhiyuan Liu, Chenghao Sun, Xuming Hu, Conghui He, Linfeng Zhang
Title: Dataset Distillation with Neural Characteristic Function: A Minmax Perspective
Abstract:
Dataset distillation has emerged as a powerful approach for reducing data requirements in deep learning. Among various methods, distribution matching-based approaches stand out for their balance of computational efficiency and strong performance. However, existing distance metrics used in distribution matching often fail to accurately capture distributional differences, leading to unreliable measures of discrepancy. In this paper, we reformulate dataset distillation as a minmax optimization problem and introduce Neural Characteristic Function Discrepancy (NCFD), a comprehensive and theoretically grounded metric for measuring distributional differences. NCFD leverages the Characteristic Function (CF) to encapsulate full distributional information, employing a neural network to optimize the sampling strategy for the CF's frequency arguments, thereby maximizing the discrepancy to enhance distance estimation. Simultaneously, we minimize the difference between real and synthetic data under this optimized NCFD measure. Our approach, termed Neural Characteristic Function Matching (\mymethod{}), inherently aligns the phase and amplitude of neural features in the complex plane for both real and synthetic data, achieving a balance between realism and diversity in synthetic samples. Experiments demonstrate that our method achieves significant performance gains over state-of-the-art methods on both low- and high-resolution datasets. Notably, we achieve a 20.5\% accuracy boost on ImageSquawk. Our method also reduces GPU memory usage by over 300$\times$ and achieves 20$\times$ faster processing speeds compared to state-of-the-art methods. To the best of our knowledge, this is the first work to achieve lossless compression of CIFAR-100 on a single NVIDIA 2080 Ti GPU using only 2.3 GB of memory.
中文: 本文提出神经特征函数匹配方法,通过将数据集蒸馏重构为极小极大优化问题并采用理论完备的分布差异度量,在提升性能的同时大幅降低了计算资源消耗。
English: This paper introduces Neural Characteristic Function Matching, a novel dataset distillation method that reformulates the problem as minmax optimization and uses a theoretically grounded metric to enhance distribution matching, achieving significant performance gains and computational efficiency improvements.

Authors:Sirui Xu, Hung Yu Ling, Yu-Xiong Wang, Liang-Yan Gui
Title: InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions
Abstract:
Achieving realistic simulations of humans interacting with a wide range of objects has long been a fundamental goal. Extending physics-based motion imitation to complex human-object interactions (HOIs) is challenging due to intricate human-object coupling, variability in object geometries, and artifacts in motion capture data, such as inaccurate contacts and limited hand detail. We introduce InterMimic, a framework that enables a single policy to robustly learn from hours of imperfect MoCap data covering diverse full-body interactions with dynamic and varied objects. Our key insight is to employ a curriculum strategy -- perfect first, then scale up. We first train subject-specific teacher policies to mimic, retarget, and refine motion capture data. Next, we distill these teachers into a student policy, with the teachers acting as online experts providing direct supervision, as well as high-quality references. Notably, we incorporate RL fine-tuning on the student policy to surpass mere demonstration replication and achieve higher-quality solutions. Our experiments demonstrate that InterMimic produces realistic and diverse interactions across multiple HOI datasets. The learned policy generalizes in a zero-shot manner and seamlessly integrates with kinematic generators, elevating the framework from mere imitation to generative modeling of complex human-object interactions.
中文摘要:InterMimic框架通过“先完善后扩展”的课程策略,结合师生策略蒸馏和强化学习微调,能够从有缺陷的运动捕捉数据中稳健学习多样化人-物交互,生成逼真交互行为,并实现从单纯模仿到复杂交互生成建模的零样本泛化能力。
English Summary: InterMimic is a framework that enables robust learning from imperfect motion capture data for diverse human-object interactions through a curriculum strategy combining teacher-student distillation and RL fine-tuning, producing realistic interactions and generalizing beyond mere imitation to generative modeling.

Authors:Chao Feng, Alberto Huertas Celdrán, Xi Cheng, Gérôme Bovet, Burkhard Stiller
Title: GreenDFL: a Framework for Assessing the Sustainability of Decentralized Federated Learning Systems
Abstract:
Decentralized Federated Learning (DFL) is an emerging paradigm that enables collaborative model training without centralized data and model aggregation, enhancing privacy and resilience. However, its sustainability remains underexplored, as energy consumption and carbon emissions vary across different system configurations. Understanding the environmental impact of DFL is crucial for optimizing its design and deployment. This work aims to develop a comprehensive and operational framework for assessing the sustainability of DFL systems. To address it, this work provides a systematic method for quantifying energy consumption and carbon emissions, offering insights into improving the sustainability of DFL. This work proposes GreenDFL, a fully implementable framework that has been integrated into a real-world DFL platform. GreenDFL systematically analyzes the impact of various factors, including hardware accelerators, model architecture, communication medium, data distribution, network topology, and federation size, on the sustainability of DFL systems. Besides, a sustainability-aware aggregation algorithm (GreenDFL-SA) and a node selection algorithm (GreenDFL-SN) are developed to optimize energy efficiency and reduce carbon emissions in DFL training. Empirical experiments are conducted on multiple datasets, measuring energy consumption and carbon emissions at different phases of the DFL lifecycle. The proposed GreenDFL provides a comprehensive and practical approach for assessing the sustainability of DFL systems. Furthermore, it offers best practices for improving environmental efficiency in DFL, making sustainability considerations more actionable in real-world deployments.
中文: 本研究提出了GreenDFL框架,通过量化能耗与碳排放来评估去中心化联邦学习系统的可持续性,并提供优化算法以减少环境影响,为实际部署提供可行方案。
English: This work introduces GreenDFL, a practical framework for evaluating and enhancing the sustainability of Decentralized Federated Learning systems by quantifying energy consumption and carbon emissions, while offering optimization algorithms to reduce environmental impact.

Authors:Chenhao Ding, Xinyuan Gao, Songlin Dong, Yuhang He, Qiang Wang, Xiang Song, Alex Kot, Yihong Gong
Title: Space Rotation with Basis Transformation for Training-free Test-Time Adaptation
Abstract:
With the development of visual-language models (VLM) in downstream task applications, test-time adaptation methods based on VLM have attracted increasing attention for their ability to address changes distribution in test-time. Although prior approaches have achieved some progress, they typically either demand substantial computational resources or are constrained by the limitations of the original feature space, rendering them less effective for test-time adaptation tasks. To address these challenges, we propose a training-free feature space rotation with basis transformation for test-time adaptation. By leveraging the inherent distinctions among classes, we reconstruct the original feature space and map it to a new representation, thereby enhancing the clarity of class differences and providing more effective guidance for the model during testing. Additionally, to better capture relevant information from various classes, we maintain a dynamic queue to store representative samples. Experimental results across multiple benchmarks demonstrate that our method outperforms state-of-the-art techniques in terms of both performance and efficiency.
中文: 本研究提出了一种无需训练的特征空间旋转方法,通过基变换实现视觉语言模型的测试时适应,重构特征空间以增强类别区分度,并采用动态队列存储代表性样本,在多个基准测试中性能和效率均优于现有技术。
English: This study introduces a training-free feature space rotation method using basis transformation for test-time adaptation in visual-language models, which reconstructs the feature space to enhance class distinction clarity and employs a dynamic queue for representative samples, outperforming current techniques in performance and efficiency across benchmarks.

Authors:Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammadali Banayeeanzade, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
Title: CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation
Abstract:
Contrastive Language-Image Pre-training (CLIP) models excel in zero-shot classification, yet face challenges in complex multi-object scenarios. This study offers a comprehensive analysis of CLIP's limitations in these contexts using a specialized dataset, ComCO, designed to evaluate CLIP's encoders in diverse multi-object scenarios. Our findings reveal significant biases: the text encoder prioritizes first-mentioned objects, and the image encoder favors larger objects. Through retrieval and classification tasks, we quantify these biases across multiple CLIP variants and trace their origins to CLIP's training process, supported by analyses of the LAION dataset and training progression. Our image-text matching experiments show substantial performance drops when object size or token order changes, underscoring CLIP's instability with rephrased but semantically similar captions. Extending this to longer captions and text-to-image models like Stable Diffusion, we demonstrate how prompt order influences object prominence in generated images. For more details and access to our dataset and analysis code, visit our project repository: https://clip-oscope.github.io.
Chinese: CLIP模型在多对象场景中存在显著偏差,文本编码器偏向首先提及的对象,图像编码器偏好较大对象,导致语义相似描述下性能不稳定,并影响生成图像中对象的突出程度。
English: CLIP models exhibit significant biases in multi-object scenarios, with text encoders prioritizing first-mentioned objects and image encoders favoring larger ones, leading to performance instability with semantically similar captions and influencing object prominence in generated images.

Authors:Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammadali Banayeeanzade, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
Title: Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study
Abstract:
Contrastive Language-Image Pre-training (CLIP) models have demonstrated remarkable performance in zero-shot classification tasks, yet their efficacy in handling complex multi-object scenarios remains challenging. This study presents a comprehensive analysis of CLIP's performance limitations in multi-object contexts through controlled experiments. We introduce two custom datasets, SimCO and CompCO, to evaluate CLIP's image and text encoders in various multi-object configurations. Our findings reveal significant biases in both encoders: the image encoder favors larger objects, while the text encoder prioritizes objects mentioned first in descriptions. We hypothesize these biases originate from CLIP's training process and provide evidence through analyses of the COCO dataset and CLIP's training progression. Additionally, we extend our investigation to Stable Diffusion models, revealing that biases in the CLIP text encoder significantly impact text-to-image generation tasks. Our experiments demonstrate how these biases affect CLIP's performance in image-caption matching and generation tasks, particularly when manipulating object sizes and their order in captions. This work contributes valuable insights into CLIP's behavior in complex visual environments and highlights areas for improvement in future vision-language models.
中文摘要:本研究揭示CLIP模型存在显著偏差:图像编码器偏向较大物体,文本编码器优先处理描述中先出现的对象,这些训练导致的局限影响了多对象场景处理和文生图任务的性能。
English Summary: This study identifies significant biases in CLIP models where the image encoder favors larger objects and the text encoder prioritizes first-mentioned objects, revealing how these limitations from training affect performance in multi-object scenarios and text-to-image generation.

Authors:Yeonjun In, Kanghoon Yoon, Sukwon Yun, Kibum Kim, Sungchul Kim, Chanyoung Park
Title: Training Robust Graph Neural Networks by Modeling Noise Dependencies
Abstract:
In real-world applications, node features in graphs often contain noise from various sources, leading to significant performance degradation in GNNs. Although several methods have been developed to enhance robustness, they rely on the unrealistic assumption that noise in node features is independent of the graph structure and node labels, thereby limiting their applicability. To this end, we introduce a more realistic noise scenario, dependency-aware noise on graphs (DANG), where noise in node features create a chain of noise dependencies that propagates to the graph structure and node labels. We propose a novel robust GNN, DA-GNN, which captures the causal relationships among variables in the data generating process (DGP) of DANG using variational inference. In addition, we present new benchmark datasets that simulate DANG in real-world applications, enabling more practical research on robust GNNs. Extensive experiments demonstrate that DA-GNN consistently outperforms existing baselines across various noise scenarios, including both DANG and conventional noise models commonly considered in this field.
Graph neural networks often suffer from performance degradation due to noisy node features, and existing robustness methods are limited by unrealistic independence assumptions; this paper introduces a more realistic dependency-aware noise model (DANG) and proposes DA-GNN, a novel robust GNN that captures causal relationships through variational inference and demonstrates superior performance across various noise scenarios.
English Summary:

Authors:Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, Jiayuan Zhu, Hongwei Bran Li, Chen Chen, Cheng Ouyang, Daniel Rueckert
Title: MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning
Abstract:
Reasoning is a critical frontier for advancing medical image analysis, where transparency and trustworthiness play a central role in both clinician trust and regulatory approval. Although Medical Visual Language Models (VLMs) show promise for radiological tasks, most existing VLMs merely produce final answers without revealing the underlying reasoning. To address this gap, we introduce MedVLM-R1, a medical VLM that explicitly generates natural language reasoning to enhance transparency and trustworthiness. Instead of relying on supervised fine-tuning (SFT), which often suffers from overfitting to training distributions and fails to foster genuine reasoning, MedVLM-R1 employs a reinforcement learning framework that incentivizes the model to discover human-interpretable reasoning paths without using any reasoning references. Despite limited training data (600 visual question answering samples) and model parameters (2B), MedVLM-R1 boosts accuracy from 55.11% to 78.22% across MRI, CT, and X-ray benchmarks, outperforming larger models trained on over a million samples. It also demonstrates robust domain generalization under out-of-distribution tasks. By unifying medical image analysis with explicit reasoning, MedVLM-R1 marks a pivotal step toward trustworthy and interpretable AI in clinical practice. Inference model is available at: https://huggingface.co/JZPeterPan/MedVLM-R1.
Chinese: MedVLM-R1通过强化学习框架开发出能生成明确自然语言推理的医学视觉语言模型,在有限数据和参数下显著提升了多模态影像分析的准确性与可解释性,推动了临床可信AI的发展。
English: MedVLM-R1 introduces a reinforcement learning-based medical visual language model that generates explicit natural language reasoning to enhance transparency and accuracy, achieving significant performance improvements across multiple imaging modalities despite limited data and parameters.

Authors:Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn
Title: FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users
Abstract:
Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering.
中文: 少样本偏好优化(FSPO)通过少量用户偏好数据使大语言模型快速个性化适配,在合成与真实用户测试中均实现了高胜率的定制化生成效果。
English: Few-Shot Preference Optimization (FSPO) enables LLMs to quickly adapt to individual users through minimal preference data, achieving high personalization success rates in both synthetic and human evaluations.

Authors:Kuang Wang, Xianfei Li, Shenghao Yang, Li Zhou, Feng Jiang, Haizhou Li
Title: Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles
Abstract:
User simulators are crucial for replicating human interactions with dialogue systems, supporting both collaborative training and automatic evaluation, especially for large language models (LLMs). However, current role-playing methods face challenges such as a lack of utterance-level authenticity and user-level diversity, often hindered by role confusion and dependence on predefined profiles of well-known figures. In contrast, direct simulation focuses solely on text, neglecting implicit user traits like personality and conversation-level consistency. To address these issues, we introduce the User Simulator with Implicit Profiles (USP), a framework that infers implicit user profiles from human-machine interactions to simulate personalized and realistic dialogues. We first develop an LLM-driven extractor with a comprehensive profile schema, then refine the simulation using conditional supervised fine-tuning and reinforcement learning with cycle consistency, optimizing at both the utterance and conversation levels. Finally, a diverse profile sampler captures the distribution of real-world user profiles. Experimental results show that USP outperforms strong baselines in terms of authenticity and diversity while maintaining comparable consistency. Additionally, using USP to evaluate LLM on dynamic multi-turn aligns well with mainstream benchmarks, demonstrating its effectiveness in real-world applications.
中文摘要:USP框架通过从人机交互中推断隐含用户特征,采用大语言模型驱动的提取和强化学习方法,解决了现有用户模拟方法在真实性和多样性上的不足,在保持一致性的同时显著提升了对话质量。
English Summary: The User Simulator with Implicit Profiles (USP) framework addresses limitations in current user simulation methods by inferring implicit user traits from interactions, employing LLM-driven extraction and reinforcement learning to enhance dialogue authenticity and diversity while maintaining consistency.

Authors:Sen Yang, Yafu Li, Wai Lam, Yu Cheng
Title: Multi-LLM Collaborative Search for Complex Problem Solving
Abstract:
Large language models (LLMs) often struggle with complex reasoning tasks due to their limitations in addressing the vast reasoning space and inherent ambiguities of natural language. We propose the Mixture-of-Search-Agents (MoSA) paradigm, a novel approach leveraging the collective expertise of multiple LLMs to enhance search-based reasoning. MoSA integrates diverse reasoning pathways by combining independent exploration with iterative refinement among LLMs, mitigating the limitations of single-model approaches. Using Monte Carlo Tree Search (MCTS) as a backbone, MoSA enables multiple agents to propose and aggregate reasoning steps, resulting in improved accuracy. Our comprehensive evaluation across four reasoning benchmarks demonstrates MoSA's consistent performance improvements over single-agent and other multi-agent baselines, particularly in complex mathematical and commonsense reasoning tasks.
中文: MoSA(混合搜索代理)范式通过整合多个大语言模型的集体智慧,基于蒙特卡洛树搜索提升复杂推理能力,在多项基准测试中持续优于单代理方法。
English: The Mixture-of-Search-Agents (MoSA) paradigm enhances complex reasoning by leveraging multiple LLMs' collective expertise through Monte Carlo Tree Search, consistently outperforming single-agent approaches across various benchmarks.

Authors:Feibo Jiang, Wanyun Zhu, Li Dong, Kezhi Wang, Kun Yang, Cunhua Pan, Octavia A. Dobre
Title: CommGPT: A Graph and Retrieval-Augmented Multimodal Communication Foundation Model
Abstract:
Large Language Models (LLMs) possess human-level cognitive and decision-making capabilities, making them a key technology for 6G. However, applying LLMs to the communication domain faces three major challenges: 1) Inadequate communication data; 2) Restricted input modalities; and 3) Difficulty in knowledge retrieval. To overcome these issues, we propose CommGPT, a multimodal foundation model designed specifically for communications. First, we create high-quality pretraining and fine-tuning datasets tailored in communication, enabling the LLM to engage in further pretraining and fine-tuning with communication concepts and knowledge. Then, we design a multimodal encoder to understand and process information from various input modalities. Next, we construct a Graph and Retrieval-Augmented Generation (GRG) framework, efficiently coupling Knowledge Graph (KG) with Retrieval-Augmented Generation (RAG) for multi-scale learning. Finally, we demonstrate the feasibility and effectiveness of the CommGPT through experimental validation.
Chinese: 大语言模型对6G至关重要,但在通信领域应用面临挑战;CommGPT通过定制数据集、多模态编码器及图与检索增强生成框架解决这些问题,实验验证了其可行性和有效性。
English: Large Language Models (LLMs) are pivotal for 6G but face challenges in communication applications, which CommGPT addresses by creating tailored datasets, a multimodal encoder, and a Graph and Retrieval-Augmented Generation framework, proving effective through experiments.

Authors:Brian Hu Zhang, Ioannis Anagnostides, Emanuel Tewolde, Ratip Emin Berker, Gabriele Farina, Vincent Conitzer, Tuomas Sandholm
Title: Expected Variational Inequalities
Abstract:
Variational inequalities (VIs) encompass many fundamental problems in diverse areas ranging from engineering to economics and machine learning. However, their considerable expressivity comes at the cost of computational intractability. In this paper, we introduce and analyze a natural relaxation -- which we refer to as expected variational inequalities (EVIs) -- where the goal is to find a distribution that satisfies the VI constraint in expectation. By adapting recent techniques from game theory, we show that, unlike VIs, EVIs can be solved in polynomial time under general (nonmonotone) operators. EVIs capture the seminal notion of correlated equilibria, but enjoy a greater reach beyond games. We also employ our framework to capture and generalize several existing disparate results, including from settings such as smooth games, and games with coupled constraints or nonconcave utilities.
中文: 本文提出期望变分不等式(EVIs),作为变分不等式的松弛形式,可在多项式时间内求解一般非单调算子问题,不仅涵盖相关均衡概念,还能推广并统一多个领域的现有研究成果。
English: This paper introduces expected variational inequalities (EVIs), a relaxation of variational inequalities that can be solved in polynomial time for general nonmonotone operators, extending beyond games to capture correlated equilibria and generalize existing results.

Authors:Brian Hu Zhang, Ioannis Anagnostides, Emanuel Tewolde, Ratip Emin Berker, Gabriele Farina, Vincent Conitzer, Tuomas Sandholm
Title: Learning and Computation of $Φ$-Equilibria at the Frontier of Tractability
Abstract:
$Φ$-equilibria -- and the associated notion of $Φ$-regret -- are a powerful and flexible framework at the heart of online learning and game theory, whereby enriching the set of deviations $Φ$ begets stronger notions of rationality. Recently, Daskalakis, Farina, Fishelson, Pipis, and Schneider (STOC '24) -- abbreviated as DFFPS -- settled the existence of efficient algorithms when $Φ$ contains only linear maps under a general, $d$-dimensional convex constraint set $\mathcal{X}$. In this paper, we significantly extend their work by resolving the case where $Φ$ is $k$-dimensional; degree-$\ell$ polynomials constitute a canonical such example with $k = d^{O(\ell)}$. In particular, positing only oracle access to $\mathcal{X}$, we obtain two main positive results: i) a $\text{poly}(n, d, k, \text{log}(1/ε))$-time algorithm for computing $ε$-approximate $Φ$-equilibria in $n$-player multilinear games, and ii) an efficient online algorithm that incurs average $Φ$-regret at most $ε$ using $\text{poly}(d, k)/ε^2$ rounds. We also show nearly matching lower bounds in the online learning setting, thereby obtaining for the first time a family of deviations that captures the learnability of $Φ$-regret. From a technical standpoint, we extend the framework of DFFPS from linear maps to the more challenging case of maps with polynomial dimension. At the heart of our approach is a polynomial-time algorithm for computing an expected fixed point of any $ϕ: \mathcal{X} \to \mathcal{X}$ based on the ellipsoid against hope (EAH) algorithm of Papadimitriou and Roughgarden (JACM '08). In particular, our algorithm for computing $Φ$-equilibria is based on executing EAH in a nested fashion -- each step of EAH itself being implemented by invoking a separate call to EAH.
中文摘要:本文扩展了DFFPS的研究,针对k维多项式映射的Φ-均衡计算和Φ-遗憾最小化问题,提出了高效算法,并在在线学习场景中获得了近乎最优的结果与匹配下界。
English Summary: This paper extends the work of DFFPS by developing efficient algorithms for computing Φ-equilibria and minimizing Φ-regret when Φ consists of k-dimensional polynomial maps, achieving nearly optimal results with matching lower bounds in online learning.

Authors:Tong Ye, Weigang Huang, Xuhong Zhang, Tengfei Ma, Peiyu Liu, Jianwei Yin, Wenhai Wang
Title: LLM4EFFI: Leveraging Large Language Models to Enhance Code Efficiency and Correctness
Abstract:
Large Language Models (LLMs), particularly Code LLMs, have demonstrated impressive performance in code generation. Current research primarily focuses on the correctness of generated code, while efficiency remains less explored. Recent works have focused on modifying the initial version of the code to improve its efficiency. However, such refinements are limited by the algorithmic design and overall logic of the initial code, resulting in only incremental improvements. In contrast, when human developers write high-quality code, they typically begin by designing several potential solutions at the logical level, evaluating various algorithms and their complexities, and then proceeding to implement and optimize the solution. In this study, we introduce \tool: \uline{L}arge \uline{L}anguage \uline{M}odel for Code \uline{Effi}ciency, a novel framework that enables LLMs to generate code that balances both efficiency and correctness. Specifically, \tool divides the efficiency optimization process into two domains: algorithmic exploration in the logic domain and implementation optimization in the code domain. The correctness of the code is then guaranteed through a synthetic test case refinement process. This approach, which prioritizes efficiency before ensuring correctness, offers a new paradigm for efficient code generation. Experiments demonstrate that \tool consistently improves both efficiency and correctness, achieving new state-of-the-art performance in code efficiency benchmarks across various LLM backbones.
中文: 本研究提出了\tool框架,通过先在逻辑层面探索算法方案再优化代码实现,使大语言模型能生成兼顾效率与正确性的代码,在代码效率基准测试中取得了最优性能。
English: This study introduces \tool, a framework that enables Large Language Models to generate efficient and correct code by first exploring algorithmic solutions and then optimizing implementations, achieving state-of-the-art performance in code efficiency benchmarks.

Authors:Yifan Pu, Yiming Zhao, Zhicong Tang, Ruihong Yin, Haoxing Ye, Yuhui Yuan, Dong Chen, Jianmin Bao, Sirui Zhang, Yanbin Wang, Lin Liang, Lijuan Wang, Ji Li, Xiu Li, Zhouhui Lian, Gao Huang, Baining Guo
Title: ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation
Abstract:
Multi-layer image generation is a fundamental task that enables users to isolate, select, and edit specific image layers, thereby revolutionizing interactions with generative models. In this paper, we introduce the Anonymous Region Transformer (ART), which facilitates the direct generation of variable multi-layer transparent images based on a global text prompt and an anonymous region layout. Inspired by Schema theory suggests that knowledge is organized in frameworks (schemas) that enable people to interpret and learn from new information by linking it to prior knowledge.}, this anonymous region layout allows the generative model to autonomously determine which set of visual tokens should align with which text tokens, which is in contrast to the previously dominant semantic layout for the image generation task. In addition, the layer-wise region crop mechanism, which only selects the visual tokens belonging to each anonymous region, significantly reduces attention computation costs and enables the efficient generation of images with numerous distinct layers (e.g., 50+). When compared to the full attention approach, our method is over 12 times faster and exhibits fewer layer conflicts. Furthermore, we propose a high-quality multi-layer transparent image autoencoder that supports the direct encoding and decoding of the transparency of variable multi-layer images in a joint manner. By enabling precise control and scalable layer generation, ART establishes a new paradigm for interactive content creation.
中文摘要:本文提出匿名区域变换器(ART),通过文本提示和匿名布局直接生成可变多层透明图像,其分层区域裁剪机制显著降低计算成本并支持可扩展的交互式内容创作。
English Summary: This paper introduces the Anonymous Region Transformer (ART), a novel method for directly generating variable multi-layer transparent images from text prompts and anonymous layouts, which enhances efficiency by reducing attention costs and enabling scalable layer generation.

Authors:Manuel Barusco, Francesco Borsatti, Davide Dalle Pezze, Francesco Paissan, Elisabetta Farella, Gian Antonio Susto
Title: From Vision to Sound: Advancing Audio Anomaly Detection with Vision-Based Algorithms
Abstract:
Recent advances in Visual Anomaly Detection (VAD) have introduced sophisticated algorithms leveraging embeddings generated by pre-trained feature extractors. Inspired by these developments, we investigate the adaptation of such algorithms to the audio domain to address the problem of Audio Anomaly Detection (AAD). Unlike most existing AAD methods, which primarily classify anomalous samples, our approach introduces fine-grained temporal-frequency localization of anomalies within the spectrogram, significantly improving explainability. This capability enables a more precise understanding of where and when anomalies occur, making the results more actionable for end users. We evaluate our approach on industrial and environmental benchmarks, demonstrating the effectiveness of VAD techniques in detecting anomalies in audio signals. Moreover, they improve explainability by enabling localized anomaly identification, making audio anomaly detection systems more interpretable and practical.
中文摘要:本研究将视觉异常检测技术应用于音频领域,提出了一种在频谱图中对异常进行细粒度时频定位的方法,显著提升了检测准确性和结果可解释性,使音频异常检测系统更具实用性。
English Summary: This study adapts visual anomaly detection techniques to audio, introducing a method that provides fine-grained temporal-frequency localization of anomalies in spectrograms, enhancing both detection accuracy and explainability for practical applications.

Authors:Zeju Li, Changran Xu, Zhengyuan Shi, Zedong Peng, Yi Liu, Yunhao Zhou, Lingfeng Zhou, Chengyu Ma, Jianyuan Zhong, Xi Wang, Jieru Zhao, Zhufei Chu, Xiaoyan Yang, Qiang Xu
Title: DeepCircuitX: A Comprehensive Repository-Level Dataset for RTL Code Understanding, Generation, and PPA Analysis
Abstract:
This paper introduces DeepCircuitX, a comprehensive repository-level dataset designed to advance RTL (Register Transfer Level) code understanding, generation, and power-performance-area (PPA) analysis. Unlike existing datasets that are limited to either file-level RTL code or physical layout data, DeepCircuitX provides a holistic, multilevel resource that spans repository, file, module, and block-level RTL code. This structure enables more nuanced training and evaluation of large language models (LLMs) for RTL-specific tasks. DeepCircuitX is enriched with Chain of Thought (CoT) annotations, offering detailed descriptions of functionality and structure at multiple levels. These annotations enhance its utility for a wide range of tasks, including RTL code understanding, generation, and completion. Additionally, the dataset includes synthesized netlists and PPA metrics, facilitating early-stage design exploration and enabling accurate PPA prediction directly from RTL code. We demonstrate the dataset's effectiveness on various LLMs finetuned with our dataset and confirm the quality with human evaluations. Our results highlight DeepCircuitX as a critical resource for advancing RTL-focused machine learning applications in hardware design automation.Our data is available at https://zeju.gitbook.io/lcm-team.
中文摘要:本文介绍了DeepCircuitX,这是一个多层级数据集,通过整合不同设计层级的代码、注释和PPA指标,提升了RTL代码分析和机器学习应用的能力。
English Summary: This paper presents DeepCircuitX, a multi-level dataset that enhances RTL code analysis and machine learning applications by integrating code, annotations, and PPA metrics across different design hierarchies.

Authors:Mengzhao Wang, Haotian Wu, Xiangyu Ke, Yunjun Gao, Yifan Zhu, Wenchao Zhou
Title: Accelerating Graph Indexing for ANNS on Modern CPUs
Abstract:
In high-dimensional vector spaces, Approximate Nearest Neighbor Search (ANNS) is a key component in database and artificial intelligence infrastructures. Graph-based methods, particularly HNSW, have emerged as leading solutions among various ANNS approaches, offering an impressive trade-off between search efficiency and accuracy. Many modern vector databases utilize graph indexes as their core algorithms, benefiting from various optimizations to enhance search performance. However, the high indexing time associated with graph algorithms poses a significant challenge, especially given the increasing volume of data, query processing complexity, and dynamic index maintenance demand. This has rendered indexing time a critical performance metric for users. In this paper, we comprehensively analyze the underlying causes of the low graph indexing efficiency on modern CPUs, identifying that distance computation dominates indexing time, primarily due to high memory access latency and suboptimal arithmetic operation efficiency. We demonstrate that distance comparisons during index construction can be effectively performed using compact vector codes at an appropriate compression error. Drawing from insights gained through integrating existing compact coding methods in the graph indexing process, we propose a novel compact coding strategy, named Flash, designed explicitly for graph indexing and optimized for modern CPU architectures. By minimizing random memory accesses and maximizing the utilization of SIMD (Single Instruction, Multiple Data) instructions, Flash significantly enhances cache hit rates and arithmetic operations. Extensive experiments conducted on eight real-world datasets, ranging from ten million to one billion vectors, exhibit that Flash achieves a speedup of 10.4$\times$ to 22.9$\times$ in index construction efficiency, while maintaining or improving search performance.
中文: 基于图的近似最近邻搜索方法如图形索引HNSW因距离计算效率低下导致索引时间过长,而提出的紧凑编码策略Flash通过优化内存访问和SIMD指令使用,在保持搜索性能的同时实现了10倍以上的索引加速。
English: Graph-based approximate nearest neighbor search methods like HNSW face high indexing time due to distance computation inefficiencies, which the proposed compact coding strategy Flash addresses by optimizing memory access and SIMD utilization to achieve over 10x faster indexing while maintaining search performance.

Authors:Pusheng Xu, Yue Wu, Kai Jin, Xiaolan Chen, Mingguang He, Danli Shi
Title: DeepSeek-R1 Outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in Bilingual Complex Ophthalmology Reasoning
Abstract:
Purpose: To evaluate the accuracy and reasoning ability of DeepSeek-R1 and three other recently released large language models (LLMs) in bilingual complex ophthalmology cases. Methods: A total of 130 multiple-choice questions (MCQs) related to diagnosis (n = 39) and management (n = 91) were collected from the Chinese ophthalmology senior professional title examination and categorized into six topics. These MCQs were translated into English using DeepSeek-R1. The responses of DeepSeek-R1, Gemini 2.0 Pro, OpenAI o1 and o3-mini were generated under default configurations between February 15 and February 20, 2025. Accuracy was calculated as the proportion of correctly answered questions, with omissions and extra answers considered incorrect. Reasoning ability was evaluated through analyzing reasoning logic and the causes of reasoning error. Results: DeepSeek-R1 demonstrated the highest overall accuracy, achieving 0.862 in Chinese MCQs and 0.808 in English MCQs. Gemini 2.0 Pro, OpenAI o1, and OpenAI o3-mini attained accuracies of 0.715, 0.685, and 0.692 in Chinese MCQs (all P<0.001 compared with DeepSeek-R1), and 0.746 (P=0.115), 0.723 (P=0.027), and 0.577 (P<0.001) in English MCQs, respectively. DeepSeek-R1 achieved the highest accuracy across five topics in both Chinese and English MCQs. It also excelled in management questions conducted in Chinese (all P<0.05). Reasoning ability analysis showed that the four LLMs shared similar reasoning logic. Ignoring key positive history, ignoring key positive signs, misinterpretation medical data, and too aggressive were the most common causes of reasoning errors. Conclusion: DeepSeek-R1 demonstrated superior performance in bilingual complex ophthalmology reasoning tasks than three other state-of-the-art LLMs. While its clinical applicability remains challenging, it shows promise for supporting diagnosis and clinical decision-making.
中文摘要:DeepSeek-R1在双语复杂眼科推理任务中表现优于其他三种先进大语言模型,在中文和英文医学问题中均获得最高准确率,且推理逻辑相似但关键错误更少。
English Summary: DeepSeek-R1 outperformed three other advanced LLMs in bilingual ophthalmology reasoning tasks, achieving the highest accuracy in both Chinese and English medical questions while demonstrating similar reasoning patterns with fewer critical errors.

Authors:Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Andrei Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Igor Kiselev, Vladislav Kurenkov
Title: Yes, Q-learning Helps Offline In-Context RL
Abstract:
Existing offline in-context reinforcement learning (ICRL) methods have predominantly relied on supervised training objectives, which are known to have limitations in offline RL settings. In this study, we explore the integration of RL objectives within an offline ICRL framework. Through experiments on more than 150 GridWorld and MuJoCo environment-derived datasets, we demonstrate that optimizing RL objectives directly improves performance by approximately 30% on average compared to widely adopted Algorithm Distillation (AD), across various dataset coverages, structures, expertise levels, and environmental complexities. Furthermore, in the challenging XLand-MiniGrid environment, RL objectives doubled the performance of AD. Our results also reveal that the addition of conservatism during value learning brings additional improvements in almost all settings tested. Our findings emphasize the importance of aligning ICRL learning objectives with the RL reward-maximization goal, and demonstrate that offline RL is a promising direction for advancing ICRL.
中文: 本研究表明,在离线上下文强化学习中引入强化学习目标相比监督方法平均提升性能30%,在复杂环境中效果翻倍,且价值学习中的保守策略能在多数场景带来额外增益。
English: This study demonstrates that integrating reinforcement learning objectives into offline in-context reinforcement learning significantly boosts performance by 30% on average and doubles it in challenging environments compared to supervised methods, while conservatism in value learning further enhances results across diverse settings.

Authors:Li Dong, Feibo Jiang, Yubo Peng
Title: Attention-based UAV Trajectory Optimization for Wireless Power Transfer-assisted IoT Systems
Abstract:
Unmanned Aerial Vehicles (UAVs) in Wireless Power Transfer (WPT)-assisted Internet of Things (IoT) systems face the following challenges: limited resources and suboptimal trajectory planning. Reinforcement learning-based trajectory planning schemes face issues of low search efficiency and learning instability when optimizing large-scale systems. To address these issues, we present an Attention-based UAV Trajectory Optimization (AUTO) framework based on the graph transformer, which consists of an Attention Trajectory Optimization Model (ATOM) and a Trajectory lEarNing Method based on Actor-critic (TENMA). In ATOM, a graph encoder is used to calculate the self-attention characteristics of all IoTDs, and a trajectory decoder is developed to optimize the number and trajectories of UAVs. TENMA then trains the ATOM using an improved Actor-Critic method, in which the real reward of the system is applied as the baseline to reduce variances in the critic network. This method is suitable for high-quality and large-scale multi-UAV trajectory planning. Finally, we develop numerous experiments, including a hardware experiment in the field case, to verify the feasibility and efficiency of the AUTO framework.
中文摘要:AUTO框架通过基于图变换器的注意力机制和改进的行动者-评论家训练方法,解决了无线供能物联网系统中无人机轨迹优化的资源限制与规划效率问题,并通过实地硬件实验验证了其在大规模应用中的有效性。
English Summary: The AUTO framework addresses UAV trajectory optimization challenges in WPT-assisted IoT systems through an attention-based graph transformer model and improved actor-critic training, demonstrating effectiveness in large-scale deployments via field experiments.

Authors:Zifeng Zhuang, Diyuan Shi, Runze Suo, Xiao He, Hongyin Zhang, Ting Wang, Shangke Lyu, Donglin Wang
Title: TDMPBC: Self-Imitative Reinforcement Learning for Humanoid Robot Control
Abstract:
Complex high-dimensional spaces with high Degree-of-Freedom and complicated action spaces, such as humanoid robots equipped with dexterous hands, pose significant challenges for reinforcement learning (RL) algorithms, which need to wisely balance exploration and exploitation under limited sample budgets. In general, feasible regions for accomplishing tasks within complex high-dimensional spaces are exceedingly narrow. For instance, in the context of humanoid robot motion control, the vast majority of space corresponds to falling, while only a minuscule fraction corresponds to standing upright, which is conducive to the completion of downstream tasks. Once the robot explores into a potentially task-relevant region, it should place greater emphasis on the data within that region. Building on this insight, we propose the $\textbf{S}$elf-$\textbf{I}$mitative $\textbf{R}$einforcement $\textbf{L}$earning ($\textbf{SIRL}$) framework, where the RL algorithm also imitates potentially task-relevant trajectories. Specifically, trajectory return is utilized to determine its relevance to the task and an additional behavior cloning is adopted whose weight is dynamically adjusted based on the trajectory return. As a result, our proposed algorithm achieves 120% performance improvement on the challenging HumanoidBench with 5% extra computation overhead. With further visualization, we find the significant performance gain does lead to meaningful behavior improvement that several tasks are solved successfully.
中文:SIRL框架通过动态模仿高回报轨迹来增强复杂高维空间中的强化学习,在HumanoidBench基准测试中仅增加5%计算开销就实现了120%的性能提升。
English: The SIRL framework enhances reinforcement learning in complex high-dimensional spaces by dynamically prioritizing imitation of high-return trajectories, achieving 120% performance improvement on HumanoidBench with minimal computational overhead.

Authors:Yuhan Liu, Máté Kiss, Roland Tóth, Maarten Schoukens
Title: On Space-Filling Input Design for Nonlinear Dynamic Model Learning: A Gaussian Process Approach
Abstract:
While optimal input design for linear systems has been well-established, no systematic approach exists for nonlinear systems where robustness to extrapolation/interpolation errors is prioritized over minimizing estimated parameter variance. To address this issue, we develop a novel space-filling input design strategy for nonlinear system identification that ensures data coverage of a given region of interest. By placing a Gaussian Process (GP) prior on the joint input-state space, the proposed strategy leverages the GP posterior variance to construct a cost function that promotes space-filling input design. Consequently, this enables maximization of the coverage in the region of interest, thereby facilitating the generation of informative datasets. Furthermore, we theoretically prove that minimization of the cost function implies the space-filling property of the obtained data. Effectiveness of the proposed strategy is demonstrated on both an academic and a mass-spring-damper example, highlighting its potential practical impact on efficient exploration of the dynamics of nonlinear systems.
中文: 本研究提出了一种新颖的非线性系统辨识空间填充输入设计策略,利用高斯过程先验最大化感兴趣区域的数据覆盖,确保鲁棒性并生成信息丰富的数据集,以高效探索系统动态特性。
English: This study introduces a novel space-filling input design strategy for nonlinear system identification that utilizes Gaussian Process priors to maximize data coverage in regions of interest, ensuring robustness and generating informative datasets for efficient exploration of system dynamics.

Authors:Rui Xing, Boyang Sun, Kun Zhang, Preslav Nakov, Timothy Baldwin, Jey Han Lau
Title: An Analytical Emotion Framework of Rumour Threads on Social Media
Abstract:
Rumours in online social media pose significant risks to modern society, motivating the need for better understanding of how they develop. We focus specifically on the interface between emotion and rumours in threaded discourses, building on the surprisingly sparse literature on the topic which has largely focused on single aspect of emotions within the original rumour posts themselves, and largely overlooked the comparative differences between rumours and non-rumours. In this work, we take one step further to provide a comprehensive analytical emotion framework with multi-aspect emotion detection, contrasting rumour and non-rumour threads and provide both correlation and causal analysis of emotions. We applied our framework on existing widely-used rumour datasets to further understand the emotion dynamics in online social media threads. Our framework reveals that rumours trigger more negative emotions (e.g., anger, fear, pessimism), while non-rumours evoke more positive ones. Emotions are contagious, rumours spread negativity, non-rumours spread positivity. Causal analysis shows surprise bridges rumours and other emotions; pessimism comes from sadness and fear, while optimism arises from joy and love.
中文摘要:本研究构建了一个全面的情感分析框架,发现网络社交媒体中的谣言会引发更多愤怒、恐惧等负面情绪,而非谣言则激发积极情绪,因果分析表明惊讶情绪在谣言与其他情感反应间起到桥梁作用。
English Summary: This study develops a comprehensive emotion analysis framework revealing that rumors in online social media trigger more negative emotions like anger and fear, while non-rumors evoke positive emotions, with causal analysis showing surprise bridges rumors to other emotional responses.

Authors:Elias Frantar, Utku Evci, Wonpyo Park, Neil Houlsby, Dan Alistarh
Title: Compression Scaling Laws:Unifying Sparsity and Quantization
Abstract:
We investigate how different compression techniques -- such as weight and activation quantization, and weight sparsity -- affect the scaling behavior of large language models (LLMs) during pretraining. Building on previous work showing that weight sparsity acts as a constant multiplier on model size in scaling laws, we demonstrate that this "effective parameter" scaling pattern extends to quantization as well. Specifically, we establish that weight-only quantization achieves strong parameter efficiency multipliers, while full quantization of both weights and activations shows diminishing returns at lower bitwidths. Our results suggest that different compression techniques can be unified under a common scaling law framework, enabling principled comparison and combination of these methods.
中文: 研究表明,权重稀疏化和量化技术均遵循统一的缩放定律框架,它们作为模型规模的乘数发挥作用,其中仅权重量化展现出较强的参数效率,而全量化在较低比特位宽时收益递减。
English: This study demonstrates that both weight sparsity and quantization techniques follow a unified scaling law framework, where they act as multipliers on model size, with weight-only quantization showing strong parameter efficiency while full quantization exhibits diminishing returns at lower bitwidths.

Authors:Feibo Jiang, Siwei Tu, Li Dong, Kezhi Wang, Kun Yang, Ruiqi Liu, Cunhua Pan, Jiangzhou Wang
Title: Lightweight Vision Model-based Multi-user Semantic Communication Systems
Abstract:
Semantic Communication (SemCom) is a promising new paradigm for next-generation communication systems, emphasizing the transmission of core information, particularly in environments characterized by uncertainty, noise, and bandwidth constraints. However, existing image SemCom systems face several challenges, such as inefficient knowledge base construction, insufficient semantic encoding, and lack of multi-user semantic sharing. To address these issues, we propose a Lightweight Vision Model-based Multi-user Semantic Communication System (LVM-MSC). First, we construct a Lightweight Knowledge Base (LKB) based on the fast Segment Anything Model (SAM). LKB incorporates the extensive image knowledge of the SAM model while significantly reducing the number of parameters through its convolutional architecture. Next, we design an Efficient Semantic Codec (ESC) based on the Masked AutoEncoder (MAE) architecture. ESC enhances semantic compression at both the pixel and semantic levels and implements lightweight semantic decoding tailored for user devices. Furthermore, we propose a Multi-user Semantic Sharing (MSS) transmission for the multi-user SemCom. By calculating the similarity of semantic information among different users in the sharing semantic space, we unify the transmissions of similar semantic information through broadcasting, further improving the transmission efficiency. Finally, simulation results demonstrate the feasibility and effectiveness of the proposed LVM-MSC system.
中文摘要:提出的LVM-MSC系统通过构建轻量知识库、高效语义编解码器和多用户语义共享机制,有效解决了图像语义通信中的核心难题,显著提升了传输效率。
English Summary: The proposed LVM-MSC system addresses key challenges in image semantic communication by implementing a lightweight knowledge base, efficient semantic codec, and multi-user semantic sharing mechanism to enhance transmission efficiency.

Authors:Feibo Jiang, Siwei Tu, Li Dong, Kezhi Wang, Kun Yang, Cunhua Pan
Title: M4SC: An MLLM-based Multi-modal, Multi-task and Multi-user Semantic Communication System
Abstract:
Multi-modal Large Language Models (MLLMs) are capable of precisely extracting high-level semantic information from multi-modal data, enabling multi-task understanding and generation. This capability facilitates more efficient and intelligent data transmission in semantic communications. In this paper, we design a tailored MLLM for semantic communication and propose an MLLM-based Multi-modal, Multi-task and Multi-user Semantic Communication (M4SC) system. First, we utilize the Kolmogorov-Arnold Network (KAN) to achieve multi-modal alignment in MLLMs, thereby enhancing the accuracy of semantics representation in the semantic space across different modalities. Next, we introduce a multi-task fine-tuning approach based on task instruction following, which leverages a unified task instruction template to describe various semantic communication tasks, improving the MLLM's ability to follow instructions across multiple tasks. Additionally, by designing a semantic sharing mechanism, we transmit the public and private semantic information of multiple users separately, thus increasing the efficiency of semantic communication. Finally, we employ a joint KAN-LLM-channel coding strategy to comprehensively enhance the performance of the semantic communication system in complex communication environments. Experimental results validate the effectiveness and robustness of the proposed M4SC in multi-modal, multi-task, and multi-user scenarios.
中文摘要:本文针对语义通信设计了一种定制化多模态大语言模型,提出基于MLLM的M4SC系统,通过KAN网络实现多模态对齐、任务指令微调提升多任务处理能力,并采用语义共享机制与联合编码策略,有效提升了多模态多用户场景下的通信性能与鲁棒性。
English Summary: This paper introduces a tailored Multi-modal Large Language Model (MLLM) for semantic communication, proposing an M4SC system that enhances multi-modal alignment, multi-task instruction following, and multi-user semantic sharing through innovative techniques including Kolmogorov-Arnold Networks and joint coding strategies.

Authors:Kibum Kim, Kanghoon Yoon, Yeonjun In, Jaehyeong Jeon, Jinyoung Moon, Donghyun Kim, Chanyoung Park
Title: Weakly Supervised Video Scene Graph Generation via Natural Language Supervision
Abstract:
Existing Video Scene Graph Generation (VidSGG) studies are trained in a fully supervised manner, which requires all frames in a video to be annotated, thereby incurring high annotation cost compared to Image Scene Graph Generation (ImgSGG). Although the annotation cost of VidSGG can be alleviated by adopting a weakly supervised approach commonly used for ImgSGG (WS-ImgSGG) that uses image captions, there are two key reasons that hinder such a naive adoption: 1) Temporality within video captions, i.e., unlike image captions, video captions include temporal markers (e.g., before, while, then, after) that indicate time related details, and 2) Variability in action duration, i.e., unlike human actions in image captions, human actions in video captions unfold over varying duration. To address these issues, we propose a Natural Language-based Video Scene Graph Generation (NL-VSGG) framework that only utilizes the readily available video captions for training a VidSGG model. NL-VSGG consists of two key modules: Temporality-aware Caption Segmentation (TCS) module and Action Duration Variability-aware caption-frame alignment (ADV) module. Specifically, TCS segments the video captions into multiple sentences in a temporal order based on a Large Language Model (LLM), and ADV aligns each segmented sentence with appropriate frames considering the variability in action duration. Our approach leads to a significant enhancement in performance compared to simply applying the WS-ImgSGG pipeline to VidSGG on the Action Genome dataset. As a further benefit of utilizing the video captions as weak supervision, we show that the VidSGG model trained by NL-VSGG is able to predict a broader range of action classes that are not included in the training data, which makes our framework practical in reality.
中文: 提出的NL-VSGG框架通过时序感知的标题分割和动作时长变化感知的对齐模块,有效解决了弱监督视频场景图生成的局限性,在显著提升性能的同时还能预测训练数据中未包含的动作类别。
English: The proposed NL-VSGG framework addresses the limitations of weakly supervised video scene graph generation by introducing temporality-aware caption segmentation and action duration variability-aware alignment, significantly improving performance while enabling prediction of unseen action classes.

Authors:Zhilin Wang, Yafu Li, Jianhao Yan, Yu Cheng, Yue Zhang
Title: Unveiling Attractor Cycles in Large Language Models: A Dynamical Systems View of Successive Paraphrasing
Abstract:
Dynamical systems theory provides a framework for analyzing iterative processes and evolution over time. Within such systems, repetitive transformations can lead to stable configurations, known as attractors, including fixed points and limit cycles. Applying this perspective to large language models (LLMs), which iteratively map input text to output text, provides a principled approach to characterizing long-term behaviors. Successive paraphrasing serves as a compelling testbed for exploring such dynamics, as paraphrases re-express the same underlying meaning with linguistic variation. Although LLMs are expected to explore a diverse set of paraphrases in the text space, our study reveals that successive paraphrasing converges to stable periodic states, such as 2-period attractor cycles, limiting linguistic diversity. This phenomenon is attributed to the self-reinforcing nature of LLMs, as they iteratively favour and amplify certain textual forms over others. This pattern persists with increasing generation randomness or alternating prompts and LLMs. These findings underscore inherent constraints in LLM generative capability, while offering a novel dynamical systems perspective for studying their expressive potential.
中文: 大型语言模型的连续释义会收敛到稳定的周期性状态,如2周期吸引子循环,因其自我强化特性而限制语言多样性,从动态系统视角揭示了生成能力的内在局限。
English: Successive paraphrasing by large language models converges to stable periodic states like 2-period attractor cycles, limiting linguistic diversity due to their self-reinforcing nature, which reveals inherent constraints in generative capability from a dynamical systems perspective.

Authors:Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys, Daniel Barath, Iro Armeni
Title: CrossOver: 3D Scene Cross-Modal Alignment
Abstract:
Multi-modal 3D object understanding has gained significant attention, yet current approaches often assume complete data availability and rigid alignment across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require aligned modality data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities -- RGB images, point clouds, CAD models, floorplans, and text descriptions -- with relaxed constraints and without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting the adaptability for real-world applications in 3D scene understanding.
中文:CrossOver是一种灵活的跨模态3D场景理解框架,通过无需严格数据对齐或对象语义的方式学习统一嵌入空间,即使在数据不完整时也能实现稳健的场景检索和物体定位。
English: CrossOver is a flexible framework for cross-modal 3D scene understanding that learns a unified embedding space by aligning various modalities without strict data alignment or object semantics, demonstrating robust performance in scene retrieval and object localization even with incomplete data.

Authors:Sarvin Moradi, Gerben I. Beintema, Nick Jaensson, Roland Tóth, Maarten Schoukens
Title: Port-Hamiltonian Neural Networks with Output Error Noise Models
Abstract:
Hamiltonian neural networks (HNNs) represent a promising class of physics-informed deep learning methods that utilize Hamiltonian theory as foundational knowledge within neural networks. However, their direct application to engineering systems is often challenged by practical issues, including the presence of external inputs, dissipation, and noisy measurements. This paper introduces a novel framework that enhances the capabilities of HNNs to address these real-life factors. We integrate port-Hamiltonian theory into the neural network structure, allowing for the inclusion of external inputs and dissipation, while mitigating the impact of measurement noise through an output-error (OE) model structure. The resulting output error port-Hamiltonian neural networks (OE-pHNNs) can be adapted to tackle modeling complex engineering systems with noisy measurements. Furthermore, we propose the identification of OE-pHNNs based on the subspace encoder approach (SUBNET), which efficiently approximates the complete simulation loss using subsections of the data and uses an encoder function to predict initial states. By integrating SUBNET with OE-pHNNs, we achieve consistent models of complex engineering systems under noisy measurements. In addition, we perform a consistency analysis to ensure the reliability of the proposed data-driven model learning method. We demonstrate the effectiveness of our approach on system identification benchmarks, showing its potential as a powerful tool for modeling dynamic systems in real-world applications.
中文: 本文提出输出误差端口哈密顿神经网络(OE-pHNNs),通过融合端口哈密顿理论处理外部输入和耗散问题,结合SUBNET方法在噪声环境下实现对复杂工程系统的可靠建模。
English: This paper introduces output error port-Hamiltonian neural networks (OE-pHNNs) that integrate port-Hamiltonian theory to handle external inputs and dissipation while mitigating measurement noise, enhanced by the SUBNET approach for consistent modeling of complex engineering systems under noisy conditions.

Authors:Chang Liu, Yuwen Yang, Yue Ding, Hongtao Lu, Wenqing Lin, Ziming Wu, Wendong Bi
Title: DAG: Deep Adaptive and Generative $K$-Free Community Detection on Attributed Graphs
Abstract:
Community detection on attributed graphs with rich semantic and topological information offers great potential for real-world network analysis, especially user matching in online games. Graph Neural Networks (GNNs) have recently enabled Deep Graph Clustering (DGC) methods to learn cluster assignments from semantic and topological information. However, their success depends on the prior knowledge related to the number of communities $K$, which is unrealistic due to the high costs and privacy issues of acquisition.In this paper, we investigate the community detection problem without prior $K$, referred to as $K$-Free Community Detection problem. To address this problem, we propose a novel Deep Adaptive and Generative model~(DAG) for community detection without specifying the prior $K$. DAG consists of three key components, \textit{i.e.,} a node representation learning module with masked attribute reconstruction, a community affiliation readout module, and a community number search module with group sparsity. These components enable DAG to convert the process of non-differentiable grid search for the community number, \textit{i.e.,} a discrete hyperparameter in existing DGC methods, into a differentiable learning process. In such a way, DAG can simultaneously perform community detection and community number search end-to-end. To alleviate the cost of acquiring community labels in real-world applications, we design a new metric, EDGE, to evaluate community detection methods even when the labels are not feasible. Extensive offline experiments on five public datasets and a real-world online mobile game dataset demonstrate the superiority of our DAG over the existing state-of-the-art (SOTA) methods. DAG has a relative increase of 7.35\% in teams in a Tencent online game compared with the best competitor.
Chinese: 本文提出了一种新颖的深度自适应生成模型(DAG),无需预先指定社区数量即可实现社区检测,将离散搜索过程转化为可微分端到端学习框架,并在实验中证明其性能优于现有最优方法。
English: This paper introduces a novel Deep Adaptive and Generative model (DAG) that enables community detection without requiring prior knowledge of the number of communities, transforming the discrete search process into a differentiable end-to-end learning framework and demonstrating superior performance over existing methods.

Authors:Shuyong Gao, Yu'ang Feng, Qishan Wang, Lingyi Hong, Xinyu Zhou, Liu Fei, Yan Wang, Wenqiang Zhang
Title: MSVCOD:A Large-Scale Multi-Scene Dataset for Video Camouflage Object Detection
Abstract:
Video Camouflaged Object Detection (VCOD) is a challenging task which aims to identify objects that seamlessly concealed within the background in videos. The dynamic properties of video enable detection of camouflaged objects through motion cues or varied perspectives. Previous VCOD datasets primarily contain animal objects, limiting the scope of research to wildlife scenarios. However, the applications of VCOD extend beyond wildlife and have significant implications in security, art, and medical fields. Addressing this problem, we construct a new large-scale multi-domain VCOD dataset MSVCOD. To achieve high-quality annotations, we design a semi-automatic iterative annotation pipeline that reduces costs while maintaining annotation accuracy. Our MSVCOD is the largest VCOD dataset to date, introducing multiple object categories including human, animal, medical, and vehicle objects for the first time, while also expanding background diversity across various environments. This expanded scope increases the practical applicability of the VCOD task in camouflaged object detection. Alongside this dataset, we introduce a one-steam video camouflage object detection model that performs both feature extraction and information fusion without additional motion feature fusion modules. Our framework achieves state-of-the-art results on the existing VCOD animal dataset and the proposed MSVCOD. The dataset and code will be made publicly available.
中文: 本研究提出了迄今最大的多领域视频伪装目标检测数据集MSVCOD,涵盖多种目标类别和背景环境,同时开发了无需复杂运动特征融合的单流检测模型,在多个数据集上实现了最优性能。
English: This study introduces MSVCOD, the largest multi-domain video camouflaged object detection dataset with diverse object categories and backgrounds, along with a streamlined one-stream model that achieves state-of-the-art performance without complex motion fusion modules.

Authors:Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg
Title: Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning
Abstract:
Synthetic verification techniques such as generating test cases and reward modelling are common ways to enhance the coding capabilities of large language models (LLM) beyond predefined tests. Additionally, code verification has recently found great success as a critical component in improving reasoning capability of LLMs via reinforcement learning. In this paper, we propose an approach which can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We also propose multiple metrics to measure different aspects of the synthetic verifiers with the proposed benchmarks. By employing the proposed approach, we release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+), and analyzed synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances the verification accuracy.
中文: 本文提出一种将编程基准转化为评分数据集以评估合成验证器的方法,发布了四个新基准,并证明推理能改进测试用例生成,同时增加测试用例数量可提升验证准确性。
English: This paper introduces a method to convert coding benchmarks into scoring datasets for evaluating synthetic verifiers, releasing four new benchmarks and demonstrating that reasoning improves test case generation while scaling test cases boosts verification accuracy.

Authors:Danli Shi, Bowen Liu, Zhen Tian, Yue Wu, Jiancheng Yang, Ruoyu Chen, Bo Yang, Ou Xiao, Mingguang He
Title: Fundus2Globe: Generative AI-Driven 3D Digital Twins for Personalized Myopia Management
Abstract:
Myopia, projected to affect 50% population globally by 2050, is a leading cause of vision loss. Eyes with pathological myopia exhibit distinctive shape distributions, which are closely linked to the progression of vision-threatening complications. Recent understanding of eye-shape-based biomarkers requires magnetic resonance imaging (MRI), however, it is costly and unrealistic in routine ophthalmology clinics. We present Fundus2Globe, the first AI framework that synthesizes patient-specific 3D eye globes from ubiquitous 2D color fundus photographs (CFPs) and routine metadata (axial length, spherical equivalent), bypassing MRI dependency. By integrating a 3D morphable eye model (encoding biomechanical shape priors) with a latent diffusion model, our approach achieves submillimeter accuracy in reconstructing posterior ocular anatomy efficiently. Fundus2Globe uniquely quantifies how vision-threatening lesions (e.g., staphylomas) in CFPs correlate with MRI-validated 3D shape abnormalities, enabling clinicians to simulate posterior segment changes in response to refractive shifts. External validation demonstrates its robust generation performance, ensuring fairness across underrepresented groups. By transforming 2D fundus imaging into 3D digital replicas of ocular structures, Fundus2Globe is a gateway for precision ophthalmology, laying the foundation for AI-driven, personalized myopia management.
中文摘要:Fundus2Globe作为首创的人工智能系统,能通过普通眼底照片和常规数据生成精准的3D眼球模型,无需依赖核磁共振即可量化近视相关眼形异常,为精准眼科诊疗开辟了新途径。
English Summary: Fundus2Globe is an innovative AI framework that generates precise 3D eye models from standard 2D fundus images and basic clinical data, eliminating the need for costly MRI scans while enabling accurate analysis of myopia-related ocular deformities.

Authors:Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, Kenshi Abe, Mitsuki Sakamoto, Eiji Uchibe
Title: Evaluation of Best-of-N Sampling Strategies for Language Model Alignment
Abstract:
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) with human preferences at the time of decoding. BoN sampling is susceptible to a problem known as reward hacking. Since the reward model is an imperfect proxy for the true objective, an excessive focus on optimizing its value can lead to a compromise of its performance on the true objective. Previous work proposes Regularized BoN sampling (RBoN), a BoN sampling with regularization to the objective, and shows that it outperforms BoN sampling so that it mitigates reward hacking and empirically (Jinnai et al., 2024). However, Jinnai et al. (2024) introduce RBoN based on a heuristic and they lack the analysis of why such regularization strategy improves the performance of BoN sampling. The aim of this study is to analyze the effect of BoN sampling on regularization strategies. Using the regularization strategies corresponds to robust optimization, which maximizes the worst case over a set of possible perturbations in the proxy reward. Although the theoretical guarantees are not directly applicable to RBoN, RBoN corresponds to a practical implementation. This paper proposes an extension of the RBoN framework, called Stochastic RBoN sampling (SRBoN), which is a theoretically guaranteed approach to worst-case RBoN in proxy reward. We then perform an empirical evaluation using the AlpacaFarm and Anthropic's hh-rlhf datasets to evaluate which factors of the regularization strategies contribute to the improvement of the true proxy reward. In addition, we also propose another simple RBoN method, the Sentence Length Regularized BoN, which has a better performance in the experiment as compared to the previous methods.
中文: 通过引入正则化的最佳N采样方法,如随机RBoN和句子长度正则化BoN,能有效缓解大型语言模型中的奖励破解问题,使代理奖励更贴近真实目标,提升模型性能。
English: Best-of-N sampling with regularization, such as the proposed Stochastic RBoN and Sentence Length Regularized BoN, effectively mitigates reward hacking in Large Language Models by aligning proxy rewards with true objectives through robust optimization strategies.

Authors:Longfei Yun, Letian Peng, Jingbo Shang
Title: UltraGen: Extremely Fine-grained Controllable Generation via Attribute Reconstruction and Global Preference Optimization
Abstract:
Fine granularity is an essential requirement for controllable text generation, which has seen rapid growth with the ability of LLMs. However, existing methods focus mainly on a small set of attributes like 3 to 5, and their performance degrades significantly when the number of attributes increases to the next order of magnitude. To address this challenge, we propose a novel zero-shot approach for extremely fine-grained controllable generation (EFCG), proposing auto-reconstruction (AR) and global preference optimization (GPO). In the AR phase, we leverage LLMs to extract soft attributes (e.g., Emphasis on simplicity and minimalism in design) from raw texts, and combine them with programmatically derived hard attributes (e.g., The text should be between 300 and 400 words) to construct massive (around 45) multi-attribute requirements, which guide the fine-grained text reconstruction process under weak supervision. In the GPO phase, we apply direct preference optimization (DPO) to refine text generation under diverse attribute combinations, enabling efficient exploration of the global combination space. Additionally, we introduce an efficient attribute sampling strategy to identify and correct potentially erroneous attributes, further improving global optimization. Our framework significantly improves the constraint satisfaction rate (CSR) and text quality for EFCG by mitigating position bias and alleviating attention dilution.
Chinese: 针对现有可控文本生成方法在处理大量属性时的性能下降问题,我们提出了EFCG零样本框架,通过自动重建和全局偏好优化的结合,有效缓解位置偏差和注意力稀释,显著提升了约束满足率和文本质量。
English: To address the limitations of existing methods in handling large numbers of attributes for controllable text generation, we propose a zero-shot framework called EFCG that combines auto-reconstruction and global preference optimization to significantly improve constraint satisfaction and text quality by mitigating position bias and attention dilution.

Authors:Hanzhuo Huang, Yuan Liu, Ge Zheng, Jiepeng Wang, Zhiyang Dou, Sibei Yang
Title: MVTokenFlow: High-quality 4D Content Generation using Multiview Token Flow
Abstract:
In this paper, we present MVTokenFlow for high-quality 4D content creation from monocular videos. Recent advancements in generative models such as video diffusion models and multiview diffusion models enable us to create videos or 3D models. However, extending these generative models for dynamic 4D content creation is still a challenging task that requires the generated content to be consistent spatially and temporally. To address this challenge, MVTokenFlow utilizes the multiview diffusion model to generate multiview images on different timesteps, which attains spatial consistency across different viewpoints and allows us to reconstruct a reasonable coarse 4D field. Then, MVTokenFlow further regenerates all the multiview images using the rendered 2D flows as guidance. The 2D flows effectively associate pixels from different timesteps and improve the temporal consistency by reusing tokens in the regeneration process. Finally, the regenerated images are spatiotemporally consistent and utilized to refine the coarse 4D field to get a high-quality 4D field. Experiments demonstrate the effectiveness of our design and show significantly improved quality than baseline methods.
中文:MVTokenFlow提出了一种从单目视频生成高质量4D内容的新方法,通过首先生成空间一致的多视角图像,再利用令牌复用技术优化时间一致性,最终显著超越了现有基线方法的质量表现。
English: MVTokenFlow introduces a novel method for generating high-quality 4D content from monocular videos by first creating spatially consistent multiview images and then refining them with temporal consistency through token reuse, outperforming existing baseline methods.

Authors:Jiaze Li, Yaya Shi, Zongyang Ma, Haoran Xu, Feng Cheng, Huihui Xiao, Ruiwen Kang, Fan Yang, Tingting Gao, Di Zhang
Title: iMOVE: Instance-Motion-Aware Video Understanding
Abstract:
Enhancing the fine-grained instance spatiotemporal motion perception capabilities of Video Large Language Models is crucial for improving their temporal and general video understanding. However, current models struggle to perceive detailed and complex instance motions. To address these challenges, we have made improvements from both data and model perspectives. In terms of data, we have meticulously curated iMOVE-IT, the first large-scale instance-motion-aware video instruction-tuning dataset. This dataset is enriched with comprehensive instance motion annotations and spatiotemporal mutual-supervision tasks, providing extensive training for the model's instance-motion-awareness. Building on this foundation, we introduce iMOVE, an instance-motion-aware video foundation model that utilizes Event-aware Spatiotemporal Efficient Modeling to retain informative instance spatiotemporal motion details while maintaining computational efficiency. It also incorporates Relative Spatiotemporal Position Tokens to ensure awareness of instance spatiotemporal positions. Evaluations indicate that iMOVE excels not only in video temporal understanding and general video understanding but also demonstrates significant advantages in long-term video understanding.
中文: 本研究提出了iMOVE模型,通过构建专用数据集和采用高效建模技术,增强了视频大语言模型对细粒度实例时空运动的感知能力,在时序理解和长视频理解方面展现出卓越性能。
English: This study introduces iMOVE, an instance-motion-aware video foundation model enhanced with a curated dataset and efficient modeling techniques to improve fine-grained spatiotemporal motion perception in Video Large Language Models, demonstrating superior performance in temporal and long-term video understanding.

Authors:Letian Peng, Zilong Wang, Feng Yao, Jingbo Shang
Title: Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest
Abstract:
Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, for information extraction (IE), pre-training data, such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token \emph{prediction} into \emph{extraction} for tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, \emph{Cuckoo}, with 102.6M extractive data converted from LLM's pre-training and post-training data. Under the few-shot setting, Cuckoo adapts effectively to traditional and complex instruction-following IE with better performance than existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with the ongoing advancements in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.
中文摘要:提出的下一令牌提取(NTE)范式通过将预测任务转化为提取任务,使信息抽取模型能够利用大语言模型资源,由此开发的Cuckoo模型在少样本场景下表现优于现有方法,并能随大语言模型发展自动演进。
English Summary: The proposed next tokens extraction (NTE) paradigm enables information extraction models to leverage large language model resources by converting prediction tasks into extraction tasks, resulting in the Cuckoo model that outperforms existing methods while evolving automatically with LLM advancements.

Authors:Runtian Yuan, Mohan Chen, Jilan Xu, Ling Zhou, Qingqiu Li, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao
Title: Text-Promptable Propagation for Referring Medical Image Sequence Segmentation
Abstract:
Referring Medical Image Sequence Segmentation (Ref-MISS) is a novel and challenging task that aims to segment anatomical structures in medical image sequences (\emph{e.g.} endoscopy, ultrasound, CT, and MRI) based on natural language descriptions. This task holds significant clinical potential and offers a user-friendly advancement in medical imaging interpretation. Existing 2D and 3D segmentation models struggle to explicitly track objects of interest across medical image sequences, and lack support for nteractive, text-driven guidance. To address these limitations, we propose Text-Promptable Propagation (TPP), a model designed for referring medical image sequence segmentation. TPP captures the intrinsic relationships among sequential images along with their associated textual descriptions. Specifically, it enables the recognition of referred objects through cross-modal referring interaction, and maintains continuous tracking across the sequence via Transformer-based triple propagation, using text embeddings as queries. To support this task, we curate a large-scale benchmark, Ref-MISS-Bench, which covers 4 imaging modalities and 20 different organs and lesions. Experimental results on this benchmark demonstrate that TPP consistently outperforms state-of-the-art methods in both medical segmentation and referring video object segmentation.
中文摘要:Ref-MISS是一项通过自然语言描述分割医学图像序列中解剖结构的新任务,提出的TPP模型结合跨模态交互和基于Transformer的传播机制,在涵盖多模态的Ref-MISS-Bench基准测试中显著优于现有先进方法。
English Summary: Ref-MISS is a new medical imaging task that segments anatomical structures in image sequences using natural language, addressed by the proposed TPP model which integrates cross-modal interactions and transformer-based propagation to outperform existing methods.

Authors:Sanggeon Yun, Ryozo Masukawa, Hanning Chen, SungHeon Jeong, Wenjun Huang, Arghavan Rezvani, Minhyoung Na, Yoshiki Yamaguchi, Mohsen Imani
Title: Hyperdimensional Intelligent Sensing for Efficient Real-Time Audio Processing on Extreme Edge
Abstract:
The escalating challenges of managing vast sensor-generated data, particularly in audio applications, necessitate innovative solutions. Current systems face significant computational and storage demands, especially in real-time applications like gunshot detection systems (GSDS), and the proliferation of edge sensors exacerbates these issues. This paper proposes a groundbreaking approach with a near-sensor model tailored for intelligent audio-sensing frameworks. Utilizing a Fast Fourier Transform (FFT) module, convolutional neural network (CNN) layers, and HyperDimensional Computing (HDC), our model excels in low-energy, rapid inference, and online learning. It is highly adaptable for efficient ASIC design implementation, offering superior energy efficiency compared to conventional embedded CPUs or GPUs, and is compatible with the trend of shrinking microphone sensor sizes. Comprehensive evaluations at both software and hardware levels underscore the model's efficacy. Software assessments through detailed ROC curve analysis revealed a delicate balance between energy conservation and quality loss, achieving up to 82.1% energy savings with only 1.39% quality loss. Hardware evaluations highlight the model's commendable energy efficiency when implemented via ASIC design, especially with the Google Edge TPU, showcasing its superiority over prevalent embedded CPUs and GPUs.
Chinese: 本文提出了一种创新的近传感器智能音频感知模型,结合FFT、CNN和超维计算,实现了高能效和快速推理且质量损失极小,在软硬件评估中均展现出显著优势。
English: This paper introduces a novel near-sensor model for intelligent audio sensing that integrates FFT, CNN, and HyperDimensional Computing to achieve high energy efficiency and rapid inference with minimal quality loss, demonstrating significant advantages in both software and hardware evaluations.

Authors:Yuntao Wang, Qinnan Hu, Zhou Su, Linkang Du, Qichao Xu, Weiwei Li
Title: Large Model Empowered Metaverse: State-of-the-Art, Challenges and Opportunities
Abstract:
The Metaverse represents a transformative shift beyond traditional mobile Internet, creating an immersive, persistent digital ecosystem where users can interact, socialize, and work within 3D virtual environments. Powered by large models such as ChatGPT and Sora, the Metaverse benefits from precise large-scale real-world modeling, automated multimodal content generation, realistic avatars, and seamless natural language understanding, which enhance user engagement and enable more personalized, intuitive interactions. However, challenges remain, including limited scalability, constrained responsiveness, and low adaptability in dynamic environments. This paper investigates the integration of large models within the Metaverse, examining their roles in enhancing user interaction, perception, content creation, and service quality. To address existing challenges, we propose a generative AI-based framework for optimizing Metaverse rendering. This framework includes a cloud-edge-end collaborative model to allocate rendering tasks with minimal latency, a mobility-aware pre-rendering mechanism that dynamically adjusts to user movement, and a diffusion model-based adaptive rendering strategy to fine-tune visual details. Experimental results demonstrate the effectiveness of our approach in enhancing rendering efficiency and reducing rendering overheads, advancing large model deployment for a more responsive and immersive Metaverse.
中文摘要:本文研究大模型如何通过提升交互和内容创作来增强元宇宙体验,提出一种生成式AI框架,结合云边端协同与自适应渲染,以提高效率并降低渲染开销。
English Summary: This paper explores how large models enhance the Metaverse by improving interactions and content creation, proposing a generative AI framework with cloud-edge collaboration and adaptive rendering to boost efficiency and reduce overhead.

Authors:Albina Klepach, Alexander Nikulin, Ilya Zisman, Denis Tarasov, Alexander Derevyagin, Andrei Polubarov, Nikita Lyubaykin, Vladislav Kurenkov
Title: Object-Centric Latent Action Learning
Abstract:
Leveraging vast amounts of unlabeled internet video data for embodied AI is currently bottlenecked by the lack of action labels and the presence of action-correlated visual distractors. Although recent latent action policy optimization (LAPO) has shown promise in inferring proxy-action labels from visual observations, its performance degrades significantly when distractors are present. To address this limitation, we propose a novel object-centric latent action learning framework that centers on objects rather than pixels. We leverage self-supervised object-centric pretraining to disentangle action-related and distracting dynamics. This allows LAPO to focus on task-relevant interactions, resulting in more robust proxy-action labels, enabling better imitation learning and efficient adaptation of the agent with just a few action-labeled trajectories. We evaluated our method in eight visually complex tasks across the Distracting Control Suite (DCS) and Distracting MetaWorld (DMW). Our results show that object-centric pretraining mitigates the negative effects of distractors by 50%, as measured by downstream task performance: average return (DCS) and success rate (DMW).
中文: 我们提出的以物体为中心的潜在动作学习框架通过关注任务相关的物体交互,有效克服了未标记视频中的视觉干扰,在八项复杂任务中将模仿学习性能提升了50%。
English: Our proposed object-centric latent action learning framework overcomes visual distractors in unlabeled videos by focusing on task-relevant object interactions, improving imitation learning performance by 50% across eight complex tasks.

Authors:Xiaodong Li, Ruochen Yang, Shuang Wen, Shen Wang, Yueyang Liu, Guoquan Wang, Weisong Hu, Qiang Luo, Jiawei Sheng, Tingwen Liu, Jiangxia Cao, Shuang Yang, Zhaojie Liu
Title: FARM: Frequency-Aware Model for Cross-Domain Live-Streaming Recommendation
Abstract:
Live-streaming services have attracted widespread popularity due to their real-time interactivity and entertainment value. Users can engage with live-streaming authors by participating in live chats, posting likes, or sending virtual gifts to convey their preferences and support. However, the live-streaming services faces serious data-sparsity problem, which can be attributed to the following two points: (1) User's valuable behaviors are usually sparse, e.g., like, comment and gift, which are easily overlooked by the model, making it difficult to describe user's personalized preference. (2) The main exposure content on our platform is short-video, which is 9 times higher than the exposed live-streaming, leading to the inability of live-streaming content to fully model user preference. To this end, we propose a Frequency-Aware Model for Cross-Domain Live-Streaming Recommendation, termed as FARM. Specifically, we first present the intra-domain frequency aware module to enable our model to perceive user's sparse yet valuable behaviors, i.e., high-frequency information, supported by the Discrete Fourier Transform (DFT). To transfer user preference across the short-video and live-streaming domains, we propose a novel preference align before fuse strategy, which consists of two parts: the cross-domain preference align module to align user preference in both domains with contrastive learning, and the cross-domain preference fuse module to further fuse user preference in both domains using a serious of tailor-designed attention mechanisms. Extensive offline experiments and online A/B testing on Kuaishou live-streaming services demonstrate the effectiveness and superiority of FARM. Our FARM has been deployed in online live-streaming services and currently serves hundreds of millions of users on Kuaishou.
中文: 直播服务因用户行为稀疏和短视频内容主导而面临数据稀疏问题,为此提出的频率感知模型FARM通过离散傅里叶变换捕捉高频交互,并采用跨域偏好对齐与融合策略,已在快手平台成功部署并服务数亿用户。
English: Live-streaming services suffer from data sparsity due to sparse user interactions and overwhelming short-video exposure, prompting the development of FARM, a frequency-aware model that leverages Discrete Fourier Transform and cross-domain alignment to enhance recommendation accuracy, which has proven effective in large-scale deployment.

Authors:Yan Zhang, Wen Yang, Chang Xu, Qian Hu, Fang Xu, Gui-Song Xia
Title: Mitigating the Impact of Prominent Position Shift in Drone-based RGBT Object Detection
Abstract:
Drone-based RGBT object detection plays a crucial role in many around-the-clock applications. However, real-world drone-viewed RGBT data suffers from the prominent position shift problem, i.e., the position of a tiny object differs greatly in different modalities. For instance, a slight deviation of a tiny object in the thermal modality will induce it to drift from the main body of itself in the RGB modality. Considering RGBT data are usually labeled on one modality (reference), this will cause the unlabeled modality (sensed) to lack accurate supervision signals and prevent the detector from learning a good representation. Moreover, the mismatch of the corresponding feature point between the modalities will make the fused features confusing for the detection head. In this paper, we propose to cast the cross-modality box shift issue as the label noise problem and address it on the fly via a novel Mean Teacher-based Cross-modality Box Correction head ensemble (CBC). In this way, the network can learn more informative representations for both modalities. Furthermore, to alleviate the feature map mismatch problem in RGBT fusion, we devise a Shifted Window-Based Cascaded Alignment (SWCA) module. SWCA mines long-range dependencies between the spatially unaligned features inside shifted windows and cascaded aligns the sensed features with the reference ones. Extensive experiments on two drone-based RGBT object detection datasets demonstrate that the correction results are both visually and quantitatively favorable, thereby improving the detection performance. In particular, our CBC module boosts the precision of the sensed modality ground truth by 25.52 aSim points. Overall, the proposed detector achieves an mAP_50 of 43.55 points on RGBTDronePerson and surpasses a state-of-the-art method by 8.6 mAP50 on a shift subset of DroneVehicle dataset. The code and data will be made publicly available.
中文: 本文针对无人机RGB-T目标检测中的跨模态框偏移问题,提出基于均值教师模型的跨模态框校正(CBC)头和移位窗口级联对齐(SWCA)模块,通过修正标签噪声和对齐跨模态特征,有效提升了检测精度。
English: This paper addresses the cross-modality box shift issue in drone-based RGBT object detection by proposing a Mean Teacher-based Cross-modality Box Correction (CBC) head and a Shifted Window-Based Cascaded Alignment (SWCA) module, which improve detection accuracy by correcting label noise and aligning features across modalities.

Authors:Xinyi Gao, Dongting Xie, Yihang Zhang, Zhengren Wang, Chong Chen, Conghui He, Hongzhi Yin, Wentao Zhang
Title: A Comprehensive Survey on Imbalanced Data Learning
Abstract:
With the expansion of data availability, machine learning (ML) has achieved remarkable breakthroughs in both academia and industry. However, imbalanced data distributions are prevalent in various types of raw data and severely hinder the performance of ML by biasing the decision-making processes. To deepen the understanding of imbalanced data and facilitate the related research and applications, this survey systematically analyzes various real-world data formats and concludes existing researches for different data formats into four distinct categories: data re-balancing, feature representation, training strategy, and ensemble learning. This structured analysis helps researchers comprehensively understand the pervasive nature of imbalance across diverse data formats, thereby paving a clearer path toward achieving specific research goals. We provide an overview of relevant open-source libraries, spotlight current challenges, and offer novel insights aimed at fostering future advancements in this critical area of study.
中文: 本综述系统地将不同数据格式下的不平衡数据处理研究归纳为四类方法,在概述开源工具与现存挑战的同时,为未来研究方向提供了创新见解。
English: This survey systematically categorizes and analyzes machine learning approaches for handling imbalanced data across various formats, highlighting four key methodologies while identifying current challenges and future research directions.

Authors:Mo Yu, Lemao Liu, Junjie Wu, Tsz Ting Chung, Shunchi Zhang, Jiangnan Li, Dit-Yan Yeung, Jie Zhou
Title: The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of Physical Concept Understanding
Abstract:
In a systematic way, we investigate a widely asked question: Do LLMs really understand what they say?, which relates to the more familiar term Stochastic Parrot. To this end, we propose a summative assessment over a carefully designed physical concept understanding task, PhysiCo. Our task alleviates the memorization issue via the usage of grid-format inputs that abstractly describe physical phenomena. The grids represents varying levels of understanding, from the core phenomenon, application examples to analogies to other abstract patterns in the grid world. A comprehensive study on our task demonstrates: (1) state-of-the-art LLMs, including GPT-4o, o1 and Gemini 2.0 flash thinking, lag behind humans by ~40%; (2) the stochastic parrot phenomenon is present in LLMs, as they fail on our grid task but can describe and recognize the same concepts well in natural language; (3) our task challenges the LLMs due to intrinsic difficulties rather than the unfamiliar grid format, as in-context learning and fine-tuning on same formatted data added little to their performance.
中文摘要:本研究通过设计PhysiCo网格物理概念任务系统探讨大语言模型是否真正理解其输出,发现GPT-4o等顶尖模型表现落后人类约40%,且在网格任务中失败却擅长自然语言描述,证实了“随机鹦鹉”现象的存在。
English Summary: This study systematically investigates whether LLMs genuinely understand their output by designing PhysiCo, a grid-based physical concept task that minimizes memorization, revealing that top LLMs like GPT-4o significantly trail humans by about 40% and exhibit stochastic parrot behavior by failing grid tasks despite natural language proficiency.

Authors:Wittawat Jitkrittum, Harikrishna Narasimhan, Ankit Singh Rawat, Jeevesh Juneja, Congchao Wang, Zifeng Wang, Alec Go, Chen-Yu Lee, Pradeep Shenoy, Rina Panigrahy, Aditya Krishna Menon, Sanjiv Kumar
Title: Universal Model Routing for Efficient LLM Inference
Abstract:
Model routing is a simple technique for reducing the inference cost of large language models (LLMs), wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose UniRoute, a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective instantiations of UniRoute, relying on cluster-based routing and a learned cluster map respectively. We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound. Experiments on a range of public benchmarks show the effectiveness of UniRoute in routing amongst more than 30 unseen LLMs.
Chinese: UniRoute 是一种动态路由方法,通过将大型语言模型表示为特征向量,能够高效地将提示路由到最小合适的模型,即使在未见过的模型中也能实现,在超过30个LLM的基准测试中展现了有效性。
English: UniRoute is a dynamic routing approach that represents large language models as feature vectors to efficiently route prompts to the smallest suitable model, even among previously unseen models, demonstrating effectiveness across over 30 LLMs in benchmarks.

Authors:Ioannis Anagnostides, Ioannis Panageas, Tuomas Sandholm, Jingming Yan
Title: The Complexity of Symmetric Equilibria in Min-Max Optimization and Team Zero-Sum Games
Abstract:
We consider the problem of computing stationary points in min-max optimization, with a particular focus on the special case of computing Nash equilibria in (two-)team zero-sum games. We first show that computing $ε$-Nash equilibria in $3$-player \emph{adversarial} team games -- wherein a team of $2$ players competes against a \emph{single} adversary -- is \textsf{CLS}-complete, resolving the complexity of Nash equilibria in such settings. Our proof proceeds by reducing from \emph{symmetric} $ε$-Nash equilibria in \emph{symmetric}, identical-payoff, two-player games, by suitably leveraging the adversarial player so as to enforce symmetry -- without disturbing the structure of the game. In particular, the class of instances we construct comprises solely polymatrix games, thereby also settling a question left open by Hollender, Maystre, and Nagarajan (2024). We also provide some further results concerning equilibrium computation in adversarial team games. Moreover, we establish that computing \emph{symmetric} (first-order) equilibria in \emph{symmetric} min-max optimization is \textsf{PPAD}-complete, even for quadratic functions. Building on this reduction, we further show that computing symmetric $ε$-Nash equilibria in symmetric, $6$-player ($3$ vs. $3$) team zero-sum games is also \textsf{PPAD}-complete, even for $ε= \text{poly}(1/n)$. As an immediate corollary, this precludes the existence of symmetric dynamics -- which includes many of the algorithms considered in the literature -- converging to stationary points. Finally, we prove that computing a \emph{non-symmetric} $\text{poly}(1/n)$-equilibrium in symmetric min-max optimization is \textsf{FNP}-hard.
中文: 本研究证明了在3玩家对抗性团队博弈中计算ε-纳什均衡是CLS完全问题,而在对称极小极大优化和6玩家团队零和博弈中求解对称均衡是PPAD完全问题,揭示了这些场景下均衡计算存在的计算复杂性障碍。
English: This study demonstrates that computing ε-Nash equilibria in 3-player adversarial team games is CLS-complete, while finding symmetric equilibria in symmetric min-max optimization and 6-player team zero-sum games is PPAD-complete, revealing computational barriers for equilibrium computation in these settings.

Authors:Ziyao Wang, Muneeza Azmat, Ang Li, Raya Horesh, Mikhail Yurochkin
Title: Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding
Abstract:
Large Language Models (LLMs) often excel in specific domains but fall short in others due to the limitations of their training. Thus, enabling LLMs to solve problems collaboratively by integrating their complementary knowledge promises to improve their performance across domains. To realize this potential, we introduce a novel Collaborative Speculative Decoding (CoSD) algorithm that enables efficient LLM knowledge fusion at test time without requiring additional model training. CoSD employs a draft model to generate initial sequences and an easy-to-learn rule or decision tree to decide when to invoke an assistant model to improve these drafts. CoSD not only enhances knowledge fusion but also improves inference efficiency, is transferable across domains and models, and offers greater explainability. Experimental results demonstrate that CoSD improves accuracy by up to 10\% across benchmarks compared to existing methods, providing a scalable and effective solution for LLM-based applications
中文: 提出的协同推测解码(CoSD)算法无需额外训练即可在推理阶段实现大语言模型的高效知识融合,在各类基准测试中准确率最高提升10%,同时增强了可解释性与跨领域迁移能力。
English: The proposed Collaborative Speculative Decoding (CoSD) algorithm enables efficient knowledge fusion among Large Language Models during inference without retraining, improving accuracy by up to 10% across benchmarks while enhancing interpretability and transferability.

Authors:Italo Santos, Katia Romero Felizardo, Anita Sarma, Igor Steinmacher, Marco A. Gerosa
Title: OSSDoorway: A Gamified Environment to Scaffold Student Contributions to Open Source Software
Abstract:
Software engineering courses enable practical learning through assignments requiring contributions to open source software (OSS), allowing students to experience real-world projects, collaborate with global communities, and develop skills and competencies required to succeed in the tech industry. Learning software engineering through open source contribution integrates theory with hands-on practice, as students tackle real challenges in collaborative environments. However, students often struggle to contribute to OSS projects and do not understand the contribution process. Research has demonstrated that strategically incorporating game elements can promote student learning and engagement. This paper proposes and evaluates OSSDoorway, a tool designed to guide students contributing to OSS projects. We recruited 29 students and administered a self-efficacy questionnaire before and after their use of OSSDoorway, along with qualitative feedback to assess challenges, interface features, and suggestions for improvement. The results show that OSSDoorway boosts students' self-efficacy and provides a structured, gamified learning experience. Clear instructions, real-time feedback, and the quest-based system helped students navigate tasks like using GitHub features to submit pull requests and collaborating with the community. Our findings suggest that providing students with a supportive gamified environment that uses feedback and structured quests can help them navigate the OSS contribution process.
中文: OSSDoorway 是一款游戏化工具,通过结构化任务和实时反馈增强学生的自我效能感,指导他们完成开源贡献流程,有效应对软件工程教育中的实践挑战。
English: OSSDoorway is a gamified tool that enhances students' self-efficacy and guides them through the open source contribution process with structured quests and real-time feedback, addressing challenges in software engineering education.

Authors:Jian Yang, Wei Zhang, Jiaxi Yang, Yibo Miao, Shanghaoran Quan, Zhenhe Wu, Qiyao Peng, Liqun Yang, Tianyu Liu, Zeyu Cui, Binyuan Hui, Junyang Lin
Title: Multi-Agent Collaboration for Multilingual Code Instruction Tuning
Abstract:
Recent advancement in code understanding and generation demonstrates that code LLMs fine-tuned on a high-quality instruction dataset can gain powerful capabilities to address wide-ranging code-related tasks. However, most previous existing methods mainly view each programming language in isolation and ignore the knowledge transfer among different programming languages. To bridge the gap among different programming languages, we introduce a novel multi-agent collaboration framework to enhance multilingual instruction tuning for code LLMs, where multiple language-specific intelligent agent components with generation memory work together to transfer knowledge from one language to another efficiently and effectively. Specifically, we first generate the language-specific instruction data from the code snippets and then provide the generated data as the seed data for language-specific agents. Multiple language-specific agents discuss and collaborate to formulate a new instruction and its corresponding solution (A new programming language or existing programming language), To further encourage the cross-lingual transfer, each agent stores its generation history as memory and then summarizes its merits and faults. Finally, the high-quality multilingual instruction data is used to encourage knowledge transfer among different programming languages to train Qwen2.5-xCoder. Experimental results on multilingual programming benchmarks demonstrate the superior performance of Qwen2.5-xCoder in sharing common knowledge, highlighting its potential to reduce the cross-lingual gap.
Chinese: 一种新颖的多智能体协作框架通过语言特定代理间的讨论和记忆共享,实现了跨编程语言的知识迁移,从而增强了代码大语言模型的多语言指令调优,使得Qwen2.5-xCoder在多语言基准测试中表现出卓越性能。
English: A novel multi-agent collaboration framework enhances multilingual instruction tuning for code LLMs by enabling language-specific agents to transfer knowledge across programming languages through discussion and memory sharing, resulting in Qwen2.5-xCoder's superior performance on multilingual benchmarks.

Authors:Junjie Wu, Mo Yu, Lemao Liu, Dit-Yan Yeung, Jie Zhou
Title: Understanding LLMs' Fluid Intelligence Deficiency: An Analysis of the ARC Task
Abstract:
While LLMs have exhibited strong performance on various NLP tasks, it is noteworthy that most of these tasks rely on utilizing the vast amount of knowledge encoded in LLMs' parameters, rather than solving new problems without prior knowledge. In cognitive research, the latter ability is referred to as fluid intelligence, which is considered to be critical for assessing human intelligence. Recent research on fluid intelligence assessments has highlighted significant deficiencies in LLMs' abilities. In this paper, we analyze the challenges LLMs face in demonstrating fluid intelligence through controlled experiments, using the most representative ARC task as an example. Our study revealed three major limitations in existing LLMs: limited ability for skill composition, unfamiliarity with abstract input formats, and the intrinsic deficiency of left-to-right decoding. Our data and code can be found in https://wujunjie1998.github.io/araoc-benchmark.github.io/.
中文: 大语言模型在流体智力方面存在显著不足,具体表现为技能组合能力有限、对抽象输入格式不熟悉以及从左到右解码的内在缺陷,这在ARC等任务中尤为明显。
English: Large language models (LLMs) exhibit significant deficiencies in fluid intelligence, as demonstrated by their limitations in skill composition, unfamiliarity with abstract inputs, and left-to-right decoding constraints in tasks like ARC.

Authors:Bo Ni, Zheyuan Liu, Leyao Wang, Yongjia Lei, Yuying Zhao, Xueqi Cheng, Qingkai Zeng, Luna Dong, Yinglong Xia, Krishnaram Kenthapadi, Ryan Rossi, Franck Dernoncourt, Md Mehrab Tanjim, Nesreen Ahmed, Xiaorui Liu, Wenqi Fan, Erik Blasch, Yu Wang, Meng Jiang, Tyler Derr
Title: Towards Trustworthy Retrieval Augmented Generation for Large Language Models: A Survey
Abstract:
Retrieval-Augmented Generation (RAG) is an advanced technique designed to address the challenges of Artificial Intelligence-Generated Content (AIGC). By integrating context retrieval into content generation, RAG provides reliable and up-to-date external knowledge, reduces hallucinations, and ensures relevant context across a wide range of tasks. However, despite RAG's success and potential, recent studies have shown that the RAG paradigm also introduces new risks, including robustness issues, privacy concerns, adversarial attacks, and accountability issues. Addressing these risks is critical for future applications of RAG systems, as they directly impact their trustworthiness. Although various methods have been developed to improve the trustworthiness of RAG methods, there is a lack of a unified perspective and framework for research in this topic. Thus, in this paper, we aim to address this gap by providing a comprehensive roadmap for developing trustworthy RAG systems. We place our discussion around five key perspectives: reliability, privacy, safety, fairness, explainability, and accountability. For each perspective, we present a general framework and taxonomy, offering a structured approach to understanding the current challenges, evaluating existing solutions, and identifying promising future research directions. To encourage broader adoption and innovation, we also highlight the downstream applications where trustworthy RAG systems have a significant impact.
Chinese: 检索增强生成(RAG)通过整合外部知识提升人工智能生成内容的准确性和相关性,但也带来了鲁棒性、隐私等风险;本文为此提出一个涵盖可靠性、隐私、安全等五大视角的综合框架,以指导构建可信赖的RAG系统。
English: Retrieval-Augmented Generation (RAG) enhances AI-generated content by integrating external knowledge to reduce errors and ensure relevance, yet it introduces risks like robustness and privacy issues, prompting this paper to propose a comprehensive framework for developing trustworthy RAG systems across five key perspectives.

Authors:Hailong Jiang, Jianfeng Zhu, Yao Wan, Bo Fang, Hongyu Zhang, Ruoming Jin, Qiang Guan
Title: Can Large Language Models Understand Intermediate Representations in Compilers?
Abstract:
Intermediate Representations (IRs) play a critical role in compiler design and program analysis, yet their comprehension by Large Language Models (LLMs) remains underexplored. In this paper, we present an explorative empirical study evaluating the capabilities of six state-of-the-art LLMs: GPT-4, GPT-3, DeepSeek, Gemma 2, Llama 3, and Code Llama, in understanding IRs. Specifically, we assess model performance across four core tasks: control flow graph reconstruction, decompilation, code summarization, and execution reasoning. While LLMs exhibit competence in parsing IR syntax and identifying high-level structures, they consistently struggle with instruction-level reasoning, especially in control flow reasoning, loop handling, and dynamic execution. Common failure modes include misinterpreting branching instructions, omitting critical operations, and relying on heuristic reasoning rather than precise instruction-level logic. Our findings highlight the need for IR-specific enhancements in LLM design. We recommend fine-tuning on structured IR datasets and integrating control-flow-sensitive architectures to improve model effectiveness. All experimental data and source code are publicly available at
中文: 本研究评估了六种先进大语言模型对中间表示的理解能力,发现它们在语法解析方面表现良好,但在指令级推理(尤其是控制流和动态执行)上存在明显不足,亟需针对中间表示特性进行架构优化。
English: This study evaluates six advanced LLMs' understanding of intermediate representations, revealing their strengths in syntax parsing but significant limitations in instruction-level reasoning, particularly with control flow and dynamic execution, necessitating IR-specific architectural improvements.

Authors:Zhengyuan Shi, Chengyu Ma, Ziyang Zheng, Lingfeng Zhou, Hongyang Pan, Wentao Jiang, Fan Yang, Xiaoyan Yang, Zhufei Chu, Qiang Xu
Title: DeepCell: Self-Supervised Multiview Fusion for Circuit Representation Learning
Abstract:
We introduce DeepCell, a novel circuit representation learning framework that effectively integrates multiview information from both And-Inverter Graphs (AIGs) and Post-Mapping (PM) netlists. At its core, DeepCell employs a self-supervised Mask Circuit Modeling (MCM) strategy, inspired by masked language modeling, to fuse complementary circuit representations from different design stages into unified and rich embeddings. To our knowledge, DeepCell is the first framework explicitly designed for PM netlist representation learning, setting new benchmarks in both predictive accuracy and reconstruction quality. We demonstrate the practical efficacy of DeepCell by applying it to critical EDA tasks such as functional Engineering Change Orders (ECO) and technology mapping. Extensive experimental results show that DeepCell significantly surpasses state-of-the-art open-source EDA tools in efficiency and performance.
中文: DeepCell是一种创新的电路表示学习框架,通过自监督的掩码电路建模融合AIG和PM网表的多视角信息,在功能ECO和技术映射等EDA任务中实现了卓越的准确性与性能突破。
English: DeepCell is a pioneering circuit representation learning framework that integrates multiview data from AIGs and PM netlists using self-supervised MCM, achieving superior accuracy and performance in EDA tasks like functional ECO and technology mapping.

Authors:Dimitrios Tyrovolas, Sotiris A. Tegos, Panagiotis D. Diamantoulakis, Sotiris Ioannidis, Christos K. Liaskos, George K. Karagiannidis
Title: Performance Analysis of Pinching-Antenna Systems
Abstract:
The sixth generation of wireless networks envisions intelligent and adaptive environments capable of meeting the demands of emerging applications such as immersive extended reality, advanced healthcare, and the metaverse. However, this vision requires overcoming critical challenges, including the limitations of conventional wireless technologies in mitigating path loss and dynamically adapting to diverse user needs. Among the proposed reconfigurable technologies, pinching antenna systems (PASs) offer a novel way to turn path loss into a programmable parameter by using dielectric waveguides to minimize propagation losses at high frequencies. In this paper, we develop a comprehensive analytical framework that derives closed-form expressions for the outage probability and average rate of PASs while incorporating both free-space path loss and waveguide attenuation under realistic conditions. In addition, we characterize the optimal placement of pinching antennas to maximize performance under waveguide losses. Numerical results show the significant impact of waveguide losses on system performance, especially for longer waveguides, emphasizing the importance of accurate loss modeling. Despite these challenges, PASs consistently outperform conventional systems in terms of reliability and data rate, underscoring their potential to enable high-performance programmable wireless environments.
中文: 第六代无线网络旨在为先进应用构建智能环境,但面临路径损耗等挑战,而夹持天线系统通过将路径损耗可编程化,在可靠性和数据速率上优于传统系统。
English: The sixth-generation wireless networks aim to create intelligent environments for advanced applications but face challenges like path loss, which pinching antenna systems (PASs) address by making path loss programmable and outperforming conventional systems in reliability and data rate.

Authors:Qingshan Hou, Yukun Zhou, Jocelyn Hui Lin Goh, Ke Zou, Samantha Min Er Yew, Sahana Srinivasan, Meng Wang, Thaddaeus Lo, Xiaofeng Lei, Siegfried K. Wagner, Mark A. Chia, Dawei Yang, Hongyang Jiang, An Ran Ran, Rui Santos, Gabor Mark Somfai, Juan Helen Zhou, Haoyu Chen, Qingyu Chen, Carol Y. Cheung, Pearse A. Keane, Yih Chung Tham
Title: Is an Ultra Large Natural Image-Based Foundation Model Superior to a Retina-Specific Model for Detecting Ocular and Systemic Diseases?
Abstract:
The advent of foundation models (FMs) is transforming medical domain. In ophthalmology, RETFound, a retina-specific FM pre-trained sequentially on 1.4 million natural images and 1.6 million retinal images, has demonstrated high adaptability across clinical applications. Conversely, DINOv2, a general-purpose vision FM pre-trained on 142 million natural images, has shown promise in non-medical domains. However, its applicability to clinical tasks remains underexplored. To address this, we conducted head-to-head evaluations by fine-tuning RETFound and three DINOv2 models (large, base, small) for ocular disease detection and systemic disease prediction tasks, across eight standardized open-source ocular datasets, as well as the Moorfields AlzEye and the UK Biobank datasets. DINOv2-large model outperformed RETFound in detecting diabetic retinopathy (AUROC=0.850-0.952 vs 0.823-0.944, across three datasets, all P<=0.007) and multi-class eye diseases (AUROC=0.892 vs. 0.846, P<0.001). In glaucoma, DINOv2-base model outperformed RETFound (AUROC=0.958 vs 0.940, P<0.001). Conversely, RETFound achieved superior performance over all DINOv2 models in predicting heart failure, myocardial infarction, and ischaemic stroke (AUROC=0.732-0.796 vs 0.663-0.771, all P<0.001). These trends persisted even with 10% of the fine-tuning data. These findings showcase the distinct scenarios where general-purpose and domain-specific FMs excel, highlighting the importance of aligning FM selection with task-specific requirements to optimise clinical performance.
中文: 研究表明,通用基础模型DINOv2在检测糖尿病视网膜病变和青光眼等眼部疾病方面表现更优,而领域专用模型RETFound则在预测心力衰竭和心肌梗死等全身性疾病方面更为出色,凸显了根据具体临床任务选择适配基础模型的重要性。
English: The study demonstrates that while the general-purpose DINOv2 model excels in detecting ocular diseases like diabetic retinopathy and glaucoma, the domain-specific RETFound performs better in predicting systemic conditions such as heart failure and stroke, emphasizing the need to match foundation models with specific clinical tasks for optimal outcomes.

Authors:Sicen Guo, Tianyou Wen, Chuang-Wei Liu, Qijun Chen, Rui Fan
Title: Fully Exploiting Vision Foundation Model's Profound Prior Knowledge for Generalizable RGB-Depth Driving Scene Parsing
Abstract:
Recent vision foundation models (VFMs), typically based on Vision Transformer (ViT), have significantly advanced numerous computer vision tasks. Despite their success in tasks focused solely on RGB images, the potential of VFMs in RGB-depth driving scene parsing remains largely under-explored. In this article, we take one step toward this emerging research area by investigating a feasible technique to fully exploit VFMs for generalizable RGB-depth driving scene parsing. Specifically, we explore the inherent characteristics of RGB and depth data, thereby presenting a Heterogeneous Feature Integration Transformer (HFIT). This network enables the efficient extraction and integration of comprehensive heterogeneous features without re-training ViTs. Relative depth prediction results from VFMs, used as inputs to the HFIT side adapter, overcome the limitations of the dependence on depth maps. Our proposed HFIT demonstrates superior performance compared to all other traditional single-modal and data-fusion scene parsing networks, pre-trained VFMs, and ViT adapters on the Cityscapes and KITTI Semantics datasets. We believe this novel strategy paves the way for future innovations in VFM-based data-fusion techniques for driving scene parsing. Our source code is publicly available at https://mias.group/HFIT.
中文: 本研究提出了一种异构特征集成变换器(HFIT),通过挖掘RGB和深度数据的特性,在无需重新训练视觉变换器的情况下,显著提升了RGB-深度驾驶场景解析的性能,并在多个基准数据集上表现优异。
English: This study introduces a Heterogeneous Feature Integration Transformer (HFIT) that effectively leverages vision foundation models for RGB-depth driving scene parsing, achieving superior performance on benchmark datasets without retraining ViTs.

Authors:Haibo Zhao, Dian Wang, Yizhe Zhu, Xupeng Zhu, Owen Howell, Linfeng Zhao, Yaoyao Qian, Robin Walters, Robert Platt
Title: Hierarchical Equivariant Policy via Frame Transfer
Abstract:
Recent advances in hierarchical policy learning highlight the advantages of decomposing systems into high-level and low-level agents, enabling efficient long-horizon reasoning and precise fine-grained control. However, the interface between these hierarchy levels remains underexplored, and existing hierarchical methods often ignore domain symmetry, resulting in the need for extensive demonstrations to achieve robust performance. To address these issues, we propose Hierarchical Equivariant Policy (HEP), a novel hierarchical policy framework. We propose a frame transfer interface for hierarchical policy learning, which uses the high-level agent's output as a coordinate frame for the low-level agent, providing a strong inductive bias while retaining flexibility. Additionally, we integrate domain symmetries into both levels and theoretically demonstrate the system's overall equivariance. HEP achieves state-of-the-art performance in complex robotic manipulation tasks, demonstrating significant improvements in both simulation and real-world settings.
中文: 提出的分层等变策略(HEP)通过引入帧传递接口并整合领域对称性,改进了分层策略学习,在机器人操作任务中实现了最先进的性能。
English: The proposed Hierarchical Equivariant Policy (HEP) introduces a frame transfer interface and incorporates domain symmetries to enhance hierarchical policy learning, achieving state-of-the-art performance in robotic manipulation tasks.

Authors:Venkatesh Mishra, Bimsara Pathiraja, Mihir Parmar, Sat Chidananda, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral
Title: Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning
Abstract:
Reasoning abilities of LLMs have been a key focus in recent years. One challenging reasoning domain with interesting nuances is legal reasoning, which requires careful application of rules, and precedents while balancing deductive and analogical reasoning, and conflicts between rules. Although there have been a few works on using LLMs for legal reasoning, their focus has been on overall accuracy. In this paper, we dig deeper to do a step-by-step analysis and figure out where they commit errors. We use the college-level Multiple Choice Question-Answering (MCQA) task from the \textit{Civil Procedure} dataset and propose a new error taxonomy derived from initial manual analysis of reasoning chains with respect to several LLMs, including two objective measures: soundness and correctness scores. We then develop an LLM-based automated evaluation framework to identify reasoning errors and evaluate the performance of LLMs. The computation of soundness and correctness on the dataset using the auto-evaluator framework reveals several interesting insights. Furthermore, we show that incorporating the error taxonomy as feedback in popular prompting techniques marginally increases LLM performance. Our work will also serve as an evaluation framework that can be used in detailed error analysis of reasoning chains for logic-intensive complex tasks.
中文: 本研究通过新的错误分类法和自动化评估框架对大型语言模型的法律推理错误进行逐步分析,揭示了关键发现,并证明结合错误反馈能略微提升模型性能。
English: This study conducts a step-by-step analysis of legal reasoning errors in LLMs using a new error taxonomy and automated evaluation framework, revealing key insights and demonstrating that incorporating error feedback marginally improves performance.

Authors:Letian Peng, Chenyang An, Shibo Hao, Chengyu Dong, Jingbo Shang
Title: Linear Correlation in LM's Compositional Generalization and Hallucination
Abstract:
The generalization of language models (LMs) is undergoing active debates, contrasting their potential for general intelligence with their struggles with basic knowledge composition (e.g., reverse/transition curse). This paper uncovers the phenomenon of linear correlations in LMs during knowledge composition. For explanation, there exists a linear transformation between certain related knowledge that maps the next token prediction logits from one prompt to another, e.g., "X lives in the city of" $\rightarrow$ "X lives in the country of" for every given X. This mirrors the linearity in human knowledge composition, such as Paris $\rightarrow$ France. Our findings indicate that the linear transformation is resilient to large-scale fine-tuning, generalizing updated knowledge when aligned with real-world relationships, but causing hallucinations when it deviates. Empirical results suggest that linear correlation can serve as a potential identifier of LM's generalization. Finally, we show such linear correlations can be learned with a single feedforward network and pre-trained vocabulary representations, indicating LM generalization heavily relies on the latter.
中文: 本文揭示了语言模型在知识组合中表现出线性相关性,这种相关性在微调后依然稳定,可作为泛化能力的指标,但若与现实不符则可能导致幻觉。
English: This paper reveals that language models exhibit linear correlations in knowledge composition, which are resilient to fine-tuning and can indicate generalization but may lead to hallucinations when misaligned with reality.

Authors:Zongwei Li, Xiaoqi Li, Wenkai Li, Xin Wang
Title: SCALM: Detecting Bad Practices in Smart Contracts Through LLMs
Abstract:
As the Ethereum platform continues to mature and gain widespread usage, it is crucial to maintain high standards of smart contract writing practices. While bad practices in smart contracts may not directly lead to security issues, they do elevate the risk of encountering problems. Therefore, to understand and avoid these bad practices, this paper introduces the first systematic study of bad practices in smart contracts, delving into over 35 specific issues. Specifically, we propose a large language models (LLMs)-based framework, SCALM. It combines Step-Back Prompting and Retrieval-Augmented Generation (RAG) to identify and address various bad practices effectively. Our extensive experiments using multiple LLMs and datasets have shown that SCALM outperforms existing tools in detecting bad practices in smart contracts.
中文摘要:本文提出了SCALM框架,首次系统性地利用大语言模型结合Step-Back Prompting和检索增强生成技术,有效识别并解决智能合约中的35种以上不良实践,实验证明其性能优于现有工具。
English Summary: This paper introduces SCALM, the first systematic framework using large language models with Step-Back Prompting and RAG to effectively identify and address over 35 bad practices in smart contracts, demonstrating superior performance over existing tools.

Authors:Zichang He, Rudy Raymond, Ruslan Shaydulin, Marco Pistoia
Title: Non-Variational Quantum Random Access Optimization with Alternating Operator Ansatz
Abstract:
Solving hard optimization problems is one of the most promising application domains for quantum computers due to the ubiquity of such problems in industry and the availability of broadly applicable quantum speedups. However, the ability of near-term quantum computers to tackle industrial-scale optimization problems is limited by their size and the overheads of quantum error correction. Quantum Random Access Optimization (QRAO) has been proposed to reduce the space requirements of quantum optimization. However, to date QRAO has only been implemented using variational algorithms, which suffer from the need to train instance-specific variational parameters, making them difficult to scale. We propose and benchmark a non-variational approach to QRAO based on the Quantum Alternating Operator Ansatz (QAOA) for the MaxCut problem. We show that instance-independent ``fixed" parameters achieve good performance, removing the need for variational parameter optimization. Additionally, we evaluate different design choices, such as various mixers, initial states, and QRAO-specific implementations of the QAOA cost operator, and identify a strategy that performs well in practice. Our results pave the way for the practical execution of QRAO on early fault-tolerant quantum computers.
Chinese: 本文提出了一种基于量子交替算子Ansatz(QAOA)的非变分量子随机访问优化(QRAO)方法,通过消除参数优化需求并确定有效设计策略,为早期容错量子计算机实现实用量子优化铺平了道路。
English: This paper introduces a non-variational Quantum Random Access Optimization (QRAO) method using QAOA for MaxCut, eliminating parameter training needs and identifying effective design strategies to enable practical quantum optimization on early fault-tolerant devices.

Authors:Dongya Jia, Zhuo Chen, Jiawei Chen, Chenpeng Du, Jian Wu, Jian Cong, Xiaobin Zhuang, Chumin Li, Zhen Wei, Yuping Wang, Yuxuan Wang
Title: DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
Abstract:
Several recent studies have attempted to autoregressively generate continuous speech representations without discrete speech tokens by combining diffusion and autoregressive models, yet they often face challenges with excessive computational loads or suboptimal outcomes. In this work, we propose Diffusion Transformer Autoregressive Modeling (DiTAR), a patch-based autoregressive framework combining a language model with a diffusion transformer. This approach significantly enhances the efficacy of autoregressive models for continuous tokens and reduces computational demands. DiTAR utilizes a divide-and-conquer strategy for patch generation, where the language model processes aggregated patch embeddings and the diffusion transformer subsequently generates the next patch based on the output of the language model. For inference, we propose defining temperature as the time point of introducing noise during the reverse diffusion ODE to balance diversity and determinism. We also show in the extensive scaling analysis that DiTAR has superb scalability. In zero-shot speech generation, DiTAR achieves state-of-the-art performance in robustness, speaker similarity, and naturalness.
中文: 本文提出DiTAR,一种结合语言模型与扩散变换器的基于分块的自回归框架,能高效生成连续语音表征,在零样本语音生成中实现最优性能并降低计算需求。
English: This paper introduces DiTAR, a patch-based autoregressive framework combining a language model with a diffusion transformer to efficiently generate continuous speech representations, achieving state-of-the-art performance in zero-shot speech generation while reducing computational demands.

Authors:Yibo Xu, Dawei Zhou, Decheng Liu, Nannan Wang
Title: Improving Adversarial Robustness via Phase and Amplitude-aware Prompting
Abstract:
Deep neural networks are found to be vulnerable to adversarial perturbations. The prompt-based defense has been increasingly studied due to its high efficiency. However, existing prompt-based defenses mainly exploited mixed prompt patterns, where critical patterns closely related to object semantics lack sufficient focus. The phase and amplitude spectra have been proven to be highly related to specific semantic patterns and crucial for robustness. To this end, in this paper, we propose a Phase and Amplitude-aware Prompting (PAP) defense. Specifically, we construct phase-level and amplitude-level prompts for each class, and adjust weights for prompting according to the model's robust performance under these prompts during training. During testing, we select prompts for each image using its predicted label to obtain the prompted image, which is inputted to the model to get the final prediction. Experimental results demonstrate the effectiveness of our method.
中文摘要:本文提出的相位与幅度感知提示(PAP)防御方法通过构建相位和幅度谱提示来增强深度神经网络对抗扰动的鲁棒性,实验验证了该方法的有效性。
English Summary: The proposed Phase and Amplitude-aware Prompting (PAP) defense method constructs specialized prompts using phase and amplitude spectra to enhance neural network robustness against adversarial attacks, demonstrating superior performance in experiments.

Authors:Jacob de Nobel, Diederick Vermetten, Hao Wang, Anna V. Kononova, Günter Rudolph, Thomas Bäck
Title: Abnormal Mutations: Evolution Strategies Don't Require Gaussianity
Abstract:
The mutation process in evolution strategies has been interlinked with the normal distribution since its inception. Many lines of reasoning have been given for this strong dependency, ranging from maximum entropy arguments to the need for isotropy. However, some theoretical results suggest that other distributions might lead to similar local convergence properties. This paper empirically shows that a wide range of evolutionary strategies, from the (1+1)-ES to CMA-ES, show comparable optimization performance when using a mutation distribution other than the standard Gaussian. Replacing it with, e.g., uniformly distributed mutations, does not deteriorate the performance of ES, when using the default adaptation mechanism for the strategy parameters. We observe that these results hold not only for the sphere model but also for a wider range of benchmark problems.
中文: 研究表明,多种进化策略在使用非高斯分布(如均匀分布)进行突变时,在多种基准问题上仍能保持相当的优化性能。
English: This study demonstrates that various evolutionary strategies maintain comparable optimization performance when using non-Gaussian mutation distributions, such as uniform distributions, across multiple benchmark problems.

Authors:Pat Pataranutaporn, Alexander Doudkin, Pattie Maes
Title: OceanChat: The Effect of Virtual Conversational AI Agents on Sustainable Attitude and Behavior Change
Abstract:
Marine ecosystems face unprecedented threats from climate change and plastic pollution, yet traditional environmental education often struggles to translate awareness into sustained behavioral change. This paper presents OceanChat, an interactive system leveraging large language models to create conversational AI agents represented as animated marine creatures -- specifically a beluga whale, a jellyfish, and a seahorse -- designed to promote environmental behavior (PEB) and foster awareness through personalized dialogue. Through a between-subjects experiment (N=900), we compared three conditions: (1) Static Scientific Information, providing conventional environmental education through text and images; (2) Static Character Narrative, featuring first-person storytelling from 3D-rendered marine creatures; and (3) Conversational Character Narrative, enabling real-time dialogue with AI-powered marine characters. Our analysis revealed that the Conversational Character Narrative condition significantly increased behavioral intentions and sustainable choice preferences compared to static approaches. The beluga whale character demonstrated consistently stronger emotional engagement across multiple measures, including perceived anthropomorphism and empathy. However, impacts on deeper measures like climate policy support and psychological distance were limited, highlighting the complexity of shifting entrenched beliefs. Our work extends research on sustainability interfaces facilitating PEB and offers design principles for creating emotionally resonant, context-aware AI characters. By balancing anthropomorphism with species authenticity, OceanChat demonstrates how interactive narratives can bridge the gap between environmental knowledge and real-world behavior change.
中文摘要:OceanChat是一款利用海洋生物角色进行对话的交互式AI系统,通过个性化交流显著提升了环保行为意愿,但对深层信念体系的影响仍显不足。
English Summary: OceanChat is an interactive AI system using conversational marine creature characters that significantly boosts environmental behavioral intentions through personalized dialogue, though its impact on deeper belief systems remains limited.

Authors:Zag ElSayed, Nelly Elsayed, Ahmed Abdelgawad
Title: Carbon Per Transistor (CPT): The Golden Formula for Green Computing Metrics
Abstract:
As computing power advances, the environmental cost of semiconductor manufacturing and operation has become a critical concern. However, current sustainability metrics fail to quantify carbon emissions at the transistor level, the fundamental building block of modern processors. This paper introduces a Carbon Per Transistor (CPT) formula -- a novel approach and green implementation metric to measuring the CO$_2$ footprint of semiconductor chips from fabrication to end-of-life. By integrating emissions from silicon crystal growth, wafer production, chip manufacturing, and operational power dissipation, the CPT formula provides a scientifically rigorous benchmark for evaluating the sustainability of computing hardware. Using real-world data from Intel Core i9-13900K, AMD Ryzen 9 7950X, and Apple M1/M2/M3 processors, we reveal a startling insight-manufacturing emissions dominate, contributing 60-125 kg CO$_2$ per CPU, far exceeding operational emissions over a typical device lifespan. Notably, Apple's high-transistor-count M-series chips, despite their energy efficiency, exhibit a significantly larger carbon footprint than traditional processors due to extensive fabrication impact. This research establishes a critical reference point for green computing initiatives, enabling industry leaders and researchers to make data-driven decisions in reducing semiconductor-related emissions and get correct estimates for the green factor of the information technology process. The proposed formula paves the way for carbon-aware chip design, regulatory standards, and future innovations in sustainable computing.
中文: 本文提出晶体管碳排放(CPT)公式,用于量化半导体从制造到运行的碳排放,揭示制造过程是碳足迹的主要来源,苹果M系列芯片虽能效高但因制造影响碳排放更大。
English: This paper introduces a Carbon Per Transistor (CPT) formula to quantify semiconductor carbon emissions from manufacturing to operation, revealing that manufacturing contributes most to the carbon footprint, with Apple's M-series chips showing higher emissions despite their energy efficiency.

Authors:Xupeng Zhu, David Klee, Dian Wang, Boce Hu, Haojie Huang, Arsh Tangri, Robin Walters, Robert Platt
Title: Coarse-to-Fine 3D Keyframe Transporter
Abstract:
Recent advances in Keyframe Imitation Learning (IL) have enabled learning-based agents to solve a diverse range of manipulation tasks. However, most approaches ignore the rich symmetries in the problem setting and, as a consequence, are sample-inefficient. This work identifies and utilizes the bi-equivariant symmetry within Keyframe IL to design a policy that generalizes to transformations of both the workspace and the objects grasped by the gripper. We make two main contributions: First, we analyze the bi-equivariance properties of the keyframe action scheme and propose a Keyframe Transporter derived from the Transporter Networks, which evaluates actions using cross-correlation between the features of the grasped object and the features of the scene. Second, we propose a computationally efficient coarse-to-fine SE(3) action evaluation scheme for reasoning the intertwined translation and rotation action. The resulting method outperforms strong Keyframe IL baselines by an average of >10% on a wide range of simulation tasks, and by an average of 55% in 4 physical experiments.
Chinese: 近期关键帧模仿学习方法通过利用双等变对称性,开发了关键帧传输器和粗到精的SE(3)动作评估方案,在仿真和物理实验中显著提升了任务性能。
English: Recent Keyframe Imitation Learning methods have been enhanced by incorporating bi-equivariant symmetry, leading to the development of a Keyframe Transporter and a coarse-to-fine SE(3) action evaluation scheme that significantly improves performance in both simulated and physical tasks.

Authors:Jinwei Hu, Yi Dong, Shuang Ao, Zhuoyun Li, Boxuan Wang, Lokesh Singh, Guangliang Cheng, Sarvapali D. Ramchurn, Xiaowei Huang
Title: Position: Towards a Responsible LLM-empowered Multi-Agent Systems
Abstract:
The rise of Agent AI and Large Language Model-powered Multi-Agent Systems (LLM-MAS) has underscored the need for responsible and dependable system operation. Tools like LangChain and Retrieval-Augmented Generation have expanded LLM capabilities, enabling deeper integration into MAS through enhanced knowledge retrieval and reasoning. However, these advancements introduce critical challenges: LLM agents exhibit inherent unpredictability, and uncertainties in their outputs can compound across interactions, threatening system stability. To address these risks, a human-centered design approach with active dynamic moderation is essential. Such an approach enhances traditional passive oversight by facilitating coherent inter-agent communication and effective system governance, allowing MAS to achieve desired outcomes more efficiently.
中文: LLM驱动的多智能体系统发展带来了不可预测性和叠加不确定性,需采用以人为中心的设计和主动动态调节,以确保通信连贯和系统稳定治理。
English: The advancement of LLM-powered Multi-Agent Systems brings unpredictability and compounded uncertainties, necessitating human-centered design with active dynamic moderation to ensure coherent communication and stable governance.

Authors:Rohit Gandikota, Zongze Wu, Richard Zhang, David Bau, Eli Shechtman, Nick Kolkin
Title: SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
Abstract:
We present SliderSpace, a framework for automatically decomposing the visual capabilities of diffusion models into controllable and human-understandable directions. Unlike existing control methods that require a user to specify attributes for each edit direction individually, SliderSpace discovers multiple interpretable and diverse directions simultaneously from a single text prompt. Each direction is trained as a low-rank adaptor, enabling compositional control and the discovery of surprising possibilities in the model's latent space. Through extensive experiments on state-of-the-art diffusion models, we demonstrate SliderSpace's effectiveness across three applications: concept decomposition, artistic style exploration, and diversity enhancement. Our quantitative evaluation shows that SliderSpace-discovered directions decompose the visual structure of model's knowledge effectively, offering insights into the latent capabilities encoded within diffusion models. User studies further validate that our method produces more diverse and useful variations compared to baselines. Our code, data and trained weights are available at https://sliderspace.baulab.info
Chinese: SliderSpace是一个框架,能够从单一文本提示中自动将扩散模型的视觉能力分解为可控且人类可理解的方向,支持概念分解和艺术风格探索等应用。
English: SliderSpace is a framework that automatically breaks down the visual capabilities of diffusion models into controllable, human-understandable directions from a single text prompt, enabling applications like concept decomposition and artistic style exploration.

Authors:Jinwei Hu, Zhenglin Huang, Xiangyu Yin, Wenjie Ruan, Guangliang Cheng, Yi Dong, Xiaowei Huang
Title: FALCON: Fine-grained Activation Manipulation by Contrastive Orthogonal Unalignment for Large Language Model
Abstract:
Large language models have been widely applied, but can inadvertently encode sensitive or harmful information, raising significant safety concerns. Machine unlearning has emerged to alleviate this concern; however, existing training-time unlearning approaches, relying on coarse-grained loss combinations, have limitations in precisely separating knowledge and balancing removal effectiveness with model utility. In contrast, we propose Fine-grained Activation manipuLation by Contrastive Orthogonal uNalignment (FALCON), a novel representation-guided unlearning approach that leverages information-theoretic guidance for efficient parameter selection, employs contrastive mechanisms to enhance representation separation, and projects conflict gradients onto orthogonal subspaces to resolve conflicts between forgetting and retention objectives. Extensive experiments demonstrate that FALCON achieves superior unlearning effectiveness while maintaining model utility, exhibiting robust resistance against knowledge recovery attempts.
中文: 大语言模型可能无意中编码有害信息,现有遗忘方法难以兼顾清除效果与模型效用,因此提出FALCON方法,通过对比机制和正交投影实现精准知识分离,在有效消除敏感信息的同时保持模型性能。
English: Large language models risk encoding harmful data, but current unlearning methods struggle to balance removal and utility, prompting the development of FALCON, which uses contrastive mechanisms and orthogonal projections to effectively erase sensitive information while preserving model performance.

Authors:Ludwig Bothmann, Philip A. Boustani, Jose M. Alvarez, Giuseppe Casalicchio, Bernd Bischl, Susanne Dandl
Title: Privilege Scores
Abstract:
Bias-transforming methods of fairness-aware machine learning aim to correct a non-neutral status quo with respect to a protected attribute (PA). Current methods, however, lack an explicit formulation of what drives non-neutrality. We introduce privilege scores (PS) to measure PA-related privilege by comparing the model predictions in the real world with those in a fair world in which the influence of the PA is removed. At the individual level, PS can identify individuals who qualify for affirmative action; at the global level, PS can inform bias-transforming policies. After presenting estimation methods for PS, we propose privilege score contributions (PSCs), an interpretation method that attributes the origin of privilege to mediating features and direct effects. We provide confidence intervals for both PS and PSCs. Experiments on simulated and real-world data demonstrate the broad applicability of our methods and provide novel insights into gender and racial privilege in mortgage and college admissions applications.
This paper introduces privilege scores to quantify bias related to protected attributes in machine learning, offering both individual-level identification for affirmative action and global insights for policy-making, along with interpretation methods to trace privilege origins and validate their application in real-world scenarios.
English Summary:

Authors:Linhao Luo, Zicheng Zhao, Gholamreza Haffari, Dinh Phung, Chen Gong, Shirui Pan
Title: GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation
Abstract:
Retrieval-augmented generation (RAG) has proven effective in integrating knowledge into large language models (LLMs). However, conventional RAGs struggle to capture complex relationships between pieces of knowledge, limiting their performance in intricate reasoning that requires integrating knowledge from multiple sources. Recently, graph-enhanced retrieval augmented generation (GraphRAG) builds graph structure to explicitly model these relationships, enabling more effective and efficient retrievers. Nevertheless, its performance is still hindered by the noise and incompleteness within the graph structure. To address this, we introduce GFM-RAG, a novel graph foundation model (GFM) for retrieval augmented generation. GFM-RAG is powered by an innovative graph neural network that reasons over graph structure to capture complex query-knowledge relationships. The GFM with 8M parameters undergoes a two-stage training process on large-scale datasets, comprising 60 knowledge graphs with over 14M triples and 700k documents. This results in impressive performance and generalizability for GFM-RAG, making it the first graph foundation model applicable to unseen datasets for retrieval without any fine-tuning required. Extensive experiments on three multi-hop QA datasets and seven domain-specific RAG datasets demonstrate that GFM-RAG achieves state-of-the-art performance while maintaining efficiency and alignment with neural scaling laws, highlighting its potential for further improvement.
中文: GFM-RAG提出了一种新颖的图基础模型,通过图推理有效捕捉复杂查询与知识间的关系,无需微调即可在多个数据集上实现最优性能。
English: GFM-RAG introduces a novel graph foundation model that effectively captures complex query-knowledge relationships through graph reasoning, achieving state-of-the-art performance across multiple datasets without requiring fine-tuning.

Authors:Songhao Wu, Ang Lv, Xiao Feng, Yufei Zhang, Xun Zhang, Guojun Yin, Wei Lin, Rui Yan
Title: PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
Abstract:
The KV cache in large language models is a dominant factor in memory usage, limiting their broader applicability. Quantizing the cache to lower bit widths is an effective way to reduce computational costs; however, previous methods struggle with quantizing key vectors due to outliers, resulting in excessive overhead. We propose a novel quantization approach called PolarQuant, which efficiently addresses the outlier challenge. We observe that outliers typically appear in only one of two dimensions, which are rotated together by a specific angle when rotary position embeddings are applied. When represented as two-dimensional vectors, these dimensions exhibit well-structured patterns, with radii and angles smoothly distributed in polar coordinates. This alleviates the challenge of outliers on per-channel quantization, making them well-suited for quantization. Thus, PolarQuant divides key vectors into groups of two-dimensional sub-vectors, encoding them as the corresponding quantized radius and the polar angle, rather than quantizing original key vectors directly. PolarQuant achieves the superior efficiency in KV cache quantization and accelerates the decoding process by turning the query-key inner product into a table lookup, all while maintaining the downstream performance of full-precision models.
中文: PolarQuant提出了一种新颖的量化方法,通过将关键向量分组为二维子向量并在极坐标中编码,有效解决了KV缓存中的异常值问题,在保持模型性能的同时实现了更高的效率和更快的解码速度。
English: PolarQuant introduces a novel quantization method that addresses outlier challenges in KV cache by grouping key vectors into two-dimensional sub-vectors and encoding them in polar coordinates, achieving superior efficiency and faster decoding while maintaining model performance.

Authors:Maximilian Egger, Rawad Bitar, Antonia Wachter-Zeh, Nir Weinberger, Deniz Gündüz
Title: BICompFL: Stochastic Federated Learning with Bi-Directional Compression
Abstract:
We address the prominent communication bottleneck in federated learning (FL). We specifically consider stochastic FL, in which models or compressed model updates are specified by distributions rather than deterministic parameters. Stochastic FL offers a principled approach to compression, and has been shown to reduce the communication load under perfect downlink transmission from the federator to the clients. However, in practice, both the uplink and downlink communications are constrained. We show that bi-directional compression for stochastic FL has inherent challenges, which we address by introducing BICompFL. Our BICompFL is experimentally shown to reduce the communication cost by an order of magnitude compared to multiple benchmarks, while maintaining state-of-the-art accuracies. Theoretically, we study the communication cost of BICompFL through a new analysis of an importance-sampling based technique, which exposes the interplay between uplink and downlink communication costs.
Chinese Summary: 本研究提出了BICompFL方法,通过实现双向压缩有效解决了随机联邦学习中的通信瓶颈问题,在保持高精度的同时显著降低了通信成本。
English Summary: The study introduces BICompFL, a method that effectively tackles the communication bottleneck in stochastic federated learning by enabling bi-directional compression, significantly reducing communication costs while preserving high accuracy.

Authors:Maximilian Egger, Mayank Bakshi, Rawad Bitar
Title: Byzantine-Resilient Zero-Order Optimization for Communication-Efficient Heterogeneous Federated Learning
Abstract:
We introduce CyBeR-0, a Byzantine-resilient federated zero-order optimization method that is robust under Byzantine attacks and provides significant savings in uplink and downlink communication costs. We introduce transformed robust aggregation to give convergence guarantees for general non-convex objectives under client data heterogeneity. Empirical evaluations for standard learning tasks and fine-tuning large language models show that CyBeR-0 exhibits stable performance with only a few scalars per-round communication cost and reduced memory requirements.
Chinese: CyBeR-0是一种拜占庭鲁棒的联邦零阶优化方法,在攻击下保持稳定性能,同时大幅降低通信成本和内存需求,适用于非凸目标优化。
English: CyBeR-0 is a Byzantine-resilient federated zero-order optimization method that ensures robust performance under attacks while significantly reducing communication costs and memory requirements for non-convex objectives.

Authors:Pat Pataranutaporn, Nattavudh Powdthavee, Chayapatr Achiwaranguprok, Pattie Maes
Title: Can AI Solve the Peer Review Crisis? A Large Scale Cross Model Experiment of LLMs' Performance and Biases in Evaluating over 1000 Economics Papers
Abstract:
This study examines the potential of large language models (LLMs) to augment the academic peer review process by reliably evaluating the quality of economics research without introducing systematic bias. We conduct one of the first large-scale experimental assessments of four LLMs (GPT-4o, Claude 3.5, Gemma 3, and LLaMA 3.3) across two complementary experiments. In the first, we use nonparametric binscatter and linear regression techniques to analyze over 29,000 evaluations of 1,220 anonymized papers drawn from 110 economics journals excluded from the training data of current LLMs, along with a set of AI-generated submissions. The results show that LLMs consistently distinguish between higher- and lower-quality research based solely on textual content, producing quality gradients that closely align with established journal prestige measures. Claude and Gemma perform exceptionally well in capturing these gradients, while GPT excels in detecting AI-generated content. The second experiment comprises 8,910 evaluations designed to assess whether LLMs replicate human like biases in single blind reviews. By systematically varying author gender, institutional affiliation, and academic prominence across 330 papers, we find that GPT, Gemma, and LLaMA assign significantly higher ratings to submissions from top male authors and elite institutions relative to the same papers presented anonymously. These results emphasize the importance of excluding author-identifying information when deploying LLMs in editorial screening. Overall, our findings provide compelling evidence and practical guidance for integrating LLMs into peer review to enhance efficiency, improve accuracy, and promote equity in the publication process of economics research.
本研究显示,大型语言模型能有效评估经济学研究质量并识别AI生成内容,但若未隐去作者信息则可能复制人类评审的偏见。
This study demonstrates that large language models can effectively assess the quality of economics research and detect AI-generated content, but they may replicate human biases if author information is not anonymized.

Authors:Zhuorui Zhao, Ruidi Qiu, Ing-Chao Lin, Grace Li Zhang, Bing Li, Ulf Schlichtmann
Title: VRank: Enhancing Verilog Code Generation from Large Language Models via Self-Consistency
Abstract:
Large Language Models (LLMs) have demonstrated promising capabilities in generating Verilog code from module specifications. To improve the quality of such generated Verilog codes, previous methods require either time-consuming manual inspection or generation of multiple Verilog codes, from which the one with the highest quality is selected with manually designed testbenches. To enhance the generation efficiency while maintaining the quality of the generated codes, we propose VRank, an automatic framework that generates Verilog codes with LLMs. In our framework, multiple code candidates are generated with LLMs by leveraging their probabilistic nature. Afterwards, we group Verilog code candidates into clusters based on identical outputs when tested against the same testbench, which is also generated by LLMs. Clusters are ranked based on the consistency they show on testbench. To determine the best candidate, Chain-of-Thought is further applied to select the best candidate from the top-ranked clusters. By systematically analyzing diverse outputs of generated codes, VRank reduces errors and enhances the overall quality of the generated Verilog code. Experimental results on the VerilogEval-Human benchmark demonstrate a significant 10.5% average increase in functional correctness (passl1) across multiple LLMs, demonstrating VRank's effectiveness in improving the accuracy of automated hardware description language generation for complex design tasks.
Chinese: VRank是一种自动化框架,通过利用大语言模型生成多个Verilog代码候选,根据测试平台输出一致性进行聚类,并采用思维链从最优集群中选取最佳代码,从而将功能正确性平均提升10.5%。
English: VRank is an automated framework that enhances Verilog code generation by using LLMs to produce multiple code candidates, clustering them based on testbench output consistency, and selecting the best candidate via Chain-of-Thought, resulting in a 10.5% average improvement in functional correctness.

Authors:Ziyi Zhang, Zhen Sun, Zongmin Zhang, Jihui Guo, Xinlei He
Title: FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts
Abstract:
Multimodal Large Language Models (MLLMs) have become powerful and widely adopted in some practical applications. However, recent research has revealed their vulnerability to multimodal jailbreak attacks, whereby the model can be induced to generate harmful content, leading to safety risks. Although most MLLMs have undergone safety alignment, recent research shows that the visual modality is still vulnerable to jailbreak attacks. In our work, we discover that by using flowcharts with partially harmful information, MLLMs can be induced to provide additional harmful details. Based on this, we propose a jailbreak attack method based on auto-generated flowcharts, FC-Attack. Specifically, FC-Attack first fine-tunes a pre-trained LLM to create a step-description generator based on benign datasets. The generator is then used to produce step descriptions corresponding to a harmful query, which are transformed into flowcharts in 3 different shapes (vertical, horizontal, and S-shaped) as visual prompts. These flowcharts are then combined with a benign textual prompt to execute the jailbreak attack on MLLMs. Our evaluations on Advbench show that FC-Attack attains an attack success rate of up to 96% via images and up to 78% via videos across multiple MLLMs. Additionally, we investigate factors affecting the attack performance, including the number of steps and the font styles in the flowcharts. We also find that FC-Attack can improve the jailbreak performance from 4% to 28% in Claude-3.5 by changing the font style. To mitigate the attack, we explore several defenses and find that AdaShield can largely reduce the jailbreak performance but with the cost of utility drop.
中文:多模态大语言模型(MLLMs)易受流程图越狱攻击,FC-Attack方法通过自动生成含部分有害信息的流程图实现高攻击成功率,需采用如AdaShield等防御措施,但可能牺牲部分实用性。
English: Multimodal Large Language Models (MLLMs) remain vulnerable to jailbreak attacks, as demonstrated by FC-Attack, which uses auto-generated flowcharts with partially harmful information to achieve high success rates, prompting the need for effective defenses like AdaShield despite potential utility trade-offs.

Authors:Maximilian Rokuss, Yannick Kirchhoff, Seval Akbal, Balint Kovacs, Saikat Roy, Constantin Ulrich, Tassilo Wald, Lukas T. Rotkopf, Heinz-Peter Schlemmer, Klaus Maier-Hein
Title: LesionLocator: Zero-Shot Universal Tumor Segmentation and Tracking in 3D Whole-Body Imaging
Abstract:
In this work, we present LesionLocator, a framework for zero-shot longitudinal lesion tracking and segmentation in 3D medical imaging, establishing the first end-to-end model capable of 4D tracking with dense spatial prompts. Our model leverages an extensive dataset of 23,262 annotated medical scans, as well as synthesized longitudinal data across diverse lesion types. The diversity and scale of our dataset significantly enhances model generalizability to real-world medical imaging challenges and addresses key limitations in longitudinal data availability. LesionLocator outperforms all existing promptable models in lesion segmentation by nearly 10 dice points, reaching human-level performance, and achieves state-of-the-art results in lesion tracking, with superior lesion retrieval and segmentation accuracy. LesionLocator not only sets a new benchmark in universal promptable lesion segmentation and automated longitudinal lesion tracking but also provides the first open-access solution of its kind, releasing our synthetic 4D dataset and model to the community, empowering future advancements in medical imaging. Code is available at: www.github.com/MIC-DKFZ/LesionLocator
中文: LesionLocator 提出了首个端到端的零样本三维医学影像病灶纵向追踪与分割框架,通过开源模型和合成数据集实现了人类水平的性能,并树立了新的行业基准。
English: LesionLocator introduces the first end-to-end framework for zero-shot longitudinal lesion tracking and segmentation in 3D medical imaging, achieving human-level performance and setting new benchmarks with its open-access model and synthetic dataset.

Authors:Yichi Zhang, Bohao Lv, Le Xue, Wenbo Zhang, Yuchen Liu, Yu Fu, Yuan Cheng, Yuan Qi
Title: SemiSAM+: Rethinking Semi-Supervised Medical Image Segmentation in the Era of Foundation Models
Abstract:
Deep learning-based medical image segmentation typically requires large amount of labeled data for training, making it less applicable in clinical settings due to high annotation cost. Semi-supervised learning (SSL) has emerged as an appealing strategy due to its less dependence on acquiring abundant annotations from experts compared to fully supervised methods. Beyond existing model-centric advancements of SSL by designing novel regularization strategies, we anticipate a paradigmatic shift due to the emergence of promptable segmentation foundation models with universal segmentation capabilities using positional prompts represented by Segment Anything Model (SAM). In this paper, we present SemiSAM+, a foundation model-driven SSL framework to efficiently learn from limited labeled data for medical image segmentation. SemiSAM+ consists of one or multiple promptable foundation models as generalist models, and a trainable task-specific segmentation model as specialist model. For a given new segmentation task, the training is based on the specialist-generalist collaborative learning procedure, where the trainable specialist model delivers positional prompts to interact with the frozen generalist models to acquire pseudo-labels, and then the generalist model output provides the specialist model with informative and efficient supervision which benefits the automatic segmentation and prompt generation in turn. Extensive experiments on two public datasets and one in-house clinical dataset demonstrate that SemiSAM+ achieves significant performance improvement, especially under extremely limited annotation scenarios, and shows strong efficiency as a plug-and-play strategy that can be easily adapted to different specialist and generalist models.
中文摘要:SemiSAM+是一种半监督学习框架,通过冻结通用基础模型与可训练专用模型的协作,利用位置提示生成伪标签,在标注数据极少的医疗图像分割任务中实现了显著性能提升。
English Summary: SemiSAM+ is a semi-supervised learning framework that combines frozen foundation models with trainable specialist models to achieve efficient medical image segmentation using minimal labeled data through collaborative prompt generation and pseudo-labeling.

Authors:Heejin Do, Sangwon Ryu, Gary Geunbae Lee
Title: Teach-to-Reason with Scoring: Self-Explainable Rationale-Driven Multi-Trait Essay Scoring
Abstract:
Multi-trait automated essay scoring (AES) systems provide a fine-grained evaluation of an essay's diverse aspects. While they excel in scoring, prior systems fail to explain why specific trait scores are assigned. This lack of transparency leaves instructors and learners unconvinced of the AES outputs, hindering their practical use. To address this, we propose a self-explainable Rationale-Driven Multi-trait automated Essay scoring (RaDME) framework. RaDME leverages the reasoning capabilities of large language models (LLMs) by distilling them into a smaller yet effective scorer. This more manageable student model is optimized to sequentially generate a trait score followed by the corresponding rationale, thereby inherently learning to select a more justifiable score by considering the subsequent rationale during training. Our findings indicate that while LLMs underperform in direct AES tasks, they excel in rationale generation when provided with precise numerical scores. Thus, RaDME integrates the superior reasoning capacities of LLMs into the robust scoring accuracy of an optimized smaller model. Extensive experiments demonstrate that RaDME achieves both accurate and adequate reasoning while supporting high-quality multi-trait scoring, significantly enhancing the transparency of AES.
中文:RaDME框架通过将大型语言模型的推理能力融入小型模型,使其依次生成分数和理由,从而提升了多维度自动作文评分的准确性和透明度。
English: The RaDME framework enhances multi-trait automated essay scoring by integrating large language models' reasoning capabilities into a smaller model that sequentially generates scores and rationales, improving both accuracy and transparency.

Authors:Dexter Ong, Yuezhan Tao, Varun Murali, Igor Spasojevic, Vijay Kumar, Pratik Chaudhari
Title: ATLAS Navigator: Active Task-driven LAnguage-embedded Gaussian Splatting
Abstract:
We address the challenge of task-oriented navigation in unstructured and unknown environments, where robots must incrementally build and reason on rich, metric-semantic maps in real time. Since tasks may require clarification or re-specification, it is necessary for the information in the map to be rich enough to enable generalization across a wide range of tasks. To effectively execute tasks specified in natural language, we propose a hierarchical representation built on language-embedded Gaussian splatting that enables both sparse semantic planning that lends itself to online operation and dense geometric representation for collision-free navigation. We validate the effectiveness of our method through real-world robot experiments conducted in both cluttered indoor and kilometer-scale outdoor environments, with a competitive ratio of about 60% against privileged baselines. Experiment videos and more details can be found on our project page: https://atlasnav.github.io
中文: 本研究提出一种基于语言嵌入高斯溅射的分层表示方法,用于未知环境中面向任务的机器人导航,实现实时语义规划与无碰撞运动,实验验证显示其相对基准方法达到约60%的竞争性效能。
English: This study presents a hierarchical representation using language-embedded Gaussian splatting for task-oriented robot navigation, enabling real-time semantic planning and collision-free movement in unknown environments, with experimental validation showing 60% effectiveness against benchmarks.

Authors:Jana Vatter, Mykhaylo Zayats, Marcos Martínez Galindo, Vanessa López, Ruben Mayer, Hans-Arno Jacobsen, Hoang Thanh Lam
Title: WaveGAS: Waveform Relaxation for Scaling Graph Neural Networks
Abstract:
With the ever-growing size of real-world graphs, numerous techniques to overcome resource limitations when training Graph Neural Networks (GNNs) have been developed. One such approach, GNNAutoScale (GAS), uses graph partitioning to enable training under constrained GPU memory. GAS also stores historical embedding vectors, which are retrieved from one-hop neighbors in other partitions, ensuring critical information is captured across partition boundaries. The historical embeddings which come from the previous training iteration are stale compared to the GAS estimated embeddings, resulting in approximation errors of the training algorithm. Furthermore, these errors accumulate over multiple layers, leading to suboptimal node embeddings. To address this shortcoming, we propose two enhancements: first, WaveGAS, inspired by waveform relaxation, performs multiple forward passes within GAS before the backward pass, refining the approximation of historical embeddings and gradients to improve accuracy; second, a gradient-tracking method that stores and utilizes more accurate historical gradients during training. Empirical results show that WaveGAS enhances GAS and achieves better accuracy, even outperforming methods that train on full graphs, thanks to its robust estimation of node embeddings.
中文摘要:WaveGAS通过引入多轮前向传播和梯度追踪机制改进GNNAutoScale,有效减少因历史嵌入陈旧产生的近似误差,其增强的节点嵌入估计能力甚至超越了全图训练方法的精度表现。
English Summary: WaveGAS enhances GNNAutoScale by introducing multiple forward passes and gradient tracking to reduce approximation errors from stale embeddings, achieving superior accuracy even over full-graph training methods.

Authors:Taiqiang Wu, Chenchen Ding, Wenyong Zhou, Yuxin Cheng, Xincheng Feng, Shuqi Wang, Chufan Shi, Zhengwu Liu, Ngai Wong
Title: HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture
Abstract:
Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method to adapt large language models (LLMs) for downstream tasks. In this paper, we first propose to deploy the LoRA-finetuned LLMs on the hybrid compute-in-memory (CIM) architecture (i.e., pretrained weights onto RRAM and LoRA onto SRAM). To address performance degradation from RRAM's inherent noise, we design a novel Hardware-aware Low-rank Adaption (HaLoRA) method, aiming to train a LoRA branch that is both robust and accurate by aligning the training objectives under both ideal and noisy conditions. Experiments finetuning LLaMA 3.2 1B and 3B demonstrate HaLoRA's effectiveness across multiple reasoning tasks, achieving up to 22.7 improvement in average score while maintaining robustness at various noise levels.
中文: 本文提出了HaLoRA方法,这是一种硬件感知的低秩自适应技术,通过在混合内存计算架构上部署LoRA微调的大语言模型,有效提升了模型在噪声环境下的鲁棒性和准确性,并在多项推理任务中实现了显著性能提升。
English: This paper introduces HaLoRA, a hardware-aware low-rank adaptation method that enhances the robustness and accuracy of LoRA-finetuned large language models deployed on hybrid compute-in-memory architectures, achieving significant performance improvements across reasoning tasks.

Authors:Shaola Ren, Li Ke, Longtao Huang, Dehong Gao, Hui Xue
Title: QExplorer: Large Language Model Based Query Extraction for Toxic Content Exploration
Abstract:
Automatically extracting effective queries is challenging in information retrieval, especially in toxic content exploration, as such content is likely to be disguised. With the recent achievements in generative Large Language Model (LLM), we are able to leverage the capabilities of LLMs to extract effective queries for similar content exploration directly. This study proposes QExplorer, an approach of large language model based Query Extraction for toxic content Exploration. The QExplorer approach involves a 2-stage training process: instruction Supervised FineTuning (SFT) and preference alignment using Direct Preference Optimization (DPO), as well as the datasets construction with feedback of search system. To verify the effectiveness of QExplorer, a series of offline and online experiments are conducted on our real-world system. The offline empirical results demonstrate that the performance of our automatic query extraction outperforms that of several LLMs and humans. The online deployment shows a significant increase in the detection of toxic items.
中文: 本研究提出QExplorer方法,利用大语言模型通过两阶段训练自动生成有效查询来探测隐蔽的有害内容,离线与在线实验均表明该方法在检测效果上优于其他模型及人工操作。
English: This study introduces QExplorer, a method leveraging large language models to automatically generate effective queries for detecting disguised toxic content through a two-stage training process, which has proven superior in performance to other models and human efforts in both offline and online tests.

Authors:Qian Wang, Zhenheng Tang, Bingsheng He
Title: From ChatGPT to DeepSeek: Can LLMs Simulate Humanity?
Abstract:
Simulation powered by Large Language Models (LLMs) has become a promising method for exploring complex human social behaviors. However, the application of LLMs in simulations presents significant challenges, particularly regarding their capacity to accurately replicate the complexities of human behaviors and societal dynamics, as evidenced by recent studies highlighting discrepancies between simulated and real-world interactions. We rethink LLM-based simulations by emphasizing both their limitations and the necessities for advancing LLM simulations. By critically examining these challenges, we aim to offer actionable insights and strategies for enhancing the applicability of LLM simulations in human society in the future.
中文: 基于大语言模型的模拟是研究人类社会行为的有前景方法,但在准确复制现实世界复杂性方面面临挑战,需要通过批判性审视和策略改进来提升其未来适用性。
English: LLM-based simulation is a promising method for studying human social behaviors, yet it faces challenges in accurately replicating real-world complexities, requiring critical examination and strategic improvements for future applicability.

Authors:Yihang Yao, Zhepeng Cen, Miao Li, William Han, Yuyou Zhang, Emerson Liu, Zuxin Liu, Chuang Gan, Ding Zhao
Title: Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training
Abstract:
Large Language Models (LLMs) have demonstrated strong reasoning capabilities across various tasks. However, even minor variations in query phrasing, despite preserving the underlying semantic meaning, can significantly affect their performance. To address this, we focus on enhancing LLMs' awareness of symmetry in query variations and propose syMmetry-ENhanceD (MEND) Data Augmentation, a data-centric approach that improves the model's ability to extract useful information from context. Unlike existing methods that emphasize reasoning chain augmentation, our approach improves model robustness at the knowledge extraction stage through query augmentations, enabling more data-efficient training and stronger generalization to Out-of-Distribution (OOD) settings. Extensive experiments on both logical and arithmetic reasoning tasks show that MEND enhances reasoning performance across diverse query variations, providing new insight into improving LLM robustness through structured dataset curation.
中文: 提出的MEND数据增强方法通过提升大语言模型对查询变体的对称性认知,以结构化数据集优化实现了更强的推理性能和泛化能力。
English: The proposed MEND data augmentation method enhances LLM robustness by improving symmetry awareness in query variations, enabling superior reasoning performance and generalization through structured dataset refinement.

Authors:Namkyeong Lee, Edward De Brouwer, Ehsan Hajiramezanali, Tommaso Biancalani, Chanyoung Park, Gabriele Scalia
Title: RAG-Enhanced Collaborative LLM Agents for Drug Discovery
Abstract:
Recent advances in large language models (LLMs) have shown great potential to accelerate drug discovery. However, the specialized nature of biochemical data often necessitates costly domain-specific fine-tuning, posing critical challenges. First, it hinders the application of more flexible general-purpose LLMs in cutting-edge drug discovery tasks. More importantly, it impedes the rapid integration of the vast amounts of scientific data continuously generated through experiments and research. To investigate these challenges, we propose CLADD, a retrieval-augmented generation (RAG)-empowered agentic system tailored to drug discovery tasks. Through the collaboration of multiple LLM agents, CLADD dynamically retrieves information from biomedical knowledge bases, contextualizes query molecules, and integrates relevant evidence to generate responses -- all without the need for domain-specific fine-tuning. Crucially, we tackle key obstacles in applying RAG workflows to biochemical data, including data heterogeneity, ambiguity, and multi-source integration. We demonstrate the flexibility and effectiveness of this framework across a variety of drug discovery tasks, showing that it outperforms general-purpose and domain-specific LLMs as well as traditional deep learning approaches.
中文:提出的CLADD系统采用多智能体检索增强生成框架,无需领域微调即可应对生化数据挑战,在药物发现任务中相比现有方法展现出更优性能。
English: The proposed CLADD system employs a multi-agent retrieval-augmented generation framework to overcome biochemical data challenges without domain-specific fine-tuning, demonstrating superior performance across drug discovery tasks compared to existing methods.

Authors:Kianoosh Kazemi, Iman Azimi, Michelle Khine, Rami N. Khayat, Amir M. Rahmani, Pasi Liljeberg
Title: Multimodal Sleep Stage and Sleep Apnea Classification Using Vision Transformer: A Multitask Explainable Learning Approach
Abstract:
Sleep is an essential component of human physiology, contributing significantly to overall health and quality of life. Accurate sleep staging and disorder detection are crucial for assessing sleep quality. Studies in the literature have proposed PSG-based approaches and machine-learning methods utilizing single-modality signals. However, existing methods often lack multimodal, multilabel frameworks and address sleep stages and disorders classification separately. In this paper, we propose a 1D-Vision Transformer for simultaneous classification of sleep stages and sleep disorders. Our method exploits the sleep disorders' correlation with specific sleep stage patterns and performs a simultaneous identification of a sleep stage and sleep disorder. The model is trained and tested using multimodal-multilabel sensory data (including photoplethysmogram, respiratory flow, and respiratory effort signals). The proposed method shows an overall accuracy (cohen's Kappa) of 78% (0.66) for five-stage sleep classification and 74% (0.58) for sleep apnea classification. Moreover, we analyzed the encoder attention weights to clarify our models' predictions and investigate the influence different features have on the models' outputs. The result shows that identified patterns, such as respiratory troughs and peaks, make a higher contribution to the final classification process.
中文: 本文提出一种一维视觉变换器模型,利用多模态数据同步分类睡眠阶段与睡眠障碍,在睡眠分期和呼吸暂停检测中分别达到78%和74%准确率,并通过注意力机制分析揭示了呼吸波峰谷等关键特征的重要贡献。
English: This paper introduces a 1D-Vision Transformer model that simultaneously classifies sleep stages and sleep disorders using multimodal data, achieving 78% accuracy for sleep staging and 74% for apnea detection while revealing key respiratory features' impact through attention analysis.

Authors:Yen-Ju Lu, Ting-Yao Hu, Hema Swetha Koppula, Hadi Pouransari, Jen-Hao Rick Chang, Yin Xia, Xiang Kong, Qi Zhu, Simon Wang, Oncel Tuzel, Raviteja Vemulapalli
Title: Mutual Reinforcement of LLM Dialogue Synthesis and Summarization Capabilities for Few-Shot Dialogue Summarization
Abstract:
In this work, we propose Mutual Reinforcing Data Synthesis (MRDS) within LLMs to improve few-shot dialogue summarization task. Unlike prior methods that require external knowledge, we mutually reinforce the LLMś dialogue synthesis and summarization capabilities, allowing them to complement each other during training and enhance overall performances. The dialogue synthesis capability is enhanced by directed preference optimization with preference scoring from summarization capability. The summarization capability is enhanced by the additional high quality dialogue-summary paired data produced by the dialogue synthesis capability. By leveraging the proposed MRDS mechanism, we elicit the internal knowledge of LLM in the format of synthetic data, and use it to augment the few-shot real training dataset. Empirical results demonstrate that our method improves dialogue summarization, achieving a 1.5% increase in ROUGE scores and a 0.3% improvement in BERT scores in few-shot settings. Furthermore, our method attains the highest average scores in human evaluations, surpassing both the pre-trained models and the baselines fine-tuned solely for summarization tasks.
Chinese: 本研究提出互增强数据合成(MRDS)方法,通过强化大语言模型的对话合成与摘要能力相互促进,无需外部知识即可提升少样本对话摘要性能,在ROUGE和BERT评分及人工评估中均取得最优结果。
English: This study introduces Mutual Reinforcing Data Synthesis (MRDS), a method that enhances LLMs' few-shot dialogue summarization by mutually reinforcing dialogue synthesis and summarization capabilities, eliminating the need for external knowledge and improving performance metrics like ROUGE and BERT scores.

Authors:An-Lan Wang, Nuo Chen, Kun-Yu Lin, Li Yuan-Ming, Wei-Shi Zheng
Title: Task-Oriented 6-DoF Grasp Pose Detection in Clutters
Abstract:
In general, humans would grasp an object differently for different tasks, e.g., "grasping the handle of a knife to cut" vs. "grasping the blade to hand over". In the field of robotic grasp pose detection research, some existing works consider this task-oriented grasping and made some progress, but they are generally constrained by low-DoF gripper type or non-cluttered setting, which is not applicable for human assistance in real life. With an aim to get more general and practical grasp models, in this paper, we investigate the problem named Task-Oriented 6-DoF Grasp Pose Detection in Clutters (TO6DGC), which extends the task-oriented problem to a more general 6-DOF Grasp Pose Detection in Cluttered (multi-object) scenario. To this end, we construct a large-scale 6-DoF task-oriented grasping dataset, 6-DoF Task Grasp (6DTG), which features 4391 cluttered scenes with over 2 million 6-DoF grasp poses. Each grasp is annotated with a specific task, involving 6 tasks and 198 objects in total. Moreover, we propose One-Stage TaskGrasp (OSTG), a strong baseline to address the TO6DGC problem. Our OSTG adopts a task-oriented point selection strategy to detect where to grasp, and a task-oriented grasp generation module to decide how to grasp given a specific task. To evaluate the effectiveness of OSTG, extensive experiments are conducted on 6DTG. The results show that our method outperforms various baselines on multiple metrics. Real robot experiments also verify that our OSTG has a better perception of the task-oriented grasp points and 6-DoF grasp poses.
Chinese: 本文针对杂乱环境中的任务导向六自由度抓取位姿检测问题,提出OSTG模型并构建6DTG数据集,通过大量实验验证了该方法在多种指标上优于现有基线,有效提升了机器人对任务导向抓取点的感知能力。
English: This paper introduces a novel approach for task-oriented 6-DoF grasp pose detection in cluttered environments, proposing the OSTG model and 6DTG dataset to overcome limitations of prior methods and demonstrating superior performance through extensive experiments.

Authors:Ruochen Liu, Hao Chen, Yuanchen Bei, Zheyu Zhou, Lijia Chen, Qijie Shen, Feiran Huang, Fakhri Karray, Senzhang Wang
Title: FilterLLM: Text-To-Distribution LLM for Billion-Scale Cold-Start Recommendation
Abstract:
Large Language Model (LLM)-based cold-start recommendation systems continue to face significant computational challenges in billion-scale scenarios, as they follow a "Text-to-Judgment" paradigm. This approach processes user-item content pairs as input and evaluates each pair iteratively. To maintain efficiency, existing methods rely on pre-filtering a small candidate pool of user-item pairs. However, this severely limits the inferential capabilities of LLMs by reducing their scope to only a few hundred pre-filtered candidates. To overcome this limitation, we propose a novel "Text-to-Distribution" paradigm, which predicts an item's interaction probability distribution for the entire user set in a single inference. Specifically, we present FilterLLM, a framework that extends the next-word prediction capabilities of LLMs to billion-scale filtering tasks. FilterLLM first introduces a tailored distribution prediction and cold-start framework. Next, FilterLLM incorporates an efficient user-vocabulary structure to train and store the embeddings of billion-scale users. Finally, we detail the training objectives for both distribution prediction and user-vocabulary construction. The proposed framework has been deployed on the Alibaba platform, where it has been serving cold-start recommendations for two months, processing over one billion cold items. Extensive experiments demonstrate that FilterLLM significantly outperforms state-of-the-art methods in cold-start recommendation tasks, achieving over 30 times higher efficiency. Furthermore, an online A/B test validates its effectiveness in billion-scale recommendation systems.
Chinese: 针对大规模场景下基于大语言模型的冷启动推荐系统计算效率低的问题,本研究提出FilterLLM框架,通过"文本到分布"新范式单次推理即可预测项目在全用户集的交互概率分布,实际部署中实现效率提升30倍以上且性能显著优于现有方法。
English: To address the computational inefficiency of LLM-based cold-start recommendation systems in billion-scale scenarios, this study introduces FilterLLM, a novel "Text-to-Distribution" framework that predicts item interaction probability distributions across all users in a single inference, achieving over 30 times higher efficiency and superior performance in real-world deployment.

Authors:Yaozu Wu, Dongyuan Li, Yankai Chen, Renhe Jiang, Henry Peng Zou, Liancheng Fang, Zhen Wang, Philip S. Yu
Title: Multi-Agent Autonomous Driving Systems with Large Language Models: A Survey of Recent Advances
Abstract:
Autonomous Driving Systems (ADSs) are revolutionizing transportation by reducing human intervention, improving operational efficiency, and enhancing safety. Large Language Models (LLMs), known for their exceptional planning and reasoning capabilities, have been integrated into ADSs to assist with driving decision-making. However, LLM-based single-agent ADSs face three major challenges: limited perception, insufficient collaboration, and high computational demands. To address these issues, recent advancements in LLM-based multi-agent ADSs have focused on improving inter-agent communication and cooperation. This paper provides a frontier survey of LLM-based multi-agent ADSs. We begin with a background introduction to related concepts, followed by a categorization of existing LLM-based approaches based on different agent interaction modes. We then discuss agent-human interactions in scenarios where LLM-based agents engage with humans. Finally, we summarize key applications, datasets, and challenges in this field to support future research (https://anonymous.4open.science/r/LLM-based_Multi-agent_ADS-3A5C/README.md).
中文摘要:本文综述了基于大语言模型的多智能体自动驾驶系统,通过改进智能体间协作来克服单智能体系统的感知局限与计算挑战,并探讨了人机交互模式及应用前景。
English Summary: Large Language Models are being integrated into multi-agent autonomous driving systems to overcome single-agent limitations through enhanced collaboration, as surveyed in this paper covering interaction modes and human-agent scenarios.

Authors:Laurin Lux, Alexander H. Berger, Maria Romeo Tricas, Alaa E. Fayed, Sobha Sivaprasada, Linus Kreitner, Jonas Weidner, Martin J. Menten, Daniel Rueckert, Johannes C. Paetzold
Title: Interpretable Retinal Disease Prediction Using Biology-Informed Heterogeneous Graph Representations
Abstract:
Interpretability is crucial to enhance trust in machine learning models for medical diagnostics. However, most state-of-the-art image classifiers based on neural networks are not interpretable. As a result, clinicians often resort to known biomarkers for diagnosis, although biomarker-based classification typically performs worse than large neural networks. This work proposes a method that surpasses the performance of established machine learning models while simultaneously improving prediction interpretability for diabetic retinopathy staging from optical coherence tomography angiography (OCTA) images. Our method is based on a novel biology-informed heterogeneous graph representation that models retinal vessel segments, intercapillary areas, and the foveal avascular zone (FAZ) in a human-interpretable way. This graph representation allows us to frame diabetic retinopathy staging as a graph-level classification task, which we solve using an efficient graph neural network. We benchmark our method against well-established baselines, including classical biomarker-based classifiers, convolutional neural networks (CNNs), and vision transformers. Our model outperforms all baselines on two datasets. Crucially, we use our biology-informed graph to provide explanations of unprecedented detail. Our approach surpasses existing methods in precisely localizing and identifying critical vessels or intercapillary areas. In addition, we give informative and human-interpretable attributions to critical characteristics. Our work contributes to the development of clinical decision-support tools in ophthalmology.
中文: 本研究提出一种基于生物学知识的图神经网络方法,在OCTA图像糖尿病视网膜病变分期任务中性能优于现有技术,同时能够对关键视网膜特征提供详细且可解释的分析。
English: This study introduces a biology-informed graph neural network that outperforms existing methods in diabetic retinopathy staging from OCTA images while providing detailed, interpretable explanations of critical retinal features.

Authors:Hao-Shu Fang, Hengxu Yan, Zhenyu Tang, Hongjie Fang, Chenxi Wang, Cewu Lu
Title: AnyDexGrasp: General Dexterous Grasping for Different Hands with Human-level Learning Efficiency
Abstract:
We introduce an efficient approach for learning dexterous grasping with minimal data, advancing robotic manipulation capabilities across different robotic hands. Unlike traditional methods that require millions of grasp labels for each robotic hand, our method achieves high performance with human-level learning efficiency: only hundreds of grasp attempts on 40 training objects. The approach separates the grasping process into two stages: first, a universal model maps scene geometry to intermediate contact-centric grasp representations, independent of specific robotic hands. Next, a unique grasp decision model is trained for each robotic hand through real-world trial and error, translating these representations into final grasp poses. Our results show a grasp success rate of 75-95\% across three different robotic hands in real-world cluttered environments with over 150 novel objects, improving to 80-98\% with increased training objects. This adaptable method demonstrates promising applications for humanoid robots, prosthetics, and other domains requiring robust, versatile robotic manipulation.
中文: 本研究提出了一种数据高效的两阶段灵巧抓取方法,先通过通用接触表征建模,再针对不同机械手训练专属决策模型,仅需少量训练数据即可在多种机械手上实现75-98%的抓取成功率。
English: This study presents a data-efficient method for dexterous grasping that uses a two-stage process—first creating universal contact representations and then training hand-specific models—achieving 75-98% success rates across various robotic hands with minimal training data.

Authors:Thomas Debelle, Fahad Sohrab, Pekka Abrahamsson, Moncef Gabbouj
Title: Anomaly Detection in Smart Power Grids with Graph-Regularized MS-SVDD: a Multimodal Subspace Learning Approach
Abstract:
In this paper, we address an anomaly detection problem in smart power grids using Multimodal Subspace Support Vector Data Description (MS-SVDD). This approach aims to leverage better feature relations by considering the data as coming from different modalities. These data are projected into a shared lower-dimensionality subspace which aims to preserve their inner characteristics. To supplement the previous work on this subject, we introduce novel multimodal graph-embedded regularizers that leverage graph information for every modality to enhance the training process, and we consider an improved training equation that allows us to maximize or minimize each modality according to the specified criteria. We apply this regularized graph-embedded model on a 3-modalities dataset after having generalized MS-SVDD algorithms to any number of modalities. To set up our application, we propose a whole preprocessing procedure to extract One-Class Classification training instances from time-bounded event time series that are used to evaluate both the reliability and earliness of our model for Event Detection.
中文: 本文提出一种新型多模态图嵌入正则化方法,通过将多模态数据映射至共享子空间来改进智能电网异常检测,并采用事件检测的可靠性与及时性指标评估模型性能。
English: This paper introduces a novel multimodal graph-embedded regularizer to enhance anomaly detection in smart power grids using MS-SVDD, projecting multimodal data into a shared subspace and evaluating model performance through event detection reliability and earliness metrics.

Authors:Yilun Xu, Weili Nie, Arash Vahdat
Title: One-step Diffusion Models with $f$-Divergence Distribution Matching
Abstract:
Sampling from diffusion models involves a slow iterative process that hinders their practical deployment, especially for interactive applications. To accelerate generation speed, recent approaches distill a multi-step diffusion model into a single-step student generator via variational score distillation, which matches the distribution of samples generated by the student to the teacher's distribution. However, these approaches use the reverse Kullback-Leibler (KL) divergence for distribution matching which is known to be mode seeking. In this paper, we generalize the distribution matching approach using a novel $f$-divergence minimization framework, termed $f$-distill, that covers different divergences with different trade-offs in terms of mode coverage and training variance. We derive the gradient of the $f$-divergence between the teacher and student distributions and show that it is expressed as the product of their score differences and a weighting function determined by their density ratio. This weighting function naturally emphasizes samples with higher density in the teacher distribution, when using a less mode-seeking divergence. We observe that the popular variational score distillation approach using the reverse-KL divergence is a special case within our framework. Empirically, we demonstrate that alternative $f$-divergences, such as forward-KL and Jensen-Shannon divergences, outperform the current best variational score distillation methods across image generation tasks. In particular, when using Jensen-Shannon divergence, $f$-distill achieves current state-of-the-art one-step generation performance on ImageNet64 and zero-shot text-to-image generation on MS-COCO. Project page: https://research.nvidia.com/labs/genair/f-distill
中文摘要:本文提出的$f$-distill框架通过$f$-散度最小化改进了扩散模型蒸馏中的分布匹配方法,在使用Jensen-Shannon等散度时实现了超越现有最佳方法的单步生成性能。
English summary: This paper introduces $f$-distill, a framework that generalizes distribution matching in diffusion model distillation through $f$-divergence minimization, achieving superior one-step generation performance compared to previous methods using alternative divergences like Jensen-Shannon.

Authors:Tim Rädsch, Leon Mayer, Simon Pavicic, A. Emre Kavur, Marcel Knopp, Barış Öztürk, Klaus Maier-Hein, Paul F. Jaeger, Fabian Isensee, Annika Reinke, Lena Maier-Hein
Title: Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation
Abstract:
Reliable evaluation of AI models is critical for scientific progress and practical application. While existing VLM benchmarks provide general insights into model capabilities, their heterogeneous designs and limited focus on a few imaging domains pose significant challenges for both cross-domain performance comparison and targeted domain-specific evaluation. To address this, we propose three key contributions: (1) a framework for the resource-efficient creation of domain-specific VLM benchmarks enabled by task augmentation for creating multiple diverse tasks from a single existing task, (2) the release of new VLM benchmarks for seven domains, created according to the same homogeneous protocol and including 162,946 thoroughly human-validated answers, and (3) an extensive benchmarking of 22 state-of-the-art VLMs on a total of 37,171 tasks, revealing performance variances across domains and tasks, thereby supporting the need for tailored VLM benchmarks. Adoption of our methodology will pave the way for the resource-efficient domain-specific selection of models and guide future research efforts toward addressing core open questions.
Chinese: 本文提出一种通过任务增强高效构建领域特定VLM基准的框架,发布了七个包含人工验证数据的新基准,并在22个模型上验证了跨领域性能差异,强调定制化评估对模型选择与科研指导的重要性。
English: This paper introduces a framework for efficiently creating domain-specific VLM benchmarks through task augmentation, releases seven new benchmarks with human-validated data, and demonstrates performance variances across 22 models, advocating for tailored evaluations to guide model selection and research.

Authors:Anil Ramakrishna, Yixin Wan, Xiaomeng Jin, Kai-Wei Chang, Zhiqi Bu, Bhanukiran Vinzamuri, Volkan Cevher, Mingyi Hong, Rahul Gupta
Title: LUME: LLM Unlearning with Multitask Evaluations
Abstract:
Unlearning aims to remove copyrighted, sensitive, or private content from large language models (LLMs) without a full retraining. In this work, we develop a multi-task unlearning benchmark (LUME) which features three tasks: (1) unlearn synthetically generated creative short novels, (2) unlearn synthetic biographies with sensitive information, and (3) unlearn a collection of public biographies. We further release two fine-tuned LLMs of 1B and 7B parameter sizes as the target models. We conduct detailed evaluations of several recently proposed unlearning algorithms and present results on carefully crafted metrics to understand their behavior and limitations.
中文: 本研究提出了LUME多任务遗忘基准,旨在从大语言模型中移除受版权保护、敏感或私人内容,并在10亿和70亿参数模型上通过定制指标评估了多种遗忘算法的表现与局限。
English: This study introduces LUME, a multi-task benchmark for unlearning copyrighted, sensitive, and private content from large language models, and evaluates various unlearning algorithms using custom metrics on 1B and 7B parameter models.

Authors:Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng Zhang
Title: iAgent: LLM Agent as a Shield between User and Recommender Systems
Abstract:
Traditional recommender systems usually take the user-platform paradigm, where users are directly exposed under the control of the platform's recommendation algorithms. However, the defect of recommendation algorithms may put users in very vulnerable positions under this paradigm. First, many sophisticated models are often designed with commercial objectives in mind, focusing on the platform's benefits, which may hinder their ability to protect and capture users' true interests. Second, these models are typically optimized using data from all users, which may overlook individual user's preferences. Due to these shortcomings, users may experience several disadvantages under the traditional user-platform direct exposure paradigm, such as lack of control over the recommender system, potential manipulation by the platform, echo chamber effects, or lack of personalization for less active users due to the dominance of active users during collaborative learning. Therefore, there is an urgent need to develop a new paradigm to protect user interests and alleviate these issues. Recently, some researchers have introduced LLM agents to simulate user behaviors, these approaches primarily aim to optimize platform-side performance, leaving core issues in recommender systems unresolved. To address these limitations, we propose a new user-agent-platform paradigm, where agent serves as the protective shield between user and recommender system that enables indirect exposure.
中文: 传统推荐系统常以平台利益为先,忽视用户真实需求,导致用户缺乏控制权和陷入信息茧房等问题,因此提出用户-代理-平台的新范式,通过间接接触保护用户权益。
English: Traditional recommender systems often prioritize platform benefits over user interests, leading to vulnerabilities like lack of user control and echo chambers, prompting the proposal of a new user-agent-platform paradigm for indirect exposure protection.

Authors:Vilém Zouhar, Maike Züfle, Beni Egressy, Julius Cheng, Mrinmaya Sachan, Jan Niehues
Title: Early-Exit and Instant Confidence Translation Quality Estimation
Abstract:
Quality estimation is omnipresent in machine translation, for both evaluation and generation. Unfortunately, quality estimation models are often opaque and computationally expensive, making them impractical to be part of large-scale pipelines. In this work, we tackle two connected challenges: (1) reducing the cost of quality estimation at scale, and (2) developing an inexpensive uncertainty estimation method for quality estimation. To address the latter, we introduce Instant Confidence COMET, an uncertainty-aware quality estimation model that matches the performance of previous approaches at a fraction of their costs. We extend this to Early-Exit COMET, a quality estimation model that can compute quality scores and associated confidences already at early model layers, allowing us to early-exit computations and reduce evaluation costs. We also apply our model to machine translation reranking. We combine Early-Exit COMET with an upper confidence bound bandit algorithm to find the best candidate from a large pool without having to run the full evaluation model on all candidates. In both cases (evaluation and reranking) our methods reduce the required compute by 50% with very little degradation in performance. Finally, we show how Instant Confidence COMET can be used to decide which translations a human evaluator should score rather than relying on the COMET score.
中文摘要:本研究提出Instant Confidence COMET和Early-Exit COMET模型,通过早期退出评估和不确定性感知评分,将机器翻译质量评估的计算需求减少50%,同时保持近乎无损的性能表现。
English Summary: This study introduces Instant Confidence COMET and Early-Exit COMET to reduce computational costs in machine translation quality estimation by enabling early-exit evaluations and uncertainty-aware scoring, cutting required compute by 50% with minimal performance loss.

Authors:Yichi Zhang, Le Xue, Wenbo Zhang, Lanlan Li, Yuchen Liu, Chen Jiang, Yuan Cheng, Yuan Qi
Title: SegAnyPET: Universal Promptable Segmentation from Positron Emission Tomography Images
Abstract:
Positron Emission Tomography (PET) is a powerful molecular imaging tool that plays a crucial role in modern medical diagnostics by visualizing radio-tracer distribution to reveal physiological processes. Accurate organ segmentation from PET images is essential for comprehensive multi-systemic analysis of interactions between different organs and pathologies. Existing segmentation methods are limited by insufficient annotation data and varying levels of annotation, resulting in weak generalization ability and difficulty in clinical application. Recent developments in segmentation foundation models have shown superior versatility across diverse segmentation tasks. Despite the efforts of medical adaptations, these works primarily focus on structural medical images with detailed physiological structural information and exhibit limited generalization performance on molecular PET imaging. In this paper, we collect and construct PETS-5k, the largest PET segmentation dataset to date, comprising 5,731 three-dimensional whole-body PET images and encompassing over 1.3M 2D images. Based on the established dataset, we develop SegAnyPET, a modality-specific 3D foundation model for universal promptable segmentation from PET images. To issue the challenge of discrepant annotation quality, we adopt a cross prompting confident learning (CPCL) strategy with an uncertainty-guided self-rectification process to robustly learn segmentation from high-quality labeled data and low-quality noisy labeled data for promptable segmentation. Experimental results demonstrate that SegAnyPET can segment seen and unseen target organs using only one or a few prompt points, outperforming state-of-the-art foundation models and task-specific fully supervised models with higher accuracy and strong generalization ability for universal segmentation.
Chinese: 正电子发射断层扫描(PET)在现代医学诊断中至关重要,但现有分割方法因标注数据不足而泛化能力弱;本研究基于构建的PETS-5k数据集开发了SegAnyPET模型,通过交叉提示置信学习策略实现高精度、强泛化能力的通用器官分割。
English: Positron Emission Tomography (PET) is essential for medical diagnostics but faces segmentation challenges due to limited annotations; this study introduces SegAnyPET, a 3D foundation model trained on the PETS-5k dataset, which achieves superior accuracy and generalization in organ segmentation using minimal prompts.

Authors:Shijin Duan, Yejia Liu, Gaowen Liu, Ramana Rao Kompella, Shaolei Ren, Xiaolin Xu
Title: Towards Vector Optimization on Low-Dimensional Vector Symbolic Architecture
Abstract:
Vector Symbolic Architecture (VSA) is emerging in machine learning due to its efficiency, but they are hindered by issues of hyperdimensionality and accuracy. As a promising mitigation, the Low-Dimensional Computing (LDC) method significantly reduces the vector dimension by ~100 times while maintaining accuracy, by employing a gradient-based optimization. Despite its potential, LDC optimization for VSA is still underexplored. Our investigation into vector updates underscores the importance of stable, adaptive dynamics in LDC training. We also reveal the overlooked yet critical roles of batch normalization (BN) and knowledge distillation (KD) in standard approaches. Besides the accuracy boost, BN does not add computational overhead during inference, and KD significantly enhances inference confidence. Through extensive experiments and ablation studies across multiple benchmarks, we provide a thorough evaluation of our approach and extend the interpretability of binary neural network optimization similar to LDC, previously unaddressed in BNN literature.
Chinese: 低维计算方法通过基于梯度的优化将向量维度降低约100倍且保持准确性,其中批归一化和知识蒸馏在提升推理效率和置信度方面发挥关键作用,且不增加额外计算负担。
English: The Low-Dimensional Computing method effectively reduces vector dimensions by ~100 times while preserving accuracy through gradient-based optimization, with batch normalization and knowledge distillation playing crucial roles in enhancing inference efficiency and confidence without added computational cost.

Authors:Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein
Title: d-Sketch: Improving Visual Fidelity of Sketch-to-Image Translation with Pretrained Latent Diffusion Models without Retraining
Abstract:
Structural guidance in an image-to-image translation allows intricate control over the shapes of synthesized images. Generating high-quality realistic images from user-specified rough hand-drawn sketches is one such task that aims to impose a structural constraint on the conditional generation process. While the premise is intriguing for numerous use cases of content creation and academic research, the problem becomes fundamentally challenging due to substantial ambiguities in freehand sketches. Furthermore, balancing the trade-off between shape consistency and realistic generation contributes to additional complexity in the process. Existing approaches based on Generative Adversarial Networks (GANs) generally utilize conditional GANs or GAN inversions, often requiring application-specific data and optimization objectives. The recent introduction of Denoising Diffusion Probabilistic Models (DDPMs) achieves a generational leap for low-level visual attributes in general image synthesis. However, directly retraining a large-scale diffusion model on a domain-specific subtask is often extremely difficult due to demanding computation costs and insufficient data. In this paper, we introduce a technique for sketch-to-image translation by exploiting the feature generalization capabilities of a large-scale diffusion model without retraining. In particular, we use a learnable lightweight mapping network to achieve latent feature translation from source to target domain. Experimental results demonstrate that the proposed method outperforms the existing techniques in qualitative and quantitative benchmarks, allowing high-resolution realistic image synthesis from rough hand-drawn sketches.
中文: 本文提出一种新颖的草图到图像转换技术,通过轻量级映射网络利用预训练的大规模扩散模型,无需重新训练即可从粗略手绘草图中生成优质高分辨率真实图像。
English: This paper introduces a novel sketch-to-image translation technique that leverages a pre-trained large-scale diffusion model with a lightweight mapping network, achieving superior high-resolution realistic image generation from rough sketches without requiring retraining.

Authors:Yuliang Liu, Junjie Lu, Zhaoling Chen, Chaofeng Qu, Jason Klein Liu, Chonghan Liu, Zefan Cai, Yunhui Xia, Li Zhao, Jiang Bian, Chuheng Zhang, Wei Shen, Zhouhan Lin
Title: AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence
Abstract:
Current approaches for training Process Reward Models (PRMs) often involve breaking down responses into multiple reasoning steps using rule-based techniques, such as using predefined placeholder tokens or setting the reasoning step's length into a fixed size. These approaches overlook the fact that specific words do not typically mark true decision points in a text. To address this, we propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. This division method provides more decision-making information at each step, enhancing downstream tasks, such as reward model learning. Moreover, our method does not require manual annotation. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation tasks. Experimental results indicate that the outcome PRM achieves state-of-the-art Best-of-N performance, surpassing greedy search strategy with token-level value-guided decoding, while also reducing construction costs by over 30% compared to existing open-source PRMs. In addition, we provide a thorough analysis and case study on the PRM's performance, transferability, and generalization capabilities.
中文摘要:AdaptiveStep提出了一种基于模型预测置信度动态划分推理步骤的新方法,用于训练过程奖励模型,在数学推理和代码生成任务中实现了最佳性能,同时无需人工标注且构建成本降低超过30%。
English Summary: AdaptiveStep introduces a novel method for training Process Reward Models by dynamically segmenting reasoning steps based on the model's confidence in predicting subsequent words, achieving state-of-the-art performance in mathematical reasoning and code generation while reducing construction costs by over 30% without manual annotation.

Authors:Marco Arazzi, Mert Cihangiroglu, Serena Nicolazzo, Antonino Nocera
Title: Secure Federated Data Distillation
Abstract:
Dataset Distillation (DD) is a powerful technique for reducing large datasets into compact, representative synthetic datasets, accelerating Machine Learning training. However, traditional DD methods operate in a centralized manner, which poses significant privacy threats and reduces its applicability. To mitigate these risks, we propose a Secure Federated Data Distillation (SFDD) framework to decentralize the distillation process while preserving privacy. Unlike existing Federated Distillation techniques that focus on training global models with distilled knowledge, our approach aims to produce a distilled dataset without exposing local contributions. We leverage the gradient-matching-based distillation method, adapting it for a distributed setting where clients contribute to the distillation process without sharing raw data. The central aggregator iteratively refines a synthetic dataset by integrating client-side updates while ensuring data confidentiality. To make our approach resilient to inference attacks perpetrated by the server that could exploit gradient updates to reconstruct private data, we create an optimized Local Differential Privacy approach, called LDPO-RLD. Furthermore, we assess the framework's resilience against malicious clients executing backdoor attacks (such as Doorping) and demonstrate robustness under the assumption of a sufficient number of participating clients. Our experimental results demonstrate the effectiveness of SFDD and that the proposed defense concretely mitigates the identified vulnerabilities, with minimal impact on the performance of the distilled dataset. By addressing the interplay between privacy and federation in dataset distillation, this work advances the field of privacy-preserving Machine Learning making our SFDD framework a viable solution for sensitive data-sharing applications.
Chinese: 安全联邦数据蒸馏(SFDD)框架通过梯度匹配和局部差分隐私技术,在分散式环境中实现数据集蒸馏以保护隐私,有效抵御推理和后门攻击等风险,同时保持数据集性能。
English: The Secure Federated Data Distillation (SFDD) framework decentralizes dataset distillation to preserve privacy by using gradient-matching and Local Differential Privacy, effectively mitigating risks like inference and backdoor attacks while maintaining dataset performance.

Authors:Omid Nejati Manzari, Hojat Asgariandehkordi, Taha Koleilat, Yiming Xiao, Hassan Rivaz
Title: MedViT V2: Medical Image Classification with KAN-Integrated Transformers and Dilated Neighborhood Attention
Abstract:
Convolutional networks, transformers, hybrid models, and Mamba-based architectures have demonstrated strong performance across various medical image classification tasks. However, these methods were primarily designed to classify clean images using labeled data. In contrast, real-world clinical data often involve image corruptions that are unique to multi-center studies and stem from variations in imaging equipment across manufacturers. In this paper, we introduce the Medical Vision Transformer (MedViTV2), a novel architecture incorporating Kolmogorov-Arnold Network (KAN) layers into the transformer architecture for the first time, aiming for generalized medical image classification. We have developed an efficient KAN block to reduce computational load while enhancing the accuracy of the original MedViT. Additionally, to counteract the fragility of our MedViT when scaled up, we propose an enhanced Dilated Neighborhood Attention (DiNA), an adaptation of the efficient fused dot-product attention kernel capable of capturing global context and expanding receptive fields to scale the model effectively and addressing feature collapse issues. Moreover, a hierarchical hybrid strategy is introduced to stack our Local Feature Perception and Global Feature Perception blocks in an efficient manner, which balances local and global feature perceptions to boost performance. Extensive experiments on 17 medical image classification datasets and 12 corrupted medical image datasets demonstrate that MedViTV2 achieved state-of-the-art results in 27 out of 29 experiments with reduced computational complexity. MedViTV2 is 44\% more computationally efficient than the previous version and significantly enhances accuracy, achieving improvements of 4.6\% on MedMNIST, 5.8\% on NonMNIST, and 13.4\% on the MedMNIST-C benchmark.
中文: MedViTV2通过引入Kolmogorov-Arnold网络层和增强注意力机制的新型Transformer架构,在医学图像分类任务中实现了最优性能,同时显著提升了计算效率和对图像损坏的鲁棒性。
English: MedViTV2 introduces a novel transformer architecture with Kolmogorov-Arnold Network layers and enhanced attention mechanisms, achieving state-of-the-art performance on medical image classification tasks while significantly improving computational efficiency and robustness against image corruptions.

Authors:Omid Nejati Manzari, Hojat Asgariandehkordi, Taha Koleilat, Yiming Xiao, Hassan Rivaz
Title: Medical Image Classification with KAN-Integrated Transformers and Dilated Neighborhood Attention
Abstract:
Convolutional networks, transformers, hybrid models, and Mamba-based architectures have demonstrated strong performance across various medical image classification tasks. However, these methods were primarily designed to classify clean images using labeled data. In contrast, real-world clinical data often involve image corruptions that are unique to multi-center studies and stem from variations in imaging equipment across manufacturers. In this paper, we introduce the Medical Vision Transformer (MedViTV2), a novel architecture incorporating Kolmogorov-Arnold Network (KAN) layers into the transformer architecture for the first time, aiming for generalized medical image classification. We have developed an efficient KAN block to reduce computational load while enhancing the accuracy of the original MedViT. Additionally, to counteract the fragility of our MedViT when scaled up, we propose an enhanced Dilated Neighborhood Attention (DiNA), an adaptation of the efficient fused dot-product attention kernel capable of capturing global context and expanding receptive fields to scale the model effectively and addressing feature collapse issues. Moreover, a hierarchical hybrid strategy is introduced to stack our Local Feature Perception and Global Feature Perception blocks in an efficient manner, which balances local and global feature perceptions to boost performance. Extensive experiments on 17 medical image classification datasets and 12 corrupted medical image datasets demonstrate that MedViTV2 achieved state-of-the-art results in 27 out of 29 experiments with reduced computational complexity. MedViTV2 is 44\% more computationally efficient than the previous version and significantly enhances accuracy, achieving improvements of 4.6\% on MedMNIST, 5.8\% on NonMNIST, and 13.4\% on the MedMNIST-C benchmark.
中文: MedViTV2通过引入Kolmogorov-Arnold网络层和增强注意力机制的新型Transformer架构,在医学图像分类任务中实现了最优性能,同时显著提升了计算效率和对图像损坏的鲁棒性。
English: MedViTV2 introduces a novel transformer architecture with Kolmogorov-Arnold Network layers and enhanced attention mechanisms, achieving state-of-the-art performance on medical image classification tasks while significantly improving computational efficiency and robustness against image corruptions.

Authors:Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein
Title: Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation
Abstract:
Human affordance learning investigates contextually relevant novel pose prediction such that the estimated pose represents a valid human action within the scene. While the task is fundamental to machine perception and automated interactive navigation agents, the exponentially large number of probable pose and action variations make the problem challenging and non-trivial. However, the existing datasets and methods for human affordance prediction in 2D scenes are significantly limited in the literature. In this paper, we propose a novel cross-attention mechanism to encode the scene context for affordance prediction by mutually attending spatial feature maps from two different modalities. The proposed method is disentangled among individual subtasks to efficiently reduce the problem complexity. First, we sample a probable location for a person within the scene using a variational autoencoder (VAE) conditioned on the global scene context encoding. Next, we predict a potential pose template from a set of existing human pose candidates using a classifier on the local context encoding around the predicted location. In the subsequent steps, we use two VAEs to sample the scale and deformation parameters for the predicted pose template by conditioning on the local context and template class. Our experiments show significant improvements over the previous baseline of human affordance injection into complex 2D scenes.
中文摘要:本文提出了一种新颖的交叉注意力机制,通过变分自编码器将人体姿态预测分解为多个子任务,实现了在二维场景中预测符合场景背景的有效人体姿态,相比现有方法取得了显著改进。
English Summary: This paper introduces a novel cross-attention mechanism for human affordance learning that uses variational autoencoders to predict contextually valid human poses in 2D scenes through disentangled subtasks, demonstrating significant improvements over previous methods.

Authors:Jaemoon Lee, Xiao Li, Liangji Zhu, Sanjay Ranka, Anand Rangarajan
Title: Guaranteed Conditional Diffusion: 3D Block-based Models for Scientific Data Compression
Abstract:
This paper proposes a new compression paradigm -- Guaranteed Conditional Diffusion with Tensor Correction (GCDTC) -- for lossy scientific data compression. The framework is based on recent conditional diffusion (CD) generative models, and it consists of a conditional diffusion model, tensor correction, and error guarantee. Our diffusion model is a mixture of 3D conditioning and 2D denoising U-Net. The approach leverages a 3D block-based compressing module to address spatiotemporal correlations in structured scientific data. Then, the reverse diffusion process for 2D spatial data is conditioned on the ``slices'' of content latent variables produced by the compressing module. After training, the denoising decoder reconstructs the data with zero noise and content latent variables, and thus it is entirely deterministic. The reconstructed outputs of the CD model are further post-processed by our tensor correction and error guarantee steps to control and ensure a maximum error distortion, which is an inevitable requirement in lossy scientific data compression. Our experiments involving two datasets generated by climate and chemical combustion simulations show that our framework outperforms standard convolutional autoencoders and yields competitive compression quality with an existing scientific data compression algorithm.
中文: 本文提出GCDTC这一新型有损科学数据压缩框架,它结合条件扩散模型与张量校正来确保可控误差失真,在气候和燃烧模拟实验中优于传统自编码器并达到先进算法的压缩质量水平。
English: This paper introduces GCDTC, a novel lossy compression framework for scientific data that combines conditional diffusion models with tensor correction to ensure controlled error distortion, outperforming traditional autoencoders and matching state-of-the-art methods in climate and combustion simulations.

Authors:Mohammad Feli, Iman Azimi, Pasi Liljeberg, Amir M. Rahmani
Title: An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation
Abstract:
Large language models (LLMs) are revolutionizing healthcare by improving diagnosis, patient care, and decision support through interactive communication. More recently, they have been applied to analyzing physiological time-series like wearable data for health insight extraction. Existing methods embed raw numerical sequences directly into prompts, which exceeds token limits and increases computational costs. Additionally, some studies integrated features extracted from time-series in textual prompts or applied multimodal approaches. However, these methods often produce generic and unreliable outputs due to LLMs' limited analytical rigor and inefficiency in interpreting continuous waveforms. In this paper, we develop an LLM-powered agent for physiological time-series analysis aimed to bridge the gap in integrating LLMs with well-established analytical tools. Built on the OpenCHA, an open-source LLM-powered framework, our agent powered by OpenAI's GPT-3.5-turbo model features an orchestrator that integrates user interaction, data sources, and analytical tools to generate accurate health insights. To evaluate its effectiveness, we implement a case study on heart rate (HR) estimation from Photoplethysmogram (PPG) signals using a dataset of PPG and Electrocardiogram (ECG) recordings in a remote health monitoring study. The agent's performance is benchmarked against OpenAI GPT-4o-mini and GPT-4o, with ECG serving as the gold standard for HR estimation. Results demonstrate that our agent significantly outperforms benchmark models by achieving lower error rates and more reliable HR estimations. The agent implementation is publicly available on GitHub.
大语言模型通过整合成熟分析工具的新型智能体分析生理数据,相比现有模型显著提升了健康监测任务的准确性。
Large language models are advancing healthcare by analyzing physiological data through a new agent that integrates established tools, significantly improving accuracy in health monitoring tasks compared to existing models.

Authors:António Farinhas, Nuno M. Guerreiro, Sweta Agrawal, Ricardo Rei, André F. T. Martins
Title: Translate Smart, not Hard: Cascaded Translation Systems with Quality-Aware Deferral
Abstract:
Larger models often outperform smaller ones but come with high computational costs. Cascading offers a potential solution. By default, it uses smaller models and defers only some instances to larger, more powerful models. However, designing effective deferral rules remains a challenge. In this paper, we propose a simple yet effective approach for machine translation, using existing quality estimation (QE) metrics as deferral rules. We show that QE-based deferral allows a cascaded system to match the performance of a larger model while invoking it for a small fraction (30% to 50%) of the examples, significantly reducing computational costs. We validate this approach through both automatic and human evaluation.
Chinese: 通过将质量评估指标用作延迟规则,级联机器翻译系统仅需在30%至50%的案例中调用大型模型即可达到其性能水平,显著降低了计算成本,这一方法已通过自动和人工评估得到验证。
English: Using quality estimation metrics as deferral rules enables a cascaded machine translation system to achieve the performance of a larger model while only invoking it for 30% to 50% of cases, substantially cutting computational costs, as validated by automatic and human evaluations.

Authors:Bingning Wang, Haizhou Zhao, Huozhi Zhou, Liang Song, Mingyu Xu, Wei Cheng, Xiangrong Zeng, Yupeng Zhang, Yuqi Huo, Zecheng Wang, Zhengyun Zhao, Da Pan, Fei Kou, Fei Li, Fuzhong Chen, Guosheng Dong, Han Liu, Hongda Zhang, Jin He, Jinjie Yang, Kangxi Wu, Kegeng Wu, Lei Su, Linlin Niu, Linzhuang Sun, Mang Wang, Pengcheng Fan, Qianli Shen, Rihui Xin, Shunya Dang, Songchi Zhou, Weipeng Chen, Wenjing Luo, Xin Chen, Xin Men, Xionghai Lin, Xuezhen Dong, Yan Zhang, Yifei Duan, Yuyan Zhou, Zhi Ma, Zhiying Wu
Title: Baichuan-M1: Pushing the Medical Capability of Large Language Models
Abstract:
The current generation of large language models (LLMs) is typically designed for broad, general-purpose applications, while domain-specific LLMs, especially in vertical fields like medicine, remain relatively scarce. In particular, the development of highly efficient and practical LLMs for the medical domain is challenging due to the complexity of medical knowledge and the limited availability of high-quality data. To bridge this gap, we introduce Baichuan-M1, a series of large language models specifically optimized for medical applications. Unlike traditional approaches that simply continue pretraining on existing models or apply post-training to a general base model, Baichuan-M1 is trained from scratch with a dedicated focus on enhancing medical capabilities. Our model is trained on 20 trillion tokens and incorporates a range of effective training methods that strike a balance between general capabilities and medical expertise. As a result, Baichuan-M1 not only performs strongly across general domains such as mathematics and coding but also excels in specialized medical fields. We have open-sourced Baichuan-M1-14B, a mini version of our model, which can be accessed through the following links.
中文摘要:白川-M1是专为医疗应用从头训练的大语言模型系列,通过20万亿令牌的训练和优化方法,在通用领域和专业医疗领域均表现出色。
English Summary: Baichuan-M1 is a series of large language models specifically developed from scratch for medical applications, achieving strong performance in both general domains and specialized medical fields through training on 20 trillion tokens with optimized methods.

Authors:Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang
Title: Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research
Abstract:
The rapid advancement of perovskite solar cells (PSCs) has led to an exponential growth in research publications, creating an urgent need for efficient knowledge management and reasoning systems in this domain. We present a comprehensive knowledge-enhanced system for PSCs that integrates three key components. First, we develop Perovskite-KG, a domain-specific knowledge graph constructed from 1,517 research papers, containing 23,789 entities and 22,272 relationships. Second, we create two complementary datasets: Perovskite-Chat, comprising 55,101 high-quality question-answer pairs generated through a novel multi-agent framework, and Perovskite-Reasoning, containing 2,217 carefully curated materials science problems. Third, we introduce two specialized large language models: Perovskite-Chat-LLM for domain-specific knowledge assistance and Perovskite-Reasoning-LLM for scientific reasoning tasks. Experimental results demonstrate that our system significantly outperforms existing models in both domain-specific knowledge retrieval and scientific reasoning tasks, providing researchers with effective tools for literature review, experimental design, and complex problem-solving in PSC research.
中文: 本研究针对钙钛矿太阳能电池领域开发了一套知识增强系统,包含专用知识图谱、问答与推理双数据集及专业大语言模型,在知识检索和科学推理任务中显著优于现有方法。
English: This study introduces a comprehensive knowledge-enhanced system for perovskite solar cells, featuring a domain-specific knowledge graph, dual datasets for Q&A and reasoning, and specialized large language models that outperform existing approaches in knowledge retrieval and scientific tasks.

Authors:Jiatao Li, Yanheng Li, Xinyu Hu, Mingqi Gao, Xiaojun Wan
Title: Aspect-Guided Multi-Level Perturbation Analysis of Large Language Models in Automated Peer Review
Abstract:
We propose an aspect-guided, multi-level perturbation framework to evaluate the robustness of Large Language Models (LLMs) in automated peer review. Our framework explores perturbations in three key components of the peer review process-papers, reviews, and rebuttals-across several quality aspects, including contribution, soundness, presentation, tone, and completeness. By applying targeted perturbations and examining their effects on both LLM-as-Reviewer and LLM-as-Meta-Reviewer, we investigate how aspect-based manipulations, such as omitting methodological details from papers or altering reviewer conclusions, can introduce significant biases in the review process. We identify several potential vulnerabilities: review conclusions that recommend a strong reject may significantly influence meta-reviews, negative or misleading reviews may be wrongly interpreted as thorough, and incomplete or hostile rebuttals can unexpectedly lead to higher acceptance rates. Statistical tests show that these biases persist under various Chain-of-Thought prompting strategies, highlighting the lack of robust critical evaluation in current LLMs. Our framework offers a practical methodology for diagnosing these vulnerabilities, thereby contributing to the development of more reliable and robust automated reviewing systems.
中文摘要:本研究提出一种基于指导性多维度扰动的评估框架,用于检验大语言模型在自动化同行评议中的稳健性,发现其对论文、评审和反驳中的特定扰动存在系统性偏差,这些偏差在不同思维链提示策略下依然持续存在。
English Summary: This study introduces an aspect-guided perturbation framework to assess LLM robustness in automated peer review, revealing systematic biases across papers, reviews, and rebuttals that persist despite different prompting strategies.

Authors:Zongyu Wu, Yuwei Niu, Hongcheng Gao, Minhua Lin, Zhiwei Zhang, Zhifang Zhang, Qi Shi, Yilong Wang, Sike Fu, Junjie Xu, Junjie Ao, Enyan Dai, Lei Feng, Xiang Zhang, Suhang Wang
Title: LanP: Rethinking the Impact of Language Priors in Large Vision-Language Models
Abstract:
Large Vision-Language Models (LVLMs) have shown impressive performance in various tasks. However, LVLMs suffer from hallucination, which hinders their adoption in the real world. Existing studies emphasized that the strong language priors of LVLMs can overpower visual information, causing hallucinations. However, the positive role of language priors is the key to a powerful LVLM. If the language priors are too weak, LVLMs will struggle to leverage rich parameter knowledge and instruction understanding abilities to complete tasks in challenging visual scenarios where visual information alone is insufficient. Therefore, we propose a benchmark called LanP to rethink the impact of Language Priors in LVLMs. It is designed to investigate how strong language priors are in current LVLMs. LanP consists of 170 images and 340 corresponding well-designed questions. Extensive experiments on 25 popular LVLMs reveal that many LVLMs' language priors are not strong enough to effectively aid question answering when objects are partially hidden. Many models, including GPT-4 Turbo, exhibit an accuracy below 0.5 in such a scenario.
中文摘要:大型视觉语言模型面临语言先验的双重挑战:过强会导致幻觉,过弱则难以应对视觉信息不足的复杂场景;新提出的LanP基准测试表明,当前多数模型在物体被部分遮挡时,其语言先验能力尚不足以有效辅助问题解答。
English Summary: Large Vision-Language Models face a delicate balance where overly strong language priors cause hallucinations, yet insufficient priors hinder performance in visually challenging scenarios, prompting the creation of the LanP benchmark which reveals current models' limitations in leveraging language knowledge when visual information is incomplete.

Authors:Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, Gordon Wetzstein
Title: FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views
Abstract:
We present FLARE, a feed-forward model designed to infer high-quality camera poses and 3D geometry from uncalibrated sparse-view images (i.e., as few as 2-8 inputs), which is a challenging yet practical setting in real-world applications. Our solution features a cascaded learning paradigm with camera pose serving as the critical bridge, recognizing its essential role in mapping 3D structures onto 2D image planes. Concretely, FLARE starts with camera pose estimation, whose results condition the subsequent learning of geometric structure and appearance, optimized through the objectives of geometry reconstruction and novel-view synthesis. Utilizing large-scale public datasets for training, our method delivers state-of-the-art performance in the tasks of pose estimation, geometry reconstruction, and novel view synthesis, while maintaining the inference efficiency (i.e., less than 0.5 seconds). The project page and code can be found at: https://zhanghe3z.github.io/FLARE/
Chinese: FLARE是一种前馈模型,能够从稀疏视图图像中高效推断相机姿态并重建三维几何结构,在姿态估计、几何重建和新视角合成任务中实现顶尖性能,且推理时间低于0.5秒。
English: FLARE is a feed-forward model that efficiently estimates camera poses and reconstructs 3D geometry from sparse-view images, achieving state-of-the-art results in pose estimation, geometry reconstruction, and novel-view synthesis in under 0.5 seconds.

Authors:Xinyu Hu, Mingqi Gao, Li Lin, Zhenghan Yu, Xiaojun Wan
Title: A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
Abstract:
In NLG meta-evaluation, evaluation metrics are typically assessed based on their consistency with humans. However, we identify some limitations in traditional NLG meta-evaluation approaches, such as issues in handling human ratings and ambiguous selections of correlation measures, which undermine the effectiveness of meta-evaluation. In this work, we propose a dual-perspective NLG meta-evaluation framework that focuses on different evaluation capabilities, thereby providing better interpretability. In addition, we introduce a method of automatically constructing the corresponding benchmarks without requiring new human annotations. Furthermore, we conduct experiments with 16 representative LLMs as the evaluators based on our proposed framework, comprehensively analyzing their evaluation performance from different perspectives.
中文: 本研究提出了一个双视角的自然语言生成元评估框架,通过关注不同评估能力提升可解释性并实现自动构建基准,同时基于16个大型语言模型的实验全面分析了其评估表现。
English: This study introduces a dual-perspective NLG meta-evaluation framework that enhances interpretability and enables automatic benchmark construction, with experimental validation using 16 LLMs to analyze their evaluation performance comprehensively.

Authors:Leo Schwinn, Yan Scholten, Tom Wollschläger, Sophie Xhonneux, Stephen Casper, Stephan Günnemann, Gauthier Gidel
Title: Adversarial Alignment for LLMs Requires Simpler, Reproducible, and More Measurable Objectives
Abstract:
Misaligned research objectives have considerably hindered progress in adversarial robustness research over the past decade. For instance, an extensive focus on optimizing target metrics, while neglecting rigorous standardized evaluation, has led researchers to pursue ad-hoc heuristic defenses that were seemingly effective. Yet, most of these were exposed as flawed by subsequent evaluations, ultimately contributing little measurable progress to the field. In this position paper, we illustrate that current research on the robustness of large language models (LLMs) risks repeating past patterns with potentially worsened real-world implications. To address this, we argue that realigned objectives are necessary for meaningful progress in adversarial alignment. To this end, we build on established cybersecurity taxonomy to formally define differences between past and emerging threat models that apply to LLMs. Using this framework, we illustrate that progress requires disentangling adversarial alignment into addressable sub-problems and returning to core academic principles, such as measureability, reproducibility, and comparability. Although the field presents significant challenges, the fresh start on adversarial robustness offers the unique opportunity to build on past experience while avoiding previous mistakes.
中文: 对抗性鲁棒性研究因目标错位而进展受阻,需通过重新调整目标、借鉴网络安全分类法,将LLM对抗对齐分解为可解决的子问题,并回归可测量、可复现的核心学术原则。
English: Misaligned research objectives have hindered adversarial robustness progress, requiring realigned goals and cybersecurity frameworks to address LLM threats through measurable, reproducible approaches.

Authors:Yahao Pang, Xingyuan Wu, Xiaojin Zhang, Wei Chen, Hai Jin
Title: FedEAT: A Robustness Optimization Framework for Federated LLMs
Abstract:
Significant advancements have been made by Large Language Models (LLMs) in the domains of natural language understanding and automated content creation. However, they still face persistent problems, including substantial computational costs and inadequate availability of training data. The combination of Federated Learning (FL) and LLMs (federated LLMs) offers a solution by leveraging distributed data while protecting privacy, which positions it as an ideal choice for sensitive domains. However, Federated LLMs still suffer from robustness challenges, including data heterogeneity, malicious clients, and adversarial attacks, which greatly hinder their applications. We first introduce the robustness problems in federated LLMs, to address these challenges, we propose FedEAT (Federated Embedding space Adversarial Training), a novel framework that applies adversarial training in the embedding space of client LLM and employs a robust aggregation approach, specifically geometric median aggregation, to enhance the robustness of Federated LLMs. Our experiments demonstrate that FedEAT effectively improves the robustness of Federated LLMs with minimal performance loss.
Chinese: 大语言模型面临计算与数据挑战,联邦学习提供了隐私保护解决方案,但其鲁棒性问题通过我们提出的FedEAT框架得以有效解决,该框架采用嵌入空间对抗训练和鲁棒聚合方法。
English: Large Language Models face computational and data challenges, but Federated Learning offers a privacy-preserving solution, though it suffers from robustness issues that our proposed FedEAT framework effectively addresses through embedding space adversarial training and robust aggregation.

Authors:Xuan Tong, Yang Chang, Qing Zhao, Jiawen Yu, Boyang Wang, Junxiong Lin, Yuxuan Lin, Xinji Mai, Haoran Wang, Zeng Tao, Yan Wang, Wenqiang Zhang
Title: Component-aware Unsupervised Logical Anomaly Generation for Industrial Anomaly Detection
Abstract:
Anomaly detection is critical in industrial manufacturing for ensuring product quality and improving efficiency in automated processes. The scarcity of anomalous samples limits traditional detection methods, making anomaly generation essential for expanding the data repository. However, recent generative models often produce unrealistic anomalies increasing false positives, or require real-world anomaly samples for training. In this work, we treat anomaly generation as a compositional problem and propose ComGEN, a component-aware and unsupervised framework that addresses the gap in logical anomaly generation. Our method comprises a multi-component learning strategy to disentangle visual components, followed by subsequent generation editing procedures. Disentangled text-to-component pairs, revealing intrinsic logical constraints, conduct attention-guided residual mapping and model training with iteratively matched references across multiple scales. Experiments on the MVTecLOCO dataset confirm the efficacy of ComGEN, achieving the best AUROC score of 91.2%. Additional experiments on the real-world scenario of Diesel Engine and widely-used MVTecAD dataset demonstrate significant performance improvements when integrating simulated anomalies generated by ComGEN into automated production workflows.
中文摘要:由于异常样本稀缺,异常生成在工业制造中至关重要;所提出的ComGEN框架将异常生成视为组合问题,通过组件感知的无监督方法解决了现有模型的局限性,在基准数据集上取得了最优性能。
English Summary: Anomaly generation is essential in industrial manufacturing due to scarce anomalous samples, and the proposed ComGEN framework addresses limitations of existing models by treating anomaly generation as a compositional problem, achieving state-of-the-art performance on benchmark datasets.

Authors:Mukun Chen, Jia Wu, Shirui Pan, Fu Lin, Bo Du, Xiuwen Gong, Wenbin Hu
Title: Knowledge-aware contrastive heterogeneous molecular graph learning
Abstract:
Molecular representation learning is pivotal in predicting molecular properties and advancing drug design. Traditional methodologies, which predominantly rely on homogeneous graph encoding, are limited by their inability to integrate external knowledge and represent molecular structures across different levels of granularity. To address these limitations, we propose a paradigm shift by encoding molecular graphs into heterogeneous structures, introducing a novel framework: Knowledge-aware Contrastive Heterogeneous Molecular Graph Learning (KCHML). This approach leverages contrastive learning to enrich molecular representations with embedded external knowledge. KCHML conceptualizes molecules through three distinct graph views-molecular, elemental, and pharmacological-enhanced by heterogeneous molecular graphs and a dual message-passing mechanism. This design offers a comprehensive representation for property prediction, as well as for downstream tasks such as drug-drug interaction (DDI) prediction. Extensive benchmarking demonstrates KCHML's superiority over state-of-the-art molecular property prediction models, underscoring its ability to capture intricate molecular features.
中文: 提出的KCHML框架通过异构图编码和对比学习整合外部知识,提升了分子性质预测性能,显著优于现有方法。
English: The proposed KCHML framework enhances molecular property prediction by integrating external knowledge through heterogeneous graph encoding and contrastive learning, outperforming existing methods.

Authors:Samuele Bortolotti, Emanuele Marconato, Paolo Morettin, Andrea Passerini, Stefano Teso
Title: Shortcuts and Identifiability in Concept-based Models from a Neuro-Symbolic Lens
Abstract:
Concept-based Models are neural networks that learn a concept extractor to map inputs to high-level concepts and an inference layer to translate these into predictions. Ensuring these modules produce interpretable concepts and behave reliably in out-of-distribution is crucial, yet the conditions for achieving this remain unclear. We study this problem by establishing a novel connection between Concept-based Models and reasoning shortcuts (RSs), a common issue where models achieve high accuracy by learning low-quality concepts, even when the inference layer is fixed and provided upfront. Specifically, we first extend RSs to the more complex setting of Concept-based Models and then derive theoretical conditions for identifying both the concepts and the inference layer. Our empirical results highlight the impact of reasoning shortcuts and show that existing methods, even when combined with multiple natural mitigation strategies, often fail to meet these conditions in practice.
Chinese Summary: 基于概念的模型因推理捷径问题难以保证概念可解释性和分布外可靠性,理论与实证研究表明,即使推理层固定,现有方法结合多种缓解策略仍常无法满足条件。
English Summary: Concept-based Models face challenges in ensuring interpretable concepts and reliability, primarily due to reasoning shortcuts that compromise quality even with a fixed inference layer, as theoretical and empirical findings reveal current methods often fall short.

Authors:Ken Lin, Qi Ye, Tin Lun Lam, Zhibin Li, Jiming Chen, Gaofeng Li
Title: Motion planning for highly-dynamic unconditioned reflexes based on chained Signed Distance Functions
Abstract:
The unconditioned reflex (e.g., protective reflex), which is the innate reaction of the organism and usually performed through the spinal cord rather than the brain, can enable organisms to escape harms from environments. In this paper, we propose an online, highly-dynamic motion planning algorithm to endow manipulators the highly-dynamic unconditioned reflexes to humans and/or environments. Our method is based on a chained version of Signed Distance Functions (SDFs), which can be pre-computed and stored. Our proposed algorithm is divided into two stages. In the offline stage, we create 3 groups of local SDFs to store the geometric information of the manipulator and its working environment. In the online stage, the pre-computed local SDFs are chained together according the configuration of the manipulator, to provide global geometric information about the environment. While the point clouds of the dynamic objects serve as query points to look up these local SDFs for quickly generating escape velocity. Then we propose a modified geometric Jacobian matrix and use the Jacobian-pseudo-inverse method to generate real-time reflex behaviors to avoid the static and dynamic obstacles in the environment. The benefits of our method are validated in both static and dynamic scenarios. In the static scenario, our method identifies the path solutions with lower time consumption and shorter trajectory length compared to existing solutions. In the dynamic scenario, our method can reliably pursue the dynamic target point, avoid dynamic obstacles, and react to these obstacles within 1ms, which surpasses the unconditioned reflex reaction time of humans.
中文摘要:本文提出一种在线运动规划算法,通过预计算链式符号距离函数,赋予机械臂超越人类反应速度的高度动态无条件反射能力,实现实时避障。
English Summary: This paper introduces an online motion planning algorithm that equips robotic manipulators with highly-dynamic unconditioned reflexes, using pre-computed chained Signed Distance Functions to generate real-time obstacle avoidance reactions faster than human reflex times.

Authors:Tim Beyer, Jan Schuchardt, Leo Schwinn, Stephan Günnemann
Title: Fast Proxies for LLM Robustness Evaluation
Abstract:
Evaluating the robustness of LLMs to adversarial attacks is crucial for safe deployment, yet current red-teaming methods are often prohibitively expensive. We compare the ability of fast proxy metrics to predict the real-world robustness of an LLM against a simulated attacker ensemble. This allows us to estimate a model's robustness to computationally expensive attacks without requiring runs of the attacks themselves. Specifically, we consider gradient-descent-based embedding-space attacks, prefilling attacks, and direct prompting. Even though direct prompting in particular does not achieve high ASR, we find that it and embedding-space attacks can predict attack success rates well, achieving $r_p=0.87$ (linear) and $r_s=0.94$ (Spearman rank) correlations with the full attack ensemble while reducing computational cost by three orders of magnitude.
中文: 本研究证明,计算高效的代理方法(特别是直接提示和嵌入空间攻击)能够以高相关性准确预测大语言模型对抗成本高昂的对抗攻击的鲁棒性,同时显著降低计算开销。
English: This study demonstrates that computationally efficient proxy methods, particularly direct prompting and embedding-space attacks, can accurately predict large language models' robustness against costly adversarial attacks with high correlation and significantly reduced computational expenses.

Authors:Yuhang Dong, Haizhou Ge, Yupei Zeng, Jiangning Zhang, Beiwen Tian, Guanzhong Tian, Hongrui Zhu, Yufei Jia, Ruixiang Wang, Ran Yi, Guyue Zhou, Longhua Ma
Title: Imit Diff: Semantics Guided Diffusion Transformer with Dual Resolution Fusion for Imitation Learning
Abstract:
Visuomotor imitation learning enables embodied agents to effectively acquire manipulation skills from video demonstrations and robot proprioception. However, as scene complexity and visual distractions increase, existing methods that perform well in simple scenes tend to degrade in performance. To address this challenge, we introduce Imit Diff, a semanstic guided diffusion transformer with dual resolution fusion for imitation learning. Our approach leverages prior knowledge from vision language foundation models to translate high-level semantic instruction into pixel-level visual localization. This information is explicitly integrated into a multi-scale visual enhancement framework, constructed with a dual resolution encoder. Additionally, we introduce an implementation of Consistency Policy within the diffusion transformer architecture to improve both real-time performance and motion smoothness in embodied agent control.We evaluate Imit Diff on several challenging real-world tasks. Due to its task-oriented visual localization and fine-grained scene perception, it significantly outperforms state-of-the-art methods, especially in complex scenes with visual distractions, including zero-shot experiments focused on visual distraction and category generalization. The code will be made publicly available.
中文: Imit Diff 是一种语义引导的扩散变换器,通过融合视觉语言模型将高级语义指令转化为像素级视觉定位,在复杂和视觉干扰场景中显著优于现有模仿学习方法。
English: Imit Diff is a semantic-guided diffusion transformer that integrates vision-language models to enhance imitation learning by translating semantic instructions into precise visual localization, significantly outperforming existing methods in complex, visually distracting scenes.

Authors:Stephen Casper, David Krueger, Dylan Hadfield-Menell
Title: Pitfalls of Evidence-Based AI Policy
Abstract:
Nations across the world are working to govern AI. However, from a technical perspective, there is uncertainty and disagreement on the best way to do this. Meanwhile, recent debates over AI regulation have led to calls for "evidence-based AI policy" which emphasize holding regulatory action to a high evidentiary standard. Evidence is of irreplaceable value to policymaking. However, holding regulatory action to too high an evidentiary standard can lead to systematic neglect of certain risks. In historical policy debates (e.g., over tobacco ca. 1965 and fossil fuels ca. 1985) "evidence-based policy" rhetoric is also a well-precedented strategy to downplay the urgency of action, delay regulation, and protect industry interests. Here, we argue that if the goal is evidence-based AI policy, the first regulatory objective must be to actively facilitate the process of identifying, studying, and deliberating about AI risks. We discuss a set of 15 regulatory goals to facilitate this and show that Brazil, Canada, China, the EU, South Korea, the UK, and the USA all have substantial opportunities to adopt further evidence-seeking policies.
中文: 全球AI治理面临技术不确定性和证据标准争议,过度强调证据可能延误监管并忽视风险,因此需将主动识别与研究AI风险作为首要监管目标,推动各国采纳证据收集政策。
English: Global efforts to regulate AI face technical uncertainties and debates over evidence-based policies, which risk delaying action and overlooking dangers, necessitating regulatory goals that actively investigate AI risks and promote evidence-gathering across nations.

Authors:Kun Li, Yida Xiong, Hongzhi Zhang, Xiantao Cai, Jia Wu, Bo Du, Wenbin Hu
Title: Graph-structured Small Molecule Drug Discovery Through Deep Learning: Progress, Challenges, and Opportunities
Abstract:
Due to their excellent drug-like and pharmacokinetic properties, small molecule drugs are widely used to treat various diseases, making them a critical component of drug discovery. In recent years, with the rapid development of deep learning (DL) techniques, DL-based small molecule drug discovery methods have achieved excellent performance in prediction accuracy, speed, and complex molecular relationship modeling compared to traditional machine learning approaches. These advancements enhance drug screening efficiency and optimization and provide more precise and effective solutions for various drug discovery tasks. Contributing to this field's development, this paper aims to systematically summarize and generalize the recent key tasks and representative techniques in graph-structured small molecule drug discovery in recent years. Specifically, we provide an overview of the major tasks in small molecule drug discovery and their interrelationships. Next, we analyze the six core tasks, summarizing the related methods, commonly used datasets, and technological development trends. Finally, we discuss key challenges, such as interpretability and out-of-distribution generalization, and offer our insights into future research directions for small molecule drug discovery.
中文摘要:基于深度学习的小分子药物发现方法在预测精度、速度和复杂分子关系建模方面优于传统机器学习,提升了药物筛选效率并提供了精准解决方案,本文系统梳理了该领域核心任务、代表性技术、数据集及未来研究方向。
English Summary: Deep learning-based methods for small molecule drug discovery have surpassed traditional machine learning in accuracy, speed, and molecular modeling, enhancing drug screening efficiency and providing precise solutions, while this paper systematically reviews key tasks, techniques, datasets, challenges, and future directions in the field.

Authors:Anupama Sitaraman, Adam Lechowicz, Noman Bashir, Xutong Liu, Mohammad Hajiesmaili, Prashant Shenoy
Title: Dynamic Incentive Allocation for City-scale Deep Decarbonization
Abstract:
Greenhouse gas emissions from the residential sector represent a significant fraction of global emissions. Governments and utilities have designed incentives to stimulate the adoption of decarbonization technologies such as rooftop PV and heat pumps. However, studies have shown that many of these incentives are inefficient since a substantial fraction of spending does not actually promote adoption, and incentives are not equitably distributed across socioeconomic groups. We present a novel data-driven approach that adopts a holistic, emissions-based and city-scale perspective on decarbonization. We propose an optimization model that dynamically allocates a total incentive budget to households to directly maximize city-wide carbon reduction. We leverage techniques for the multi-armed bandits problem to estimate human factors, such as a household's willingness to adopt new technologies given a certain incentive. We apply our proposed framework to a city in the Northeast U.S., using real household energy data, grid carbon intensity data, and future price scenarios. We show that our learning-based technique significantly outperforms an example status quo incentive scheme, achieving up to 32.23% higher carbon reductions. We show that our framework can accommodate equity-aware constraints to equitably allocate incentives across socioeconomic groups, achieving 78.84% of the carbon reductions of the optimal solution on average.
中文摘要:本研究提出一种数据驱动的优化模型,通过动态分配家庭激励资金直接实现城市碳减排最大化,相比现行激励方案显著提升了减排效果与公平性。
English Summary: This study introduces a data-driven optimization model that dynamically allocates incentives to households to maximize city-wide carbon reduction, demonstrating significantly higher effectiveness and equity than current incentive schemes.

Authors:Nikos Zarifis, Puqian Wang, Ilias Diakonikolas, Jelena Diakonikolas
Title: Robustly Learning Monotone Generalized Linear Models via Data Augmentation
Abstract:
We study the task of learning Generalized Linear models (GLMs) in the agnostic model under the Gaussian distribution. We give the first polynomial-time algorithm that achieves a constant-factor approximation for \textit{any} monotone Lipschitz activation. Prior constant-factor GLM learners succeed for a substantially smaller class of activations. Our work resolves a well-known open problem, by developing a robust counterpart to the classical GLMtron algorithm (Kakade et al., 2011). Our robust learner applies more generally, encompassing all monotone activations with bounded $(2+ζ)$-moments, for any fixed $ζ>0$ -- a condition that is essentially necessary. To obtain our results, we leverage a novel data augmentation technique with decreasing Gaussian noise injection and prove a number of structural results that may be useful in other settings.
Chinese: 本研究首次提出了在高斯分布下学习任意单调Lipschitz广义线性模型的多项式时间算法,通过开发GLMtron算法的鲁棒扩展,解决了这一长期存在的开放性问题,实现了常数因子近似。
English: This work presents the first polynomial-time algorithm that achieves constant-factor approximation for learning any monotone Lipschitz Generalized Linear Model under Gaussian distribution, resolving a long-standing open problem by developing a robust extension of the GLMtron algorithm.

Authors:Jihoon Tack, Jack Lanchantin, Jane Yu, Andrew Cohen, Ilia Kulikov, Janice Lan, Shibo Hao, Yuandong Tian, Jason Weston, Xian Li
Title: LLM Pretraining with Continuous Concepts
Abstract:
Next token prediction has been the standard training objective used in large language model pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts continuous concepts learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction, knowledge distillation and inserting pause tokens. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model's internal reasoning process.
Chinese: CoCoMix是一种新颖的预训练框架,通过将连续概念预测与离散标记预测相结合,以端到端的概念学习和交错方式提升模型性能、样本效率及可解释性。
English: CoCoMix is a novel pretraining framework that enhances language models by integrating continuous concept prediction with discrete token prediction, improving performance, sample efficiency, and interpretability through end-to-end concept learning and interleaving.

Authors:Heejin Do, Taehee Park, Sangwon Ryu, Gary Geunbae Lee
Title: Towards Prompt Generalization: Grammar-aware Cross-Prompt Automated Essay Scoring
Abstract:
In automated essay scoring (AES), recent efforts have shifted toward cross-prompt settings that score essays on unseen prompts for practical applicability. However, prior methods trained with essay-score pairs of specific prompts pose challenges in obtaining prompt-generalized essay representation. In this work, we propose a grammar-aware cross-prompt trait scoring (GAPS), which internally captures prompt-independent syntactic aspects to learn generic essay representation. We acquire grammatical error-corrected information in essays via the grammar error correction technique and design the AES model to seamlessly integrate such information. By internally referring to both the corrected and the original essays, the model can focus on generic features during training. Empirical experiments validate our method's generalizability, showing remarkable improvements in prompt-independent and grammar-related traits. Furthermore, GAPS achieves notable QWK gains in the most challenging cross-prompt scenario, highlighting its strength in evaluating unseen prompts.
中文: 本研究提出GAPS方法,通过语法纠错技术获取与题目无关的句法特征,在跨题目作文评分中展现出卓越的泛化能力,显著提升了未见题目的评估效果。
English: This study introduces GAPS, a grammar-aware cross-prompt scoring method that uses grammatical error correction to capture prompt-independent syntactic features, achieving superior generalizability and significant improvements in cross-prompt essay evaluation.

Authors:Shahbaz Siddeeq, Zeeshan Rasheed, Malik Abdul Sami, Mahade Hasan, Muhammad Waseem, Jussi Rasku, Mika Saari, Kai-Kristian Kemell, Pekka Abrahamsson
Title: Distributed Approach to Haskell Based Applications Refactoring with LLMs Based Multi-Agent Systems
Abstract:
We present a large language models (LLMs) based multi-agent system to automate the refactoring of Haskell codebases. The multi-agent system consists of specialized agents performing tasks such as context analysis, refactoring, validation, and testing. Refactoring improvements are using metrics such as cyclomatic complexity, run-time, and memory allocation. Experimental evaluations conducted on Haskell codebases demonstrate improvements in code quality. Cyclomatic complexity was reduced by 13.64% and 47.06% in the respective codebases. Memory allocation improved by 4.17% and 41.73%, while runtime efficiency increased by up to 50%. These metrics highlight the systems ability to optimize Haskells functional paradigms while maintaining correctness and scalability. Results show reductions in complexity and performance enhancements across codebases. The integration of LLMs based multi-agent system enables precise task execution and inter-agent collaboration, addressing the challenges of refactoring in functional programming. This approach aims to address the challenges of refactoring functional programming languages through distributed and modular systems.
Chinese: 本研究提出了一种基于大语言模型的多智能体系统,用于自动化重构Haskell代码,显著降低了圈复杂度、内存占用和运行时间,同时提升了代码质量与可扩展性。
English: This study introduces a multi-agent system powered by large language models that automates Haskell code refactoring, achieving significant reductions in cyclomatic complexity, memory usage, and runtime while improving code quality and scalability.

Authors:Ao Li, Wei Fang, Hongbo Zhao, Le Lu, Ge Yang, Minfeng Xu
Title: MaRS: A Fast Sampler for Mean Reverting Diffusion based on ODE and SDE Solvers
Abstract:
In applications of diffusion models, controllable generation is of practical significance, but is also challenging. Current methods for controllable generation primarily focus on modifying the score function of diffusion models, while Mean Reverting (MR) Diffusion directly modifies the structure of the stochastic differential equation (SDE), making the incorporation of image conditions simpler and more natural. However, current training-free fast samplers are not directly applicable to MR Diffusion. And thus MR Diffusion requires hundreds of NFEs (number of function evaluations) to obtain high-quality samples. In this paper, we propose a new algorithm named MaRS (MR Sampler) to reduce the sampling NFEs of MR Diffusion. We solve the reverse-time SDE and the probability flow ordinary differential equation (PF-ODE) associated with MR Diffusion, and derive semi-analytical solutions. The solutions consist of an analytical function and an integral parameterized by a neural network. Based on this solution, we can generate high-quality samples in fewer steps. Our approach does not require training and supports all mainstream parameterizations, including noise prediction, data prediction and velocity prediction. Extensive experiments demonstrate that MR Sampler maintains high sampling quality with a speedup of 10 to 20 times across ten different image restoration tasks. Our algorithm accelerates the sampling procedure of MR Diffusion, making it more practical in controllable generation.
中文摘要:本文提出无需训练的MaRS算法,通过半解析求解均值回复扩散的随机微分方程和概率流常微分方程,将采样速度提升10-20倍,在保持高质量生成的同时显著提高了可控生成效率。
English Summary: The paper introduces MaRS, a training-free algorithm that accelerates Mean Reverting Diffusion sampling by 10-20 times through semi-analytical solutions to its SDE and PF-ODE, enabling high-quality controllable generation with fewer steps.

Authors:Mukund Agarwalla, Himanshu Kumar, Raj Dandekar, Rajat Dandekar, Sreedath Panat
Title: NanoVLMs: How small can we go and still make coherent Vision Language Models?
Abstract:
Vision-Language Models (VLMs), such as GPT-4V and Llama 3.2 vision, have garnered significant research attention for their ability to leverage Large Language Models (LLMs) in multimodal tasks. However, their potential is constrained by inherent challenges, including proprietary restrictions, substantial computational demands, and limited accessibility. Smaller models, such as GIT and BLIP, exhibit marked limitations, often failing to generate coherent and consistent text beyond a few tokens, even with extensive training. This underscores a pivotal inquiry: how small can a VLM be and still produce fluent and consistent text? Drawing inspiration from the exceptional learning process of 3-4 year old children, who rely heavily on visual cues for understanding and communication, we introduce two novel datasets: ShortDesc (featuring concise image descriptions) and LongDesc (containing more detailed image descriptions). These datasets consist of image-text pairs where the text is restricted to the simple vocabulary and syntax typically used by young children, generated with a scaled-down model, GPT-4o. Using these datasets, we demonstrate that it is possible to train VLMs that are significantly smaller, up to 10 times smaller than state of the art(SOTA) small VLMs while maintaining architectural simplicity. To evaluate the outputs, we leverage GPT-4o to grade the text, as if stories written by students, on creativity, meaningfulness, and consistency, assigning scores out of 10. This method addresses limitations of standard benchmarks by accommodating unstructured outputs and providing a multidimensional evaluation of the model capabilities. Our findings contribute to the development of lightweight, accessible multimodal models for resource constrained environments.
中文: 视觉语言模型在规模和可访问性方面存在局限,本研究通过引入儿童启发式数据集,成功训练出显著更小的模型,在保持文本流畅性和一致性的同时实现了多维能力评估。
English: Vision-Language Models face limitations in size and accessibility, but this study introduces child-inspired datasets to train significantly smaller models that maintain text fluency and consistency while enabling multidimensional evaluation.

Authors:Feng Liang, Haoyu Ma, Zecheng He, Tingbo Hou, Ji Hou, Kunpeng Li, Xiaoliang Dai, Felix Juefei-Xu, Samaneh Azadi, Animesh Sinha, Peizhao Zhang, Peter Vajda, Diana Marculescu
Title: Movie Weaver: Tuning-Free Multi-Concept Video Personalization with Anchored Prompts
Abstract:
Video personalization, which generates customized videos using reference images, has gained significant attention. However, prior methods typically focus on single-concept personalization, limiting broader applications that require multi-concept integration. Attempts to extend these models to multiple concepts often lead to identity blending, which results in composite characters with fused attributes from multiple sources. This challenge arises due to the lack of a mechanism to link each concept with its specific reference image. We address this with anchored prompts, which embed image anchors as unique tokens within text prompts, guiding accurate referencing during generation. Additionally, we introduce concept embeddings to encode the order of reference images. Our approach, Movie Weaver, seamlessly weaves multiple concepts-including face, body, and animal images-into one video, allowing flexible combinations in a single model. The evaluation shows that Movie Weaver outperforms existing methods for multi-concept video personalization in identity preservation and overall quality.
中文:Movie Weaver通过锚定提示和概念嵌入技术,实现了多概念视频个性化生成,有效避免了身份混合问题,在保持各参考图像独立特征的同时,其生成质量优于现有方法。
English: Movie Weaver introduces anchored prompts and concept embeddings to enable seamless multi-concept video personalization, effectively preventing identity blending while preserving distinct attributes from multiple reference images and outperforming existing methods in quality.

Authors:Italo Santos, Katia Romero Felizardo, Igor Steinmacher, Marco A. Gerosa
Title: Great Power Brings Great Responsibility: Personalizing Conversational AI for Diverse Problem-Solvers
Abstract:
Newcomers onboarding to Open Source Software (OSS) projects face many challenges. Large Language Models (LLMs), like ChatGPT, have emerged as potential resources for answering questions and providing guidance, with many developers now turning to ChatGPT over traditional Q&A sites like Stack Overflow. Nonetheless, LLMs may carry biases in presenting information, which can be especially impactful for newcomers whose problem-solving styles may not be broadly represented. This raises important questions about the accessibility of AI-driven support for newcomers to OSS projects. This vision paper outlines the potential of adapting AI responses to various problem-solving styles to avoid privileging a particular subgroup. We discuss the potential of AI persona-based prompt engineering as a strategy for interacting with AI. This study invites further research to refine AI-based tools to better support contributions to OSS projects.
中文: 开源软件项目的新手越来越多地依赖ChatGPT等大型语言模型获取指导,但这些AI工具可能存在偏见,不利于某些问题解决风格,因此需要可适应的AI响应来确保公平支持。
English: Newcomers to open source software projects increasingly rely on large language models like ChatGPT for guidance, but these AI tools may exhibit biases that disadvantage certain problem-solving styles, highlighting the need for adaptable AI responses to ensure equitable support.

Authors:Yuanxun Lu, Jingyang Zhang, Tian Fang, Jean-Daniel Nahmias, Yanghai Tsin, Long Quan, Xun Cao, Yao Yao, Shiwei Li
Title: Matrix3D: Large Photogrammetry Model All-in-One
Abstract:
We present Matrix3D, a unified model that performs several photogrammetry subtasks, including pose estimation, depth prediction, and novel view synthesis using just the same model. Matrix3D utilizes a multi-modal diffusion transformer (DiT) to integrate transformations across several modalities, such as images, camera parameters, and depth maps. The key to Matrix3D's large-scale multi-modal training lies in the incorporation of a mask learning strategy. This enables full-modality model training even with partially complete data, such as bi-modality data of image-pose and image-depth pairs, thus significantly increases the pool of available training data. Matrix3D demonstrates state-of-the-art performance in pose estimation and novel view synthesis tasks. Additionally, it offers fine-grained control through multi-round interactions, making it an innovative tool for 3D content creation. Project page: https://nju-3dv.github.io/projects/matrix3d.
Chinese: Matrix3D是一个统一的摄影测量模型,采用多模态扩散变换器处理姿态估计、深度预测和新视角合成,通过掩码学习策略利用不完整数据进行训练,实现了最先进的性能。
English: Matrix3D is a unified photogrammetry model that uses a multi-modal diffusion transformer to handle pose estimation, depth prediction, and novel view synthesis, achieving state-of-the-art performance through mask learning for training with incomplete data.

Authors:Anji Liu, Xuejie Liu, Dayuan Zhao, Mathias Niepert, Yitao Liang, Guy Van den Broeck
Title: Tractable Transformers for Flexible Conditional Generation
Abstract:
Non-autoregressive (NAR) generative models are valuable because they can handle diverse conditional generation tasks in a more principled way than their autoregressive (AR) counterparts, which are constrained by sequential dependency requirements. Recent advancements in NAR models, such as diffusion language models, have demonstrated superior performance in unconditional generation compared to AR models (e.g., GPTs) of similar sizes. However, such improvements do not always lead to improved conditional generation performance. We show that a key reason for this gap is the difficulty in generalizing to conditional probability queries (i.e., the set of unknown variables) unseen during training. As a result, strong unconditional generation performance does not guarantee high-quality conditional generation. This paper proposes Tractable Transformers (Tracformer), a Transformer-based generative model that is more robust to different conditional generation tasks. Unlike existing models that rely solely on global contextual features derived from full inputs, Tracformers incorporate a sparse Transformer encoder to capture both local and global contextual information. This information is routed through a decoder for conditional generation. Empirical results demonstrate that Tracformers achieve state-of-the-art conditional generation performance on text modeling compared to recent diffusion and AR model baselines.
中文:非自回归模型在无条件生成方面表现优异,但在条件生成中因泛化问题受限;本文提出的Tracformer通过结合局部与全局上下文信息,在条件生成任务中实现了领先性能。
English: Non-autoregressive models like diffusion language models excel in unconditional generation but struggle with conditional tasks due to generalization issues, prompting the development of Tracformer, a Transformer-based model that integrates local and global context for superior conditional generation performance.

Authors:Lin-Zhuo Chen, Kangjie Liu, Youtian Lin, Siyu Zhu, Zhihao Li, Xun Cao, Yao Yao
Title: Flow Distillation Sampling: Regularizing 3D Gaussians with Pre-trained Matching Priors
Abstract:
3D Gaussian Splatting (3DGS) has achieved excellent rendering quality with fast training and rendering speed. However, its optimization process lacks explicit geometric constraints, leading to suboptimal geometric reconstruction in regions with sparse or no observational input views. In this work, we try to mitigate the issue by incorporating a pre-trained matching prior to the 3DGS optimization process. We introduce Flow Distillation Sampling (FDS), a technique that leverages pre-trained geometric knowledge to bolster the accuracy of the Gaussian radiance field. Our method employs a strategic sampling technique to target unobserved views adjacent to the input views, utilizing the optical flow calculated from the matching model (Prior Flow) to guide the flow analytically calculated from the 3DGS geometry (Radiance Flow). Comprehensive experiments in depth rendering, mesh reconstruction, and novel view synthesis showcase the significant advantages of FDS over state-of-the-art methods. Additionally, our interpretive experiments and analysis aim to shed light on the effects of FDS on geometric accuracy and rendering quality, potentially providing readers with insights into its performance. Project page: https://nju-3dv.github.io/projects/fds
中文摘要:本文提出流蒸馏采样方法,通过引入预训练的几何先验知识,利用光流引导优化3D高斯溅射过程,有效提升了稀疏观测区域的几何重建精度。
English Summary: This paper introduces Flow Distillation Sampling (FDS), a method that enhances 3D Gaussian Splatting by integrating pre-trained geometric priors to improve reconstruction accuracy in under-observed areas through optical flow guidance.

Authors:Son Nguyen, Bo Liu, Lizhang Chen, Qiang Liu
Title: Improving Adaptive Moment Optimization via Preconditioner Diagonalization
Abstract:
Modern adaptive optimization methods, such as Adam and its variants, have emerged as the most widely used tools in deep learning over recent years. These algorithms offer automatic mechanisms for dynamically adjusting the update step based on estimates of gradient statistics. Compared to traditional algorithms like Stochastic Gradient Descent, these adaptive methods are typically more robust to model scale and hyperparameter tuning. However, the gradient statistics employed by these methods often do not leverage sufficient gradient covariance information, leading to suboptimal updates in certain directions of the parameter space and potentially slower convergence. In this work, we keep track of such covariance statistics in the form of a structured preconditioner matrix. Unlike other works, our approach does not apply direct approximations to estimate this matrix. We instead implement an invertible transformation that maps the preconditioner matrix into a new space where it becomes approximately diagonal. This enables a diagonal approximation of the preconditioner matrix in the transformed space, offering several computational advantages. Empirical results show that our approach can substantially enhance the convergence speed of modern adaptive optimizers. Notably, for large language models like LLaMA, we can achieve a speedup of 2x compared to the baseline Adam. Additionally, our method can be integrated with memory-efficient optimizers like Adafactor to manage computational overhead.
中文: 现代自适应优化器如Adam通过梯度统计动态调整更新步长,但常忽略梯度协方差导致收敛不佳;我们的方法通过结构化预条件矩阵跟踪协方差,应用可逆变换实现对角近似,显著提升收敛速度,在LLaMA等大型模型上提速高达2倍,并可结合内存高效优化器使用。
English: Modern adaptive optimizers like Adam enhance deep learning by dynamically adjusting update steps using gradient statistics, but they often neglect gradient covariance, leading to suboptimal convergence; our method addresses this by tracking covariance via a structured preconditioner matrix, applying an invertible transformation for diagonal approximation, which significantly boosts convergence speed, achieving up to 2x speedup for models like LLaMA while integrating with memory-efficient optimizers.

Authors:Hongyu An, Xinfeng Zhang, Shijie Zhao, Li Zhang, Ruiqin Xiong
Title: Spatial Degradation-Aware and Temporal Consistent Diffusion Model for Compressed Video Super-Resolution
Abstract:
Due to storage and bandwidth limitations, videos transmitted over the Internet often exhibit low quality, characterized by low-resolution and compression artifacts. Although video super-resolution (VSR) is an efficient video enhancing technique, existing VSR methods focus less on compressed videos. Consequently, directly applying general VSR approaches fails to improve practical videos with compression artifacts, especially when frames are highly compressed at a low bit rate. The inevitable quantization information loss complicates the reconstruction of texture details. Recently, diffusion models have shown superior performance in low-level visual tasks. Leveraging the high-realism generation capability of diffusion models, we propose a novel method that exploits the priors of pre-trained diffusion models for compressed VSR. To mitigate spatial distortions and refine temporal consistency, we introduce a Spatial Degradation-Aware and Temporal Consistent (SDATC) diffusion model. Specifically, we incorporate a distortion control module (DCM) to modulate diffusion model inputs, thereby minimizing the impact of noise from low-quality frames on the generation stage. Subsequently, the diffusion model performs a denoising process to generate details, guided by a fine-tuned compression-aware prompt module (CAPM) and a spatio-temporal attention module (STAM). CAPM dynamically encodes compression-related information into prompts, enabling the sampling process to adapt to different degradation levels. Meanwhile, STAM extends the spatial attention mechanism into the spatio-temporal dimension, effectively capturing temporal correlations. Additionally, we utilize optical flow-based alignment during each denoising step to enhance the smoothness of output videos. Extensive experimental results on benchmark datasets demonstrate the effectiveness of our proposed modules in restoring compressed videos.
中文摘要:本文提出一种空间退化感知与时序一致性扩散模型,利用预训练扩散先验增强压缩视频,通过失真控制和时空注意力机制有效恢复纹理细节并保持帧间连贯性。
English Summary: This paper introduces a Spatial Degradation-Aware and Temporal Consistent diffusion model that leverages pre-trained diffusion priors to enhance compressed videos by incorporating distortion control and spatio-temporal attention mechanisms for improved detail reconstruction and temporal consistency.

Authors:Subin Kim, Hoonrae Kim, Heejin Do, Gary Geunbae Lee
Title: Multimodal Cognitive Reframing Therapy via Multi-hop Psychotherapeutic Reasoning
Abstract:
Previous research has revealed the potential of large language models (LLMs) to support cognitive reframing therapy; however, their focus was primarily on text-based methods, often overlooking the importance of non-verbal evidence crucial in real-life therapy. To alleviate this gap, we extend the textual cognitive reframing to multimodality, incorporating visual clues. Specifically, we present a new dataset called Multi Modal-Cognitive Support Conversation (M2CoSC), which pairs each GPT-4-generated dialogue with an image that reflects the virtual client's facial expressions. To better mirror real psychotherapy, where facial expressions lead to interpreting implicit emotional evidence, we propose a multi-hop psychotherapeutic reasoning approach that explicitly identifies and incorporates subtle evidence. Our comprehensive experiments with both LLMs and vision-language models (VLMs) demonstrate that the VLMs' performance as psychotherapists is significantly improved with the M2CoSC dataset. Furthermore, the multi-hop psychotherapeutic reasoning method enables VLMs to provide more thoughtful and empathetic suggestions, outperforming standard prompting methods.
中文: 本研究通过引入包含视觉线索的M2CoSC数据集和多层次心理推理方法,将认知重构疗法扩展至多模态领域,显著提升了视觉语言模型的心理治疗表现和共情能力。
English: This study extends cognitive reframing therapy to multimodal contexts by introducing the M2CoSC dataset with visual cues and a multi-hop reasoning approach, significantly enhancing vision-language models' therapeutic performance and empathy.

Authors:Xiangrong Zhu, Yuexiang Xie, Yi Liu, Yaliang Li, Wei Hu
Title: Knowledge Graph-Guided Retrieval Augmented Generation
Abstract:
Retrieval-augmented generation (RAG) has emerged as a promising technology for addressing hallucination issues in the responses generated by large language models (LLMs). Existing studies on RAG primarily focus on applying semantic-based approaches to retrieve isolated relevant chunks, which ignore their intrinsic relationships. In this paper, we propose a novel Knowledge Graph-Guided Retrieval Augmented Generation (KG$^2$RAG) framework that utilizes knowledge graphs (KGs) to provide fact-level relationships between chunks, improving the diversity and coherence of the retrieved results. Specifically, after performing a semantic-based retrieval to provide seed chunks, KG$^2$RAG employs a KG-guided chunk expansion process and a KG-based chunk organization process to deliver relevant and important knowledge in well-organized paragraphs. Extensive experiments conducted on the HotpotQA dataset and its variants demonstrate the advantages of KG$^2$RAG compared to existing RAG-based approaches, in terms of both response quality and retrieval quality.
中文: 本文提出KG$^2$RAG框架,通过知识图谱建立信息块间的事实级关联,提升检索结果的多样性和连贯性,在问答质量和检索效果上均优于现有方法。
English: This paper introduces KG$^2$RAG, a novel framework that enhances retrieval-augmented generation by using knowledge graphs to establish relationships between information chunks, improving the diversity and coherence of retrieved results and outperforming existing methods in response and retrieval quality.

Authors:Amin Adibi, Xu Cao, Zongliang Ji, Jivat Neet Kaur, Winston Chen, Elizabeth Healey, Brighton Nuwagira, Wenqian Ye, Geoffrey Woollard, Maxwell A Xu, Hejie Cui, Johnny Xi, Trenton Chang, Vasiliki Bikia, Nicole Zhang, Ayush Noori, Yuan Xia, Md. Belal Hossain, Hanna A. Frank, Alina Peluso, Yuan Pu, Shannon Zejiang Shen, John Wu, Adibvafa Fallahpour, Sazan Mahbub, Ross Duncan, Yuwei Zhang, Yurui Cao, Zuheng Xu, Michael Craig, Rahul G. Krishnan, Rahmatollah Beheshti, James M. Rehg, Mohammad Ehsanul Karim, Megan Coffee, Leo Anthony Celi, Jason Alan Fries, Mohsen Sadatsafavi, Dennis Shung, Shannon McWeeney, Jessica Dafflon, Sarah Jabbour
Title: Recent Advances, Applications and Open Challenges in Machine Learning for Health: Reflections from Research Roundtables at ML4H 2024 Symposium
Abstract:
The fourth Machine Learning for Health (ML4H) symposium was held in person on December 15th and 16th, 2024, in the traditional, ancestral, and unceded territories of the Musqueam, Squamish, and Tsleil-Waututh Nations in Vancouver, British Columbia, Canada. The symposium included research roundtable sessions to foster discussions between participants and senior researchers on timely and relevant topics for the ML4H community. The organization of the research roundtables at the conference involved 13 senior and 27 junior chairs across 13 tables. Each roundtable session included an invited senior chair (with substantial experience in the field), junior chairs (responsible for facilitating the discussion), and attendees from diverse backgrounds with an interest in the session's topic.
中文: 第四届健康机器学习研讨会于2024年12月在温哥华举行,通过由资深和初级主席主导的研究圆桌会议,促进了健康机器学习领域关键议题的深入讨论。
English: The fourth ML4H symposium was held in December 2024 in Vancouver, featuring research roundtables led by senior and junior chairs to facilitate discussions on key topics within the health machine learning community.

Authors:Zhongjie Ba, Yitao Zhang, Peng Cheng, Bin Gong, Xinyu Zhang, Qinglong Wang, Kui Ren
Title: Robust Watermarks Leak: Channel-Aware Feature Extraction Enables Adversarial Watermark Manipulation
Abstract:
Watermarking plays a key role in the provenance and detection of AI-generated content. While existing methods prioritize robustness against real-world distortions (e.g., JPEG compression and noise addition), we reveal a fundamental tradeoff: such robust watermarks inherently improve the redundancy of detectable patterns encoded into images, creating exploitable information leakage. To leverage this, we propose an attack framework that extracts leakage of watermark patterns through multi-channel feature learning using a pre-trained vision model. Unlike prior works requiring massive data or detector access, our method achieves both forgery and detection evasion with a single watermarked image. Extensive experiments demonstrate that our method achieves a 60\% success rate gain in detection evasion and 51\% improvement in forgery accuracy compared to state-of-the-art methods while maintaining visual fidelity. Our work exposes the robustness-stealthiness paradox: current "robust" watermarks sacrifice security for distortion resistance, providing insights for future watermark design.
中文摘要:本研究揭示了AI生成图像水印技术中稳健性与安全性之间的根本矛盾,提出一种仅需单张水印图像即可通过预训练视觉模型实现检测规避与内容伪造的双重攻击框架,实验证明该方法在规避检测和伪造准确率上分别提升60%和51%,突破了现有水印技术的安全局限。
English Summary: This study exposes a critical tradeoff in AI-generated image watermarking, where robust watermarks create detectable pattern redundancy that enables a novel attack method using single-image analysis to simultaneously evade detection and enable forgery with significantly improved success rates.

Authors:Shuning Zhang, Xin Yi, Shixuan Li, Chuye Hong, Gujun Chen, Jiarui Liu, Xueyang Wang, Yongquan Hu, Yuntao Wang, Hewu Li
Title: Actual Achieved Gain and Optimal Perceived Gain: Modeling Human Take-over Decisions Towards Automated Vehicles' Suggestions
Abstract:
Driver decision quality in take-overs is critical for effective human-Autonomous Driving System (ADS) collaboration. However, current research lacks detailed analysis of its variations. This paper introduces two metrics--Actual Achieved Gain (AAG) and Optimal Perceived Gain (OPG)--to assess decision quality, with OPG representing optimal decisions and AAG reflecting actual outcomes. Both are calculated as weighted averages of perceived gains and losses, influenced by ADS accuracy. Study 1 (N=315) used a 21-point Thurstone scale to measure perceived gains and losses-key components of AAG and OPG-across typical tasks: route selection, overtaking, and collision avoidance. Studies 2 (N=54) and 3 (N=54) modeled decision quality under varying ADS accuracy and decision time. Results show with sufficient time (>3.5s), AAG converges towards OPG, indicating rational decision-making, while limited time leads to intuitive and deterministic choices. Study 3 also linked AAG-OPG deviations to irrational behaviors. An intervention study (N=8) and a pilot (N=4) employing voice alarms and multi-modal alarms based on these deviations demonstrated AAG's potential to improve decision quality.
中文摘要:本文提出AAG和OPG指标评估接管过程中的驾驶员决策质量,研究表明充足决策时间可实现理性选择而时间压力导致直觉决策,基于这些指标的干预措施显示出提升决策质量的潜力。
English Summary: This paper introduces AAG and OPG metrics to evaluate driver decision quality during take-overs, revealing that sufficient decision time enables rational choices while time constraints lead to intuitive decisions, with interventions using these metrics showing potential for improvement.

Authors:Panqi Chen, Lei Cheng, Jianlong Li, Weichang Li, Weiqing Liu, Jiang Bian, Shikai Fang
Title: Functional Complexity-adaptive Temporal Tensor Decomposition
Abstract:
Tensor decomposition is a fundamental tool for analyzing multi-dimensional data by learning low-rank factors to represent high-order interactions. While recent works on temporal tensor decomposition have made significant progress by incorporating continuous timestamps in latent factors, they still struggle with general tensor data with continuous indexes not only in the temporal mode but also in other modes, such as spatial coordinates in climate data. Moreover, the challenge of self-adapting model complexity is largely unexplored in functional temporal tensor models, with existing methods being inapplicable in this setting. To address these limitations, we propose functional \underline{C}omplexity-\underline{A}daptive \underline{T}emporal \underline{T}ensor d\underline{E}composition (\textsc{Catte}). Our approach encodes continuous spatial indexes as learnable Fourier features and employs neural ODEs in latent space to learn the temporal trajectories of factors. To enable automatic adaptation of model complexity, we introduce a sparsity-inducing prior over the factor trajectories. We develop an efficient variational inference scheme with an analytical evidence lower bound, enabling sampling-free optimization. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that \textsc{Catte} not only reveals the underlying ranks of functional temporal tensors but also significantly outperforms existing methods in prediction performance and robustness against noise.
中文摘要:提出的CATTE模型通过引入可学习的傅里叶特征编码连续空间索引、在潜空间使用神经ODE学习时间轨迹,并采用稀疏诱导先验实现模型复杂度自适应,在揭示功能时序张量内在秩和预测性能方面显著优于现有方法。
English summary: The proposed CATTE model addresses limitations in temporal tensor decomposition by incorporating continuous spatial indexes and neural ODEs while introducing a sparsity-inducing prior for automatic model complexity adaptation, demonstrating superior performance in revealing underlying ranks and prediction accuracy.

Authors:Panqi Chen, Lei Cheng, Jianlong Li, Weichang Li, Weiqing Liu, Jiang Bian, Shikai Fang
Title: Functional Complexity-adaptive Temporal Tensor Decomposition
Abstract:
Tensor decomposition is a fundamental tool for analyzing multi-dimensional data by learning low-rank factors to represent high-order interactions. While recent works on temporal tensor decomposition have made significant progress by incorporating continuous timestamps in latent factors, they still struggle with general tensor data with continuous indexes not only in the temporal mode but also in other modes, such as spatial coordinates in climate data. Moreover, the challenge of self-adapting model complexity is largely unexplored in functional temporal tensor models, with existing methods being inapplicable in this setting. To address these limitations, we propose functional \underline{C}omplexity-\underline{A}daptive \underline{T}emporal \underline{T}ensor d\underline{E}composition (\textsc{Catte}). Our approach encodes continuous spatial indexes as learnable Fourier features and employs neural ODEs in latent space to learn the temporal trajectories of factors. To enable automatic adaptation of model complexity, we introduce a sparsity-inducing prior over the factor trajectories. We develop an efficient variational inference scheme with an analytical evidence lower bound, enabling sampling-free optimization. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that \textsc{Catte} not only reveals the underlying ranks of functional temporal tensors but also significantly outperforms existing methods in prediction performance and robustness against noise.
中文摘要:提出的CATTE模型通过引入可学习的傅里叶特征编码连续空间索引、在潜空间使用神经ODE学习时间轨迹,并采用稀疏诱导先验实现模型复杂度自适应,在揭示功能时序张量内在秩和预测性能方面显著优于现有方法。
English summary: The proposed CATTE model addresses limitations in temporal tensor decomposition by incorporating continuous spatial indexes and neural ODEs while introducing a sparsity-inducing prior for automatic model complexity adaptation, demonstrating superior performance in revealing underlying ranks and prediction accuracy.

Authors:A. Karthick kumar, S. Rathnamala, T. Vijayashanthi, M. Prabhananthakumar, Alavikunhu Panthakkan, Shadi Atalla, Wathiq Mansoor
Title: Enhanced Hybrid Deep Learning Approach for Botnet Attacks Detection in IoT Environment
Abstract:
Cyberattacks in an Internet of Things (IoT) environment can have significant impacts because of the interconnected nature of devices and systems. An attacker uses a network of compromised IoT devices in a botnet attack to carry out various harmful activities. Detecting botnet attacks poses several challenges because of the intricate and evolving nature of these threats. Botnet attacks erode trust in IoT devices and systems, undermining confidence in their security, reliability, and integrity. Deep learning techniques have significantly enhanced the detection of botnet attacks due to their ability to analyze and learn from complex patterns in data. This research proposed the stacking of Deep convolutional neural networks, Bi-Directional Long Short-Term Memory (Bi-LSTM), Bi-Directional Gated Recurrent Unit (Bi-GRU), and Recurrent Neural Networks (RNN) for botnet attacks detection. The UNSW-NB15 dataset is utilized for botnet attacks detection. According to experimental results, the proposed model accurately provides for the intricate patterns and features of botnet attacks, with a testing accuracy of 99.76%. The proposed model also identifies botnets with a high ROC-AUC curve value of 99.18%. A performance comparison of the proposed method with existing state-of-the-art models confirms its higher performance. The outcomes of this research could strengthen cyber security procedures and safeguard against new attacks.
中文摘要:本研究提出一种结合多种神经网络的深度学习堆叠模型,能有效检测物联网环境中的僵尸网络攻击,准确率高且性能优于现有方法。
English Summary: This research proposes a deep learning-based stacked model combining multiple neural networks to effectively detect botnet attacks in IoT environments, achieving high accuracy and outperforming existing methods.

Authors:Dai Shi, Kuan Yan, Lequan Lin, Yue Zeng, Ting Zhang, Dmytro Matsypura, Mark C. Gillies, Ling Zhu, Junbin Gao
Title: Graph Pseudotime Analysis and Neural Stochastic Differential Equations for Analyzing Retinal Degeneration Dynamics and Beyond
Abstract:
Understanding disease progression at the molecular pathway level usually requires capturing both structural dependencies between pathways and the temporal dynamics of disease evolution. In this work, we solve the former challenge by developing a biologically informed graph-forming method to efficiently construct pathway graphs for subjects from our newly curated JR5558 mouse transcriptomics dataset. We then develop Graph-level Pseudotime Analysis (GPA) to infer graph-level trajectories that reveal how disease progresses at the population level, rather than in individual subjects. Based on the trajectories estimated by GPA, we identify the most sensitive pathways that drive disease stage transitions. In addition, we measure changes in pathway features using neural stochastic differential equations (SDEs), which enables us to formally define and compute pathway stability and disease bifurcation points (points of no return), two fundamental problems in disease progression research. We further extend our theory to the case when pathways can interact with each other, enabling a more comprehensive and multi-faceted characterization of disease phenotypes. The comprehensive experimental results demonstrate the effectiveness of our framework in reconstructing the dynamics of the pathway, identifying critical transitions, and providing novel insights into the mechanistic understanding of disease evolution.
中文: 本研究通过生物信息学图构建方法和图水平伪时间分析(GPA),构建通路图并推断群体水平的疾病轨迹,利用神经随机微分方程识别关键转折点和分叉点,从而深入理解疾病演化的机制。
English: This study introduces a biologically informed graph-forming method and Graph-level Pseudotime Analysis (GPA) to construct pathway graphs and infer population-level disease trajectories, identifying critical transitions and bifurcation points through neural SDEs for a mechanistic understanding of disease evolution.

Authors:Runlong Yu, Chonghao Qiu, Robert Ladwig, Paul Hanson, Yiqun Xie, Xiaowei Jia
Title: Physics-Guided Foundation Model for Scientific Discovery: An Application to Aquatic Science
Abstract:
Physics-guided machine learning (PGML) has become a prevalent approach in studying scientific systems due to its ability to integrate scientific theories for enhancing machine learning (ML) models. However, most PGML approaches are tailored to isolated and relatively simple tasks, which limits their applicability to complex systems involving multiple interacting processes and numerous influencing features. In this paper, we propose a \textit{\textbf{P}hysics-\textbf{G}uided \textbf{F}oundation \textbf{M}odel (\textbf{PGFM})} that combines pre-trained ML models and physics-based models and leverages their complementary strengths to improve the modeling of multiple coupled processes. To effectively conduct pre-training, we construct a simulated environmental system that encompasses a wide range of influencing features and various simulated variables generated by physics-based models. The model is pre-trained in this system to adaptively select important feature interactions guided by multi-task objectives. We then fine-tune the model for each specific task using true observations, while maintaining consistency with established physical theories, such as the principles of mass and energy conservation. We demonstrate the effectiveness of this methodology in modeling water temperature and dissolved oxygen dynamics in real-world lakes. The proposed PGFM is also broadly applicable to a range of scientific fields where physics-based models are being used.
中文摘要:本文提出了一种物理引导基础模型(PGFM),通过结合预训练机器学习与物理模型来改进多耦合过程的建模,在保持物理规律一致性的同时,成功应用于真实湖泊的水温与溶解氧动态模拟。
English Summary: This paper introduces a Physics-Guided Foundation Model (PGFM) that integrates pre-trained machine learning with physics-based models to better simulate complex multi-process systems, demonstrating its effectiveness in modeling lake water quality while maintaining physical consistency.

Authors:Tenglong Liu, Jianxiong Li, Yinan Zheng, Haoyi Niu, Yixing Lan, Xin Xu, Xianyuan Zhan
Title: Skill Expansion and Composition in Parameter Space
Abstract:
Humans excel at reusing prior knowledge to address new challenges and developing skills while solving problems. This paradigm becomes increasingly popular in the development of autonomous agents, as it develops systems that can self-evolve in response to new challenges like human beings. However, previous methods suffer from limited training efficiency when expanding new skills and fail to fully leverage prior knowledge to facilitate new task learning. In this paper, we propose Parametric Skill Expansion and Composition (PSEC), a new framework designed to iteratively evolve the agents' capabilities and efficiently address new challenges by maintaining a manageable skill library. This library can progressively integrate skill primitives as plug-and-play Low-Rank Adaptation (LoRA) modules in parameter-efficient finetuning, facilitating efficient and flexible skill expansion. This structure also enables the direct skill compositions in parameter space by merging LoRA modules that encode different skills, leveraging shared information across skills to effectively program new skills. Based on this, we propose a context-aware module to dynamically activate different skills to collaboratively handle new tasks. Empowering diverse applications including multi-objective composition, dynamics shift, and continual policy shift, the results on D4RL, DSRL benchmarks, and the DeepMind Control Suite show that PSEC exhibits superior capacity to leverage prior knowledge to efficiently tackle new challenges, as well as expand its skill libraries to evolve the capabilities. Project website: https://ltlhuuu.github.io/PSEC/.
中文:PSEC框架通过可插拔的LoRA模块库使自主智能体能够高效扩展和组合技能,动态激活这些技能以应对新挑战,同时利用先验知识实现卓越性能。
English: The PSEC framework enables autonomous agents to efficiently expand and compose skills using a library of plug-and-play LoRA modules, dynamically activating them to tackle new challenges while leveraging prior knowledge for superior performance.

Authors:Qingsong Yan, Qiang Wang, Kaiyong Zhao, Jie Chen, Bo Li, Xiaowen Chu, Fei Deng
Title: SphereFusion: Efficient Panorama Depth Estimation via Gated Fusion
Abstract:
Due to the rapid development of panorama cameras, the task of estimating panorama depth has attracted significant attention from the computer vision community, especially in applications such as robot sensing and autonomous driving. However, existing methods relying on different projection formats often encounter challenges, either struggling with distortion and discontinuity in the case of equirectangular, cubemap, and tangent projections, or experiencing a loss of texture details with the spherical projection. To tackle these concerns, we present SphereFusion, an end-to-end framework that combines the strengths of various projection methods. Specifically, SphereFusion initially employs 2D image convolution and mesh operations to extract two distinct types of features from the panorama image in both equirectangular and spherical projection domains. These features are then projected onto the spherical domain, where a gate fusion module selects the most reliable features for fusion. Finally, SphereFusion estimates panorama depth within the spherical domain. Meanwhile, SphereFusion employs a cache strategy to improve the efficiency of mesh operation. Extensive experiments on three public panorama datasets demonstrate that SphereFusion achieves competitive results with other state-of-the-art methods, while presenting the fastest inference speed at only 17 ms on a 512$\times$1024 panorama image.
Chinese: SphereFusion是一种端到端框架,通过门控融合模块结合等距柱面和球面投影的特征,解决了全景深度估计中的难题,在保持最快17毫秒推理速度的同时达到了最先进的性能。
English: SphereFusion is an end-to-end framework that addresses panorama depth estimation challenges by fusing features from equirectangular and spherical projections using a gate fusion module, achieving state-of-the-art results with the fastest inference speed of 17 ms.

Authors:Yijun Yang, Lichao Wang, Xiao Yang, Lanqing Hong, Jun Zhu
Title: Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails
Abstract:
Vision Large Language Models (VLLMs) integrate visual data processing, expanding their real-world applications, but also increasing the risk of generating unsafe responses. In response, leading companies have implemented Multi-Layered safety defenses, including alignment training, safety system prompts, and content moderation. However, their effectiveness against sophisticated adversarial attacks remains largely unexplored. In this paper, we propose MultiFaceted Attack, a novel attack framework designed to systematically bypass Multi-Layered Defenses in VLLMs. It comprises three complementary attack facets: Visual Attack that exploits the multimodal nature of VLLMs to inject toxic system prompts through images; Alignment Breaking Attack that manipulates the model's alignment mechanism to prioritize the generation of contrasting responses; and Adversarial Signature that deceives content moderators by strategically placing misleading information at the end of the response. Extensive evaluations on eight commercial VLLMs in a black-box setting demonstrate that MultiFaceted Attack achieves a 61.56% attack success rate, surpassing state-of-the-art methods by at least 42.18%.
Chinese: 本文提出多层面攻击框架,通过视觉攻击、对齐破坏攻击和对抗签名三种策略系统性地绕过视觉大语言模型的多层安全防御,在黑盒测试中达到61.56%的攻击成功率。
English: This paper introduces MultiFaceted Attack, a novel framework that systematically bypasses multi-layered safety defenses in Vision Large Language Models through three coordinated attack strategies, achieving a 61.56% success rate in black-box evaluations.

Authors:Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Yarin Gal, Furong Huang, Dylan Hadfield-Menell
Title: Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Abstract:
Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. However, this approach suffers from two limitations. First, input-output evaluations cannot fully evaluate realistic risks from open-weight models. Second, the behaviors identified during any particular input-output evaluation can only lower-bound the model's worst-possible-case input-output behavior. As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks which allow for modifications to latent activations or weights. We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks. In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning. Together, these results highlight the difficulty of suppressing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations than input-space attacks alone.
中文: 当前基于输入输出的大语言模型风险评估存在局限,因此提出通过修改潜在参数的模型篡改攻击作为更严谨的补充方法,揭示了安全措施的脆弱性和低维鲁棒性子空间的存在。
English: Current LLM risk evaluations relying on input-output methods are limited, so model tampering attacks that modify latent parameters are proposed as a more rigorous complementary approach, revealing both the fragility of safety measures and the existence of low-dimensional robustness subspaces.

Authors:Haohao Zhu, Junyu Lu, Zeyuan Zeng, Zewen Bai, Xiaokun Zhang, Liang Yang, Hongfei Lin
Title: Commonality and Individuality! Integrating Humor Commonality with Speaker Individuality for Humor Recognition
Abstract:
Humor recognition aims to identify whether a specific speaker's text is humorous. Current methods for humor recognition mainly suffer from two limitations: (1) they solely focus on one aspect of humor commonalities, ignoring the multifaceted nature of humor; and (2) they typically overlook the critical role of speaker individuality, which is essential for a comprehensive understanding of humor expressions. To bridge these gaps, we introduce the Commonality and Individuality Incorporated Network for Humor Recognition (CIHR), a novel model designed to enhance humor recognition by integrating multifaceted humor commonalities with the distinctive individuality of speakers. The CIHR features a Humor Commonality Analysis module that explores various perspectives of multifaceted humor commonality within user texts, and a Speaker Individuality Extraction module that captures both static and dynamic aspects of a speaker's profile to accurately model their distinctive individuality. Additionally, Static and Dynamic Fusion modules are introduced to effectively incorporate the humor commonality with speaker's individuality in the humor recognition process. Extensive experiments demonstrate the effectiveness of CIHR, underscoring the importance of concurrently addressing both multifaceted humor commonality and distinctive speaker individuality in humor recognition.
中文:CIHR模型通过整合多方面的幽默共性和说话者个性来增强幽默识别,解决了现有方法忽视这些关键因素的问题。
English: The CIHR model enhances humor recognition by integrating multifaceted humor commonalities with speaker individuality, addressing limitations in current methods that overlook these aspects.

Authors:Zeren Luo, Zifan Peng, Yule Liu, Zhen Sun, Mingchen Li, Jingyi Zheng, Xinlei He
Title: Unsafe LLM-Based Search: Quantitative Analysis and Mitigation of Safety Risks in AI Web Search
Abstract:
Recent advancements in Large Language Models (LLMs) have significantly enhanced the capabilities of AI-Powered Search Engines (AIPSEs), offering precise and efficient responses by integrating external databases with pre-existing knowledge. However, we observe that these AIPSEs raise risks such as quoting malicious content or citing malicious websites, leading to harmful or unverified information dissemination. In this study, we conduct the first safety risk quantification on seven production AIPSEs by systematically defining the threat model, risk type, and evaluating responses to various query types. With data collected from PhishTank, ThreatBook, and LevelBlue, our findings reveal that AIPSEs frequently generate harmful content that contains malicious URLs even with benign queries (e.g., with benign keywords). We also observe that directly querying a URL will increase the number of main risk-inclusive responses, while querying with natural language will slightly mitigate such risk. Compared to traditional search engines, AIPSEs outperform in both utility and safety. We further perform two case studies on online document spoofing and phishing to show the ease of deceiving AIPSEs in the real-world setting. To mitigate these risks, we develop an agent-based defense with a GPT-4.1-based content refinement tool and a URL detector. Our evaluation shows that our defense can effectively reduce the risk, with only a minor cost of reducing available information by approximately 10.7%. Our research highlights the urgent need for robust safety measures in AIPSEs.
中文: 大型语言模型的最新进展提升了AI驱动搜索引擎的能力,但也带来了安全风险,如生成有害内容和引用恶意网址,为此开发了一种有效的防御机制,能在仅减少约10.7%可用信息的情况下显著降低风险。
English: Recent advancements in Large Language Models have enhanced AI-Powered Search Engines, but they pose safety risks by generating harmful content and citing malicious URLs, prompting the development of an effective defense mechanism that reduces these risks with minimal information loss.

Authors:Ziyuan Yang, Ming Yan, Yi Zhang, Joey Tianyi Zhou
Title: Dark Distillation: Backdooring Distilled Datasets without Accessing Raw Data
Abstract:
Dataset distillation (DD) enhances training efficiency and reduces bandwidth by condensing large datasets into smaller synthetic ones. It enables models to achieve performance comparable to those trained on the raw full dataset and has become a widely adopted method for data sharing. However, security concerns in DD remain underexplored. Existing studies typically assume that malicious behavior originates from dataset owners during the initial distillation process, where backdoors are injected into raw datasets. In contrast, this work is the first to address a more realistic and concerning threat: attackers may intercept the dataset distribution process, inject backdoors into the distilled datasets, and redistribute them to users. While distilled datasets were previously considered resistant to backdoor attacks, we demonstrate that they remain vulnerable to such attacks. Furthermore, we show that attackers do not even require access to any raw data to inject the backdoors successfully. Specifically, our approach reconstructs conceptual archetypes for each class from the model trained on the distilled dataset. Backdoors are then injected into these archetypes to update the distilled dataset. Moreover, we ensure the updated dataset not only retains the backdoor but also preserves the original optimization trajectory, thus maintaining the knowledge of the raw dataset. To achieve this, a hybrid loss is designed to integrate backdoor information along the benign optimization trajectory, ensuring that previously learned information is not forgotten. Extensive experiments demonstrate that distilled datasets are highly vulnerable to backdoor attacks, with risks pervasive across various raw datasets, distillation methods, and downstream training strategies. Moreover, our attack method is efficient, capable of synthesizing a malicious distilled dataset in under one minute in certain cases.
中文: 本研究揭示蒸馏数据集极易受后门攻击,攻击者无需原始数据即可注入恶意信息,并提出一种在保持原始知识的同时高效破坏数据集安全的方法。
English: This study reveals that distilled datasets are highly vulnerable to backdoor attacks, where attackers can inject malicious data without accessing raw datasets, and demonstrates an efficient method to compromise them while preserving original knowledge.

Authors:Lingshun Kong, Jiawei Zhang, Dongqing Zou, Jimmy Ren, Xiaohe Wu, Jiangxin Dong, Jinshan Pan
Title: DeblurDiff: Real-World Image Deblurring with Generative Diffusion Models
Abstract:
Diffusion models have achieved significant progress in image generation. The pre-trained Stable Diffusion (SD) models are helpful for image deblurring by providing clear image priors. However, directly using a blurry image or pre-deblurred one as a conditional control for SD will either hinder accurate structure extraction or make the results overly dependent on the deblurring network. In this work, we propose a Latent Kernel Prediction Network (LKPN) to achieve robust real-world image deblurring. Specifically, we co-train the LKPN in latent space with conditional diffusion. The LKPN learns a spatially variant kernel to guide the restoration of sharp images in the latent space. By applying element-wise adaptive convolution (EAC), the learned kernel is utilized to adaptively process the input feature, effectively preserving the structural information of the input. This process thereby more effectively guides the generative process of Stable Diffusion (SD), enhancing both the deblurring efficacy and the quality of detail reconstruction. Moreover, the results at each diffusion step are utilized to iteratively estimate the kernels in LKPN to better restore the sharp latent by EAC. This iterative refinement enhances the accuracy and robustness of the deblurring process. Extensive experimental results demonstrate that the proposed method outperforms state-of-the-art image deblurring methods on both benchmark and real-world images.
中文: 本文提出潜在核预测网络(LKPN),通过在潜在空间与条件扩散协同训练来预测空间变化核,利用逐元素自适应卷积引导稳定扩散模型实现鲁棒的图像去模糊和细节重建增强。
English: This paper introduces a Latent Kernel Prediction Network (LKPN) that co-trains with conditional diffusion in latent space to predict spatially variant kernels, which guide Stable Diffusion through element-wise adaptive convolution for robust real-world image deblurring and enhanced detail reconstruction.

Authors:Li Sun, Zhenhao Huang, Suyang Zhou, Qiqi Wan, Hao Peng, Philip Yu
Title: RiemannGFM: Learning a Graph Foundation Model from Riemannian Geometry
Abstract:
The foundation model has heralded a new era in artificial intelligence, pretraining a single model to offer cross-domain transferability on different datasets. Graph neural networks excel at learning graph data, the omnipresent non-Euclidean structure, but often lack the generalization capacity. Hence, graph foundation model is drawing increasing attention, and recent efforts have been made to leverage Large Language Models. On the one hand, existing studies primarily focus on text-attributed graphs, while a wider range of real graphs do not contain fruitful textual attributes. On the other hand, the sequential graph description tailored for the Large Language Model neglects the structural complexity, which is a predominant characteristic of the graph. Such limitations motivate an important question: Can we go beyond Large Language Models, and pretrain a universal model to learn the structural knowledge for any graph? The answer in the language or vision domain is a shared vocabulary. We observe the fact that there also exist shared substructures underlying graph domain, and thereby open a new opportunity of graph foundation model with structural vocabulary. The key innovation is the discovery of a simple yet effective structural vocabulary of trees and cycles, and we explore its inherent connection to Riemannian geometry. Herein, we present a universal pretraining model, RiemannGFM. Concretely, we first construct a novel product bundle to incorporate the diverse geometries of the vocabulary. Then, on this constructed space, we stack Riemannian layers where the structural vocabulary, regardless of specific graph, is learned in Riemannian manifold offering cross-domain transferability. Extensive experiments show the effectiveness of RiemannGFM on a diversity of real graphs.
中文: 图基础模型RiemannGFM通过引入树和循环的结构词汇表,在黎曼流形中学习通用图知识,无需依赖文本属性或大语言模型即可实现跨领域迁移能力。
English: The graph foundation model, RiemannGFM, introduces a structural vocabulary of trees and cycles to learn universal graph knowledge, enabling cross-domain transferability through Riemannian geometry without relying on textual attributes or large language models.

Authors:Lei Ding, Danfeng Hong, Maofan Zhao, Hongruixuan Chen, Chenyu Li, Jie Deng, Naoto Yokoya, Lorenzo Bruzzone, Jocelyn Chanussot
Title: A Survey of Sample-Efficient Deep Learning for Change Detection in Remote Sensing: Tasks, Strategies, and Challenges
Abstract:
In the last decade, the rapid development of deep learning (DL) has made it possible to perform automatic, accurate, and robust Change Detection (CD) on large volumes of Remote Sensing Images (RSIs). However, despite advances in CD methods, their practical application in real-world contexts remains limited due to the diverse input data and the applicational context. For example, the collected RSIs can be time-series observations, and more informative results are required to indicate the time of change or the specific change category. Moreover, training a Deep Neural Network (DNN) requires a massive amount of training samples, whereas in many cases these samples are difficult to collect. To address these challenges, various specific CD methods have been developed considering different application scenarios and training resources. Additionally, recent advancements in image generation, self-supervision, and visual foundation models (VFMs) have opened up new approaches to address the 'data-hungry' issue of DL-based CD. The development of these methods in broader application scenarios requires further investigation and discussion. Therefore, this article summarizes the literature methods for different CD tasks and the available strategies and techniques to train and deploy DL-based CD methods in sample-limited scenarios. We expect that this survey can provide new insights and inspiration for researchers in this field to develop more effective CD methods that can be applied in a wider range of contexts.
中文: 深度学习推动了遥感变化检测的进步,但实际应用因数据多样性和训练样本不足而受限,促使采用图像生成和自监督等新方法以提升效能。
English: Deep learning has advanced change detection in remote sensing, yet practical applications face challenges from diverse data needs and limited training samples, prompting new approaches like image generation and self-supervision to enhance effectiveness.

Authors:Robert J. Joyce, Derek Everett, Maya Fuchs, Edward Raff, James Holt
Title: ClarAVy: A Tool for Scalable and Accurate Malware Family Labeling
Abstract:
Determining the family to which a malicious file belongs is an essential component of cyberattack investigation, attribution, and remediation. Performing this task manually is time consuming and requires expert knowledge. Automated tools using that label malware using antivirus detections lack accuracy and/or scalability, making them insufficient for real-world applications. Three pervasive shortcomings in these tools are responsible: (1) incorrect parsing of antivirus detections, (2) errors during family alias resolution, and (3) an inappropriate antivirus aggregation strategy. To address each of these, we created our own malware family labeling tool called ClarAVy. ClarAVy utilizes a Variational Bayesian approach to aggregate detections from a collection of antivirus products into accurate family labels. Our tool scales to enormous malware datasets, and we evaluated it by labeling $\approx$40 million malicious files. ClarAVy has 8 and 12 percentage points higher accuracy than the prior leading tool in labeling the MOTIF and MalPedia datasets, respectively.
Chinese: ClarAVy 是一款自动化的恶意软件家族标记工具,采用变分贝叶斯方法精确整合防病毒检测结果,有效克服了现有工具的缺陷,并在大规模数据集上展现出更高的准确性和可扩展性。
English: ClarAVy is an automated malware family labeling tool that uses a Variational Bayesian approach to accurately aggregate antivirus detections, addressing key shortcomings in existing methods and demonstrating higher accuracy and scalability on large datasets.

Authors:Steve Azzolin, Sagar Malhotra, Andrea Passerini, Stefano Teso
Title: Beyond Topological Self-Explainable GNNs: A Formal Explainability Perspective
Abstract:
Self-Explainable Graph Neural Networks (SE-GNNs) are popular explainable-by-design GNNs, but their explanations' properties and limitations are not well understood. Our first contribution fills this gap by formalizing the explanations extracted by some popular SE-GNNs, referred to as Minimal Explanations (MEs), and comparing them to established notions of explanations, namely Prime Implicant (PI) and faithful explanations. Our analysis reveals that MEs match PI explanations for a restricted but significant family of tasks. In general, however, they can be less informative than PI explanations and are surprisingly misaligned with widely accepted notions of faithfulness. Although faithful and PI explanations are informative, they are intractable to find and we show that they can be prohibitively large. Given these observations, a natural choice is to augment SE-GNNs with alternative modalities of explanations taking care of SE-GNNs' limitations. To this end, we propose Dual-Channel GNNs that integrate a white-box rule extractor and a standard SE-GNN, adaptively combining both channels. Our experiments show that even a simple instantiation of Dual-Channel GNNs can recover succinct rules and perform on par or better than widely used SE-GNNs.
中文:自解释图神经网络(SE-GNN)的解释存在局限性,为此提出双通道图神经网络,通过结合白盒规则提取器和SE-GNN来提升性能并生成更简洁的规则。
English: Self-Explainable Graph Neural Networks (SE-GNNs) have limitations in their explanations, which are addressed by proposing Dual-Channel GNNs that combine a white-box rule extractor with SE-GNNs to enhance performance and generate more concise rules.

Authors:Valentin De Bortoli, Alexandre Galashov, J. Swaroop Guntupalli, Guangyao Zhou, Kevin Murphy, Arthur Gretton, Arnaud Doucet
Title: Distributional Diffusion Models with Scoring Rules
Abstract:
Diffusion models generate high-quality synthetic data. They operate by defining a continuous-time forward process which gradually adds Gaussian noise to data until fully corrupted. The corresponding reverse process progressively "denoises" a Gaussian sample into a sample from the data distribution. However, generating high-quality outputs requires many discretization steps to obtain a faithful approximation of the reverse process. This is expensive and has motivated the development of many acceleration methods. We propose to accomplish sample generation by learning the posterior {\em distribution} of clean data samples given their noisy versions, instead of only the mean of this distribution. This allows us to sample from the probability transitions of the reverse process on a coarse time scale, significantly accelerating inference with minimal degradation of the quality of the output. This is accomplished by replacing the standard regression loss used to estimate conditional means with a scoring rule. We validate our method on image and robot trajectory generation, where we consistently outperform standard diffusion models at few discretization steps.
中文: 扩散模型通过用评分规则替代回归损失,学习含噪数据对应的干净数据的完整后验分布,从而在粗时间尺度上实现快速推理,且输出质量损失极小。
English: Diffusion models can be accelerated by learning the full posterior distribution of clean data given noisy inputs, using a scoring rule instead of regression, which enables faster inference with minimal quality loss.

Authors:Shuchen Wu, Stephan Alaniz, Eric Schulz, Zeynep Akata
Title: Discovering Chunks in Neural Embeddings for Interpretability
Abstract:
Understanding neural networks is challenging due to their high-dimensional, interacting components. Inspired by human cognition, which processes complex sensory data by chunking it into recurring entities, we propose leveraging this principle to interpret artificial neural population activities. Biological and artificial intelligence share the challenge of learning from structured, naturalistic data, and we hypothesize that the cognitive mechanism of chunking can provide insights into artificial systems. We first demonstrate this concept in recurrent neural networks (RNNs) trained on artificial sequences with imposed regularities, observing that their hidden states reflect these patterns, which can be extracted as a dictionary of chunks that influence network responses. Extending this to large language models (LLMs) like LLaMA, we identify similar recurring embedding states corresponding to concepts in the input, with perturbations to these states activating or inhibiting the associated concepts. By exploring methods to extract dictionaries of identifiable chunks across neural embeddings of varying complexity, our findings introduce a new framework for interpreting neural networks, framing their population activity as structured reflections of the data they process.
中文摘要:该研究提出一种认知“组块化”框架,通过识别隐藏状态中的重复模式来解读神经网络,并在RNN和LLM中验证了其提取概念词典的有效性。
English Summary: The study proposes a cognitive "chunking" framework to interpret neural networks by identifying recurring patterns in hidden states, demonstrating its application in RNNs and LLMs to extract meaningful concept dictionaries.

Authors:Stephen Casper, Luke Bailey, Rosco Hunter, Carson Ezell, Emma Cabalé, Michael Gerovitch, Stewart Slocum, Kevin Wei, Nikola Jurkovic, Ariba Khan, Phillip J. K. Christoffersen, A. Pinar Ozisik, Rakshit Trivedi, Dylan Hadfield-Menell, Noam Kolt
Title: The AI Agent Index
Abstract:
Leading AI developers and startups are increasingly deploying agentic AI systems that can plan and execute complex tasks with limited human involvement. However, there is currently no structured framework for documenting the technical components, intended uses, and safety features of agentic systems. To fill this gap, we introduce the AI Agent Index, the first public database to document information about currently deployed agentic AI systems. For each system that meets the criteria for inclusion in the index, we document the system's components (e.g., base model, reasoning implementation, tool use), application domains (e.g., computer use, software engineering), and risk management practices (e.g., evaluation results, guardrails), based on publicly available information and correspondence with developers. We find that while developers generally provide ample information regarding the capabilities and applications of agentic systems, they currently provide limited information regarding safety and risk management practices. The AI Agent Index is available online at https://aiagentindex.mit.edu/
中文: 领先的AI开发者和初创公司正越来越多地部署能够自主规划和执行复杂任务的智能体系统,但目前缺乏记录其技术组件、应用领域及安全措施的结构化框架,为此我们推出了首个公开数据库——AI智能体索引,以填补这一空白。
English: Leading AI developers are increasingly deploying autonomous agentic systems that handle complex tasks with minimal human oversight, yet there is no standardized framework for documenting their technical details, applications, and safety measures, prompting the creation of the AI Agent Index as the first public database to fill this gap.

Authors:Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar
Title: AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
Abstract:
Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), often produce out-of-distribution or noisy inputs, leading to misalignment between the modalities. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where scanned document images must be accurately mapped to their textual content. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods. We provide further analysis demonstrating improved vision-text feature alignment and robustness to noise.
中文摘要:AlignVLM提出了一种新颖的视觉语言对齐方法,通过将视觉特征映射到LLM文本嵌入的加权平均值,利用语言先验知识,在文档理解任务中实现了最先进的性能并增强了抗噪鲁棒性。
English Summary: AlignVLM introduces a novel vision-language alignment method that maps visual features to a weighted average of LLM text embeddings, leveraging linguistic priors to achieve state-of-the-art performance and enhanced robustness in document understanding tasks.

Authors:Marco Arazzi, Davide Ligari, Serena Nicolazzo, Antonino Nocera
Title: Augmented Knowledge Graph Querying leveraging LLMs
Abstract:
Adopting Knowledge Graphs (KGs) as a structured, semantic-oriented, data representation model has significantly improved data integration, reasoning, and querying capabilities across different domains. This is especially true in modern scenarios such as Industry 5.0, in which the integration of data produced by humans, smart devices, and production processes plays a crucial role. However, the management, retrieval, and visualization of data from a KG using formal query languages can be difficult for non-expert users due to their technical complexity, thus limiting their usage inside industrial environments. For this reason, we introduce SparqLLM, a framework that utilizes a Retrieval-Augmented Generation (RAG) solution, to enhance the querying of Knowledge Graphs (KGs). SparqLLM executes the Extract, Transform, and Load (ETL) pipeline to construct KGs from raw data. It also features a natural language interface powered by Large Language Models (LLMs) to enable automatic SPARQL query generation. By integrating template-based methods as retrieved-context for the LLM, SparqLLM enhances query reliability and reduces semantic errors, ensuring more accurate and efficient KG interactions. Moreover, to improve usability, the system incorporates a dynamic visualization dashboard that adapts to the structure of the retrieved data, presenting the query results in an intuitive format. Rigorous experimental evaluations demonstrate that SparqLLM achieves high query accuracy, improved robustness, and user-friendly interaction with KGs, establishing it as a scalable solution to access semantic data.
中文: SparqLLM框架采用检索增强生成技术和自然语言界面,简化知识图谱的查询过程,实现自动生成SPARQL查询和动态可视化,在工业环境中显著提升交互准确性与易用性。
English: SparqLLM is a framework that uses Retrieval-Augmented Generation and a natural language interface to simplify querying Knowledge Graphs, enabling automatic SPARQL generation and dynamic visualization for improved accuracy and usability in industrial settings.

Authors:Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi
Title: ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
Abstract:
We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty. Our results reveal a significant decline in accuracy as problem complexity grows -- a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.
中文: 本研究通过ZebraLogic框架评估大语言模型在复杂逻辑谜题上的推理能力,发现即使扩大模型规模或采用增强推理策略,准确性仍随问题复杂度上升而显著下降,揭示了"复杂性诅咒"现象。
English: This study introduces ZebraLogic to evaluate large language models' logical reasoning on complex puzzles, revealing a "curse of complexity" where accuracy declines with difficulty despite model scaling and enhanced inference methods.

Authors:Kanika Goswami, Puneet Mathur, Ryan Rossi, Franck Dernoncourt
Title: ChartCitor: Multi-Agent Framework for Fine-Grained Chart Visual Attribution
Abstract:
Large Language Models (LLMs) can perform chart question-answering tasks but often generate unverified hallucinated responses. Existing answer attribution methods struggle to ground responses in source charts due to limited visual-semantic context, complex visual-text alignment requirements, and difficulties in bounding box prediction across complex layouts. We present ChartCitor, a multi-agent framework that provides fine-grained bounding box citations by identifying supporting evidence within chart images. The system orchestrates LLM agents to perform chart-to-table extraction, answer reformulation, table augmentation, evidence retrieval through pre-filtering and re-ranking, and table-to-chart mapping. ChartCitor outperforms existing baselines across different chart types. Qualitative user studies show that ChartCitor helps increase user trust in Generative AI by providing enhanced explainability for LLM-assisted chart QA and enables professionals to be more productive.
Chinese: ChartCitor是一个多智能体框架,通过提供细粒度的边界框引用,改进了图表问答任务,其性能优于现有方法,并通过增强可解释性提高了用户信任度。
English: ChartCitor is a multi-agent framework that improves chart question-answering by providing fine-grained bounding box citations, outperforming existing methods and increasing user trust through enhanced explainability.

Authors:Kanika Goswami, Puneet Mathur, Ryan Rossi, Franck Dernoncourt
Title: PlotGen: Multi-Agent LLM-based Scientific Data Visualization via Multimodal Feedback
Abstract:
Scientific data visualization is pivotal for transforming raw data into comprehensible visual representations, enabling pattern recognition, forecasting, and the presentation of data-driven insights. However, novice users often face difficulties due to the complexity of selecting appropriate tools and mastering visualization techniques. Large Language Models (LLMs) have recently demonstrated potential in assisting code generation, though they struggle with accuracy and require iterative debugging. In this paper, we propose PlotGen, a novel multi-agent framework aimed at automating the creation of precise scientific visualizations. PlotGen orchestrates multiple LLM-based agents, including a Query Planning Agent that breaks down complex user requests into executable steps, a Code Generation Agent that converts pseudocode into executable Python code, and three retrieval feedback agents - a Numeric Feedback Agent, a Lexical Feedback Agent, and a Visual Feedback Agent - that leverage multimodal LLMs to iteratively refine the data accuracy, textual labels, and visual correctness of generated plots via self-reflection. Extensive experiments show that PlotGen outperforms strong baselines, achieving a 4-6 percent improvement on the MatPlotBench dataset, leading to enhanced user trust in LLM-generated visualizations and improved novice productivity due to a reduction in debugging time needed for plot errors.
中文摘要:PlotGen是一种多智能体框架,通过分解用户查询、生成代码并利用多模态反馈迭代优化绘图精度,实现了科学可视化的自动化,在MatPlotBench数据集上性能超越基线4-6%,同时增强了用户信任并提升了新手效率。
English Summary: PlotGen is a multi-agent framework that automates scientific visualization by decomposing user queries, generating code, and using multimodal feedback to iteratively refine plot accuracy, outperforming baselines by 4-6% while boosting user trust and novice productivity.

Authors:Shengyu Feng, Jaehyung Kim, Yiming Yang, Joseph Boudreau, Tasnuva Chowdhury, Adolfy Hoisie, Raees Khan, Ozgur O. Kilic, Scott Klasky, Tatiana Korchuganova, Paul Nilsson, Verena Ingrid Martinez Outschoorn, David K. Park, Norbert Podhorszki, Yihui Ren, Frederic Suter, Sairam Sri Vatsavai, Wei Yang, Shinjae Yoo, Tadashi Maeno, Alexei Klimentov
Title: Alternative Mixed Integer Linear Programming Optimization for Joint Job Scheduling and Data Allocation in Grid Computing
Abstract:
This paper presents a novel approach to the joint optimization of job scheduling and data allocation in grid computing environments. We formulate this joint optimization problem as a mixed integer quadratically constrained program. To tackle the nonlinearity in the constraint, we alternatively fix a subset of decision variables and optimize the remaining ones via Mixed Integer Linear Programming (MILP). We solve the MILP problem at each iteration via an off-the-shelf MILP solver. Our experimental results show that our method significantly outperforms existing heuristic methods, employing either independent optimization or joint optimization strategies. We have also verified the generalization ability of our method over grid environments with various sizes and its high robustness to the algorithm hyper-parameters.
本文提出了一种网格计算中作业调度与数据分配的联合优化方法,采用混合整数规划方法,相比现有方法展现出更优的性能和鲁棒性。
This paper introduces a joint optimization method for job scheduling and data allocation in grid computing using a mixed integer programming approach, which demonstrates superior performance and robustness compared to existing methods.

Authors:Jonathan Will, Lauritz Thamsen, Jonathan Bader, Odej Kao
Title: Flora: Efficient Cloud Resource Selection for Big Data Processing via Job Classification
Abstract:
Distributed dataflow systems like Spark and Flink enable data-parallel processing of large datasets on clusters of cloud resources. Yet, selecting appropriate computational resources for dataflow jobs is often challenging. For efficient execution, individual resource allocations, such as memory and CPU cores, must meet the specific resource demands of the job. Meanwhile, the choices of cloud configurations are often plentiful, especially in public clouds, and the current cost of the available resource options can fluctuate. Addressing this challenge, we present Flora, a low-overhead approach to cost-optimizing cloud cluster configurations for big data processing. Flora lets users categorize jobs according to their data access patterns and derives suitable cluster resource configurations from executions of test jobs of the same category, considering current resource costs. In our evaluation on a new dataset comprising 180 Spark job executions on Google Cloud, Flora's cluster resource selections exhibit an average deviation below 6% from the most cost-optimal solution, with a maximum deviation below 24%.
Chinese: Flora 是一种低开销方法,通过根据数据访问模式对作业进行分类并推导出成本优化的资源配置,在大数据处理中实现了平均偏离最优成本不到6%的效果。
English: Flora is a low-overhead method that optimizes cloud cluster configurations for big data processing by categorizing jobs based on data access patterns and deriving cost-effective resource settings, achieving within 6% of the optimal cost on average in evaluations.

Authors:Qinwei Ma, Jingzhe Shi, Can Jin, Jenq-Neng Hwang, Serge Belongie, Lei Li
Title: Gradient Imbalance in Direct Preference Optimization
Abstract:
Direct Preference Optimization (DPO) has been proposed as a promising alternative to Proximal Policy Optimization (PPO) based Reinforcement Learning with Human Feedback (RLHF). However, empirical evaluations consistently reveal suboptimal performance in DPO compared to common RLHF pipelines. In this work, we conduct a systematic analysis of DPO's training dynamics and identify gradient imbalance as a critical limitation. We demonstrate theoretically and empirically that this imbalance perturbs optimization trajectories, destabilizes learning, and induces suboptimal convergence. To address this issue, we propose Balanced-DPO, a simple yet effective modification to the DPO objective that introduces a computationally efficient gradient reweighting mechanism. Our experiments demonstrate the effectiveness of Balanced-DPO, validating the theoretical findings and confirming that addressing gradient imbalance is key to improving DPO's performance, highlighting a promising direction for future research.
中文: 本研究揭示了直接偏好优化(DPO)中梯度不平衡的核心缺陷,提出了带梯度重加权机制的Balanced-DPO改进方案,有效解决了该问题并提升了算法性能。
English: This study identifies gradient imbalance as a key limitation in Direct Preference Optimization (DPO) and proposes Balanced-DPO, a modified objective with gradient reweighting that effectively addresses this issue and improves performance.

Authors:Jiawei Wang, Kai Wang, Shaojie Lin, Runze Wu, Bihan Xu, Lingeng Jiang, Shiwei Zhao, Renyu Zhu, Haoyu Liu, Zhipeng Hu, Zhong Fan, Le Li, Tangjie Lyu, Changjie Fan
Title: Digital Player: Evaluating Large Language Models based Human-like Agent in Games
Abstract:
With the rapid advancement of Large Language Models (LLMs), LLM-based autonomous agents have shown the potential to function as digital employees, such as digital analysts, teachers, and programmers. In this paper, we develop an application-level testbed based on the open-source strategy game "Unciv", which has millions of active players, to enable researchers to build a "data flywheel" for studying human-like agents in the "digital players" task. This "Civilization"-like game features expansive decision-making spaces along with rich linguistic interactions such as diplomatic negotiations and acts of deception, posing significant challenges for LLM-based agents in terms of numerical reasoning and long-term planning. Another challenge for "digital players" is to generate human-like responses for social interaction, collaboration, and negotiation with human players. The open-source project can be found at https:/github.com/fuxiAIlab/CivAgent.
中文: 基于大语言模型的自主智能体展现出成为数字员工的潜力,通过开源游戏《Unciv》构建的应用级测试平台,为研究具备复杂决策与社交交互能力的类人智能体提供了实验环境。
English: Large Language Model-based autonomous agents demonstrate potential as digital employees, with an application-level testbed using the game "Unciv" enabling research into human-like agents through complex decision-making and social interactions.

Authors:Dawei Zhu, Xiyu Wei, Guangxiang Zhao, Wenhao Wu, Haosheng Zou, Junfeng Ran, Xun Wang, Lin Sun, Xiangzheng Zhang, Sujian Li
Title: Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision
Abstract:
Recent advances in Large Language Models (LLMs) have highlighted the challenge of handling long-context tasks, where models need to reason over extensive input contexts to aggregate target information. While Chain-of-Thought (CoT) prompting has shown promise for multi-step reasoning, its effectiveness for long-context scenarios remains underexplored. Through systematic investigation across diverse tasks, we demonstrate that CoT's benefits generalize across most long-context scenarios and amplify with increasing context length. Motivated by this critical observation, we propose LongRePS, a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long-context scenarios. Experimental results on various long-context benchmarks demonstrate the effectiveness of our approach, achieving significant improvements over outcome supervision baselines on both in-domain tasks (+13.6/+3.8 points for LLaMA/Qwen on MuSiQue) and cross-domain generalization (+9.3/+8.1 points on average across diverse QA tasks). Our code, data and trained models are made public to facilitate future research.
中文:思维链提示在长上下文场景中效果显著,由此提出的LongRePS框架通过优化推理路径,在多项基准测试中实现了性能的大幅提升。
English: Chain-of-Thought prompting proves effective in long-context scenarios, leading to the development of LongRePS, a framework that enhances reasoning paths and achieves significant performance gains across various benchmarks.

Authors:Bo Wang, Yiqiao Li, Jianlong Zhou, Fang Chen
Title: Can LLM Assist in the Evaluation of the Quality of Machine Learning Explanations?
Abstract:
EXplainable machine learning (XML) has recently emerged to address the mystery mechanisms of machine learning (ML) systems by interpreting their 'black box' results. Despite the development of various explanation methods, determining the most suitable XML method for specific ML contexts remains unclear, highlighting the need for effective evaluation of explanations. The evaluating capabilities of the Transformer-based large language model (LLM) present an opportunity to adopt LLM-as-a-Judge for assessing explanations. In this paper, we propose a workflow that integrates both LLM-based and human judges for evaluating explanations. We examine how LLM-based judges evaluate the quality of various explanation methods and compare their evaluation capabilities to those of human judges within an iris classification scenario, employing both subjective and objective metrics. We conclude that while LLM-based judges effectively assess the quality of explanations using subjective metrics, they are not yet sufficiently developed to replace human judges in this role.
中文摘要:本研究提出了一种结合基于大语言模型的评估与人工评估的工作流程,用于评估可解释机器学习方法,发现尽管大语言模型能有效运用主观指标,但目前尚无法取代人类评估者的角色。
English Summary: This study introduces a workflow combining LLM-based and human evaluation to assess explainable machine learning methods, finding that while LLMs effectively use subjective metrics, they cannot yet replace human judges.

Authors:Mohammad Abu Tami, Mohammed Elhenawy, Huthaifa I. Ashqar
Title: HazardNet: A Small-Scale Vision Language Model for Real-Time Traffic Safety Detection at Edge Devices
Abstract:
Traffic safety remains a vital concern in contemporary urban settings, intensified by the increase of vehicles and the complicated nature of road networks. Traditional safety-critical event detection systems predominantly rely on sensor-based approaches and conventional machine learning algorithms, necessitating extensive data collection and complex training processes to adhere to traffic safety regulations. This paper introduces HazardNet, a small-scale Vision Language Model designed to enhance traffic safety by leveraging the reasoning capabilities of advanced language and vision models. We built HazardNet by fine-tuning the pre-trained Qwen2-VL-2B model, chosen for its superior performance among open-source alternatives and its compact size of two billion parameters. This helps to facilitate deployment on edge devices with efficient inference throughput. In addition, we present HazardQA, a novel Vision Question Answering (VQA) dataset constructed specifically for training HazardNet on real-world scenarios involving safety-critical events. Our experimental results show that the fine-tuned HazardNet outperformed the base model up to an 89% improvement in F1-Score and has comparable results with improvement in some cases reach up to 6% when compared to larger models, such as GPT-4o. These advancements underscore the potential of HazardNet in providing real-time, reliable traffic safety event detection, thereby contributing to reduced accidents and improved traffic management in urban environments. Both HazardNet model and the HazardQA dataset are available at https://huggingface.co/Tami3/HazardNet and https://huggingface.co/datasets/Tami3/HazardQA, respectively.
中文: 本文提出的HazardNet是基于Qwen2-VL-2B微调的紧凑型视觉语言模型,通过新型HazardQA数据集实现了交通安全事件检测性能的显著提升(F1分数最高提升89%),并能高效部署于边缘设备。
English: This paper introduces HazardNet, a compact Vision Language Model fine-tuned from Qwen2-VL-2B, which significantly enhances traffic safety event detection with up to 89% F1-score improvement and enables efficient deployment on edge devices using the novel HazardQA dataset.

Authors:Wenxin Jiang, Berk Çakar, Mikola Lysenko, James C. Davis
Title: ConfuGuard: Using Metadata to Detect Active and Stealthy Package Confusion Attacks Accurately and at Scale
Abstract:
Package confusion attacks such as typosquatting threaten software supply chains. Attackers make packages with names that syntactically or semantically resemble legitimate ones, tricking engineers into installing malware. While prior work has developed defenses against package confusions in some software package registries, notably NPM, PyPI, and RubyGems, gaps remain: high false-positive rates, generalization to more software package ecosystems, and insights from real-world deployment. In this work, we introduce ConfuGuard, a state-of-art detector for package confusion threats. We begin by presenting the first empirical analysis of benign signals derived from prior package confusion data, uncovering their threat patterns, engineering practices, and measurable attributes. Advancing existing detectors, we leverage package metadata to distinguish benign packages, and extend support from three up to seven software package registries. Our approach significantly reduces false positive rates (from 80% to 28%), at the cost of an additional 14s average latency to filter out benign packages by analyzing the package metadata. ConfuGuard is used in production at our industry partner, whose analysts have already confirmed 630 real attacks detected by ConfuGuard.
Package confusion attacks trick developers into installing malicious software through similarly named packages, and ConfuGuard is introduced as an advanced detector that significantly reduces false positives and extends protection across seven software registries, with proven real-world effectiveness in identifying hundreds of attacks.
English Summary:

Authors:Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Chaoqun Liu, Lidong Bing, Deli Zhao, Anh Tuan Luu, Yu Rong
Title: FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving
Abstract:
Many challenging reasoning tasks require not just rapid, intuitive responses, but a more deliberate, multi-step approach. Recent progress in large language models (LLMs) highlights an important shift from the "System 1" way of quick reactions to the "System 2" style of reflection-and-correction problem solving. However, current benchmarks heavily rely on the final-answer accuracy, leaving much of a model's intermediate reasoning steps unexamined. This fails to assess the model's ability to reflect and rectify mistakes within the reasoning process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark for fine-grained evaluation of LLMs' reasoning capabilities. Each puzzle can be decomposed into atomic steps, making it ideal for rigorous validation of intermediate correctness. Building on this, we introduce two tasks: state checking, and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. To support broader research, we also provide a puzzle training set aimed at enhancing performance on general mathematical tasks. We show that models trained on our state checking and transition data demonstrate gains in math reasoning by up to 5.1% on GSM8K.
Chinese Summary: FINEREASON 是一个通过逻辑谜题细粒度评估大语言模型逐步推理能力的新基准,填补了中间步骤评估的空白,并将数学推理在GSM8K上的表现提升了5.1%。
English Summary: FINEREASON is a new benchmark designed to evaluate large language models' step-by-step reasoning through logic puzzles, addressing the gap in assessing intermediate steps and improving mathematical reasoning by up to 5.1% on GSM8K.

Authors:Yuan Sui, Yufei He, Tri Cao, Simeng Han, Yulin Chen, Bryan Hooi
Title: Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models
Abstract:
Large Language Models (LLMs) struggle with high computational time and error propagation during inference time, especially for complex tasks like math, puzzles, or coding requiring multi-step thinking. While existing reasoning models with chain-of-thoughts (CoT) can enable LLMs to do step-wise analysis and reflection, they often face the issue of wasting computation on less productive solutions and fail to make progress during inference time. In this paper, we propose Meta-Reasoner, a new framework to enable LLMs ``Think about how to think'', i.e., optimize the inference compute by adjusting strategies on how to reason during inference time. Inspired by dual-process theory, our method decouples the high-level strategy generation (e.g., backtracking, switching approaches, or restarting) from stepwise CoT generation via a lightweight progress report. The strategy module only consider the summarized version from the previous CoTs to propose new strategies accordingly. We employ the contextual multi-armed bandits (CMABs) for this module to iteratively evaluate the previous reasoning states and dynamically adjust the strategy to avoid reasoning get stuck in less productive paths during inference. Evaluations on math problems (e.g., Game-of-24, TheoremQA) and scientific problems (e.g., SciBench) demonstrate that our method improves performance by 9-12\% over previous SOTA methods while reducing inference time by 28-35\%. This approach also generalizes to other domains like creative writing, demonstrating its versatility for diverse reasoning-intensive problems using LLMs.
中文: 提出的元推理框架使大语言模型能够通过动态调整推理策略来优化计算效率,在数学和科学任务中性能提升9-12%,推理时间减少28-35%。
English: The proposed Meta-Reasoner framework enables Large Language Models to optimize inference computation by dynamically adjusting reasoning strategies, significantly improving performance by 9-12% and reducing inference time by 28-35% across mathematical and scientific tasks.

Authors:Weijie Yue, Zhongwei Si, Bolin Wu, Sixian Wang, Xiaoqi Qin, Kai Niu, Jincheng Dai, Ping Zhang
Title: NeRFCom: Feature Transform Coding Meets Neural Radiance Field for Free-View 3D Scene Semantic Transmission
Abstract:
We introduce NeRFCom, a novel communication system designed for end-to-end 3D scene transmission. Compared to traditional systems relying on handcrafted NeRF semantic feature decomposition for compression and well-adaptive channel coding for transmission error correction, our NeRFCom employs a nonlinear transform and learned probabilistic models, enabling flexible variable-rate joint source-channel coding and efficient bandwidth allocation aligned with the NeRF semantic feature's different contribution to the 3D scene synthesis fidelity. Experimental results demonstrate that NeRFCom achieves free-view 3D scene efficient transmission while maintaining robustness under adverse channel conditions.
Chinese: NeRFCom是一种新型通信系统,采用非线性变换和学习概率模型实现可变速率联合信源信道编码,在恶劣信道条件下仍能高效、鲁棒地传输3D场景。
English: NeRFCom is a novel communication system that uses nonlinear transforms and learned probabilistic models for efficient variable-rate joint source-channel coding, enabling robust and efficient transmission of 3D scenes under adverse channel conditions.

Authors:Yang Liu, Zinan Zheng, Jiashun Cheng, Fugee Tsung, Deli Zhao, Yu Rong, Jia Li
Title: CirT: Global Subseasonal-to-Seasonal Forecasting with Geometry-inspired Transformer
Abstract:
Accurate Subseasonal-to-Seasonal (S2S) climate forecasting is pivotal for decision-making including agriculture planning and disaster preparedness but is known to be challenging due to its chaotic nature. Although recent data-driven models have shown promising results, their performance is limited by inadequate consideration of geometric inductive biases. Usually, they treat the spherical weather data as planar images, resulting in an inaccurate representation of locations and spatial relations. In this work, we propose the geometric-inspired Circular Transformer (CirT) to model the cyclic characteristic of the graticule, consisting of two key designs: (1) Decomposing the weather data by latitude into circular patches that serve as input tokens to the Transformer; (2) Leveraging Fourier transform in self-attention to capture the global information and model the spatial periodicity. Extensive experiments on the Earth Reanalysis 5 (ERA5) reanalysis dataset demonstrate our model yields a significant improvement over the advanced data-driven models, including PanguWeather and GraphCast, as well as skillful ECMWF systems. Additionally, we empirically show the effectiveness of our model designs and high-quality prediction over spatial and temporal dimensions.
中文摘要:本文提出环形变换器模型,通过将球面气象数据分解为纬度环形区块并利用傅里叶变换捕捉空间周期性,有效解决了现有数据驱动模型在次季节至季节气候预测中几何归纳偏差不足的问题,在ERA5数据集上显著超越了PanguWeather等先进模型。
English Summary: This paper introduces the Circular Transformer (CirT), a novel data-driven model that addresses limitations in subseasonal-to-seasonal climate forecasting by incorporating geometric inductive biases through latitude-based circular patches and Fourier-enhanced self-attention, demonstrating superior performance over leading models like PanguWeather and ECMWF systems.

Authors:Linyang He, Ercong Nie, Sukru Samet Dindar, Arsalan Firoozi, Adrian Florea, Van Nguyen, Corentin Puffay, Riki Shimizu, Haotian Ye, Jonathan Brennan, Helmut Schmid, Hinrich Schütze, Nima Mesgarani
Title: XCOMPS: A Multilingual Benchmark of Conceptual Minimal Pairs
Abstract:
We introduce XCOMPS in this work, a multilingual conceptual minimal pair dataset covering 17 languages. Using this dataset, we evaluate LLMs' multilingual conceptual understanding through metalinguistic prompting, direct probability measurement, and neurolinguistic probing. By comparing base, instruction-tuned, and knowledge-distilled models, we find that: 1) LLMs exhibit weaker conceptual understanding for low-resource languages, and accuracy varies across languages despite being tested on the same concept sets. 2) LLMs excel at distinguishing concept-property pairs that are visibly different but exhibit a marked performance drop when negative pairs share subtle semantic similarities. 3) Instruction tuning improves performance in concept understanding but does not enhance internal competence; knowledge distillation can enhance internal competence in conceptual understanding for low-resource languages with limited gains in explicit task performance. 4) More morphologically complex languages yield lower concept understanding scores and require deeper layers for conceptual reasoning.
中文:XCOMPS是一个多语言数据集,用于评估大语言模型在17种语言中的概念理解能力,发现模型在低资源语言上表现较弱,对语义细微差异敏感,且语言复杂度和模型结构显著影响概念推理效果。
English: XCOMPS is a multilingual dataset used to evaluate LLMs' conceptual understanding across 17 languages, revealing performance gaps in low-resource languages, sensitivity to semantic nuances, and the impact of model architecture and language complexity on reasoning capabilities.

Authors:Wenyuan Cheng, Zengyang Li, Peng Liang, Ran Mo, Hui Liu
Title: Unveiling Security Weaknesses in Autonomous Driving Systems: An In-Depth Empirical Study
Abstract:
The advent of Autonomous Driving Systems (ADS) has marked a significant shift towards intelligent transportation, with implications for public safety and traffic efficiency. While these systems integrate a variety of technologies and offer numerous benefits, their security is paramount, as vulnerabilities can have severe consequences for safety and trust. This study aims to systematically investigate potential security weaknesses in the codebases of prominent open-source ADS projects using CodeQL, a static code analysis tool. The goal is to identify common vulnerabilities, their distribution and persistence across versions to enhance the security of ADS. We selected three representative open-source ADS projects, Autoware, AirSim, and Apollo, based on their high GitHub star counts and Level 4 autonomous driving capabilities. Using CodeQL, we analyzed multiple versions of these projects to identify vulnerabilities, focusing on CWE categories such as CWE-190 (Integer Overflow or Wraparound) and CWE-20 (Improper Input Validation). We also tracked the lifecycle of these vulnerabilities across software versions. This approach allows us to systematically analyze vulnerabilities in projects, which has not been extensively explored in previous ADS research. Our analysis revealed that specific CWE categories, particularly CWE-190 (59.6%) and CWE-20 (16.1%), were prevalent across the selected ADS projects. These vulnerabilities often persisted for over six months, spanning multiple version iterations. The empirical assessment showed a direct link between the severity of these vulnerabilities and their tangible effects on ADS performance. These security issues among ADS still remain to be resolved. Our findings highlight the need for integrating static code analysis into ADS development to detect and mitigate common vulnerabilities.
中文: 本研究采用CodeQL系统分析主流开源自动驾驶系统的安全漏洞,发现整数溢出和输入验证不当等持续存在的缺陷会直接影响系统安全与性能。
English: This study uses CodeQL to systematically analyze security vulnerabilities in major open-source autonomous driving systems, revealing persistent flaws like integer overflows and improper input validation that impact system safety and performance.

Authors:Ife Adebara, Hawau Olamide Toyin, Nahom Tesfu Ghebremichael, AbdelRahim Elmadany, Muhammad Abdul-Mageed
Title: Where Are We? Evaluating LLM Performance on African Languages
Abstract:
Africa's rich linguistic heritage remains underrepresented in NLP, largely due to historical policies that favor foreign languages and create significant data inequities. In this paper, we integrate theoretical insights on Africa's language landscape with an empirical evaluation using Sahara - a comprehensive benchmark curated from large-scale, publicly accessible datasets capturing the continent's linguistic diversity. By systematically assessing the performance of leading large language models (LLMs) on Sahara, we demonstrate how policy-induced data variations directly impact model effectiveness across African languages. Our findings reveal that while a few languages perform reasonably well, many Indigenous languages remain marginalized due to sparse data. Leveraging these insights, we offer actionable recommendations for policy reforms and inclusive data practices. Overall, our work underscores the urgent need for a dual approach - combining theoretical understanding with empirical evaluation - to foster linguistic diversity in AI for African communities.
中文: 非洲语言多样性在自然语言处理中代表性不足,历史政策导致数据不平等;通过撒哈拉基准评估大语言模型发现,许多土著语言因数据稀疏被边缘化,亟需结合理论认知与实证评估推动政策改革和包容性数据实践。
English: Africa's linguistic diversity is underrepresented in NLP due to historical policies causing data inequities, as demonstrated by evaluating LLMs on the Sahara benchmark, which reveals marginalized Indigenous languages and calls for policy reforms and inclusive data practices.

Authors:Qi Yu, Zhichen Zeng, Yuchen Yan, Lei Ying, R. Srikant, Hanghang Tong
Title: Joint Optimal Transport and Embedding for Network Alignment
Abstract:
Network alignment, which aims to find node correspondence across different networks, is the cornerstone of various downstream multi-network and Web mining tasks. Most of the embedding-based methods indirectly model cross-network node relationships by contrasting positive and negative node pairs sampled from hand-crafted strategies, which are vulnerable to graph noises and lead to potential misalignment of nodes. Another line of work based on the optimal transport (OT) theory directly models cross-network node relationships and generates noise-reduced alignments. However, OT methods heavily rely on fixed, pre-defined cost functions that prohibit end-to-end training and are hard to generalize. In this paper, we aim to unify the embedding and OT-based methods in a mutually beneficial manner and propose a joint optimal transport and embedding framework for network alignment named JOENA. For one thing (OT for embedding), through a simple yet effective transformation, the noise-reduced OT mapping serves as an adaptive sampling strategy directly modeling all cross-network node pairs for robust embedding learning.For another (embedding for OT), on top of the learned embeddings, the OT cost can be gradually trained in an end-to-end fashion, which further enhances the alignment quality. With a unified objective, the mutual benefits of both methods can be achieved by an alternating optimization schema with guaranteed convergence. Extensive experiments on real-world networks validate the effectiveness and scalability of JOENA, achieving up to 16% improvement in MRR and 20x speedup compared with the state-of-the-art alignment methods.
中文: JOENA框架将嵌入方法和最优传输理论相结合,通过OT映射实现鲁棒的嵌入学习,并利用可训练的OT成本提升对齐质量,在准确性和效率上均取得显著进步。
English: JOENA unifies embedding and optimal transport methods for network alignment, using OT mapping for robust embedding learning and trainable OT costs to enhance alignment, achieving significant improvements in accuracy and efficiency.

Authors:Jiarong Wu, Songqiang Chen, Jialun Cao, Hau Ching Lo, Shing-Chi Cheung
Title: Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval
Abstract:
Existing code generation benchmarks for Large Language Models (LLMs) such as HumanEval and MBPP are designed to study LLMs' end-to-end performance, where the benchmarks feed a problem description in natural language as input and examine the generated code in specific programming languages. However, the evaluation scores revealed in this way provide a little hint as to the bottleneck of the code generation -- whether LLMs are struggling with their problem-solving capability or language-coding capability. To answer this question, we construct PseudoEval, a multilingual code generation benchmark that provides a solution written in pseudocode as input. By doing so, the bottleneck of code generation in various programming languages could be isolated and identified. Our study yields several interesting findings. For example, we identify that the bottleneck of LLMs in Python programming is problem-solving, while Rust is struggling relatively more in language-coding. Also, our study indicates that problem-solving capability may transfer across programming languages, while language-coding needs more language-specific effort, especially for undertrained programming languages. Finally, we release the pipeline of constructing PseudoEval to facilitate the extension to existing benchmarks. PseudoEval is available at: https://anonymous.4open.science/r/PseudocodeACL25-7B74.
中文:PseudoEval是一个通过伪代码输入来区分大语言模型在问题解决与语言编码方面瓶颈的新基准,发现Python的瓶颈在于问题解决能力,而Rust则更多受限于语言编码能力。
English: PseudoEval is a new benchmark that uses pseudocode inputs to isolate whether LLMs struggle more with problem-solving or language-specific coding, revealing Python's bottleneck lies in problem-solving while Rust faces greater challenges in language-coding.

Authors:Quentin Mazouni, Helge Spieker, Arnaud Gotlieb, Mathieu Acher
Title: Policy Testing with MDPFuzz (Replicability Study)
Abstract:
In recent years, following tremendous achievements in Reinforcement Learning, a great deal of interest has been devoted to ML models for sequential decision-making. Together with these scientific breakthroughs/advances, research has been conducted to develop automated functional testing methods for finding faults in black-box Markov decision processes. Pang et al. (ISSTA 2022) presented a black-box fuzz testing framework called MDPFuzz. The method consists of a fuzzer whose main feature is to use Gaussian Mixture Models (GMMs) to compute coverage of the test inputs as the likelihood to have already observed their results. This guidance through coverage evaluation aims at favoring novelty during testing and fault discovery in the decision model. Pang et al. evaluated their work with four use cases, by comparing the number of failures found after twelve-hour testing campaigns with or without the guidance of the GMMs (ablation study). In this paper, we verify some of the key findings of the original paper and explore the limits of MDPFuzz through reproduction and replication. We re-implemented the proposed methodology and evaluated our replication in a large-scale study that extends the original four use cases with three new ones. Furthermore, we compare MDPFuzz and its ablated counterpart with a random testing baseline. We also assess the effectiveness of coverage guidance for different parameters, something that has not been done in the original evaluation. Despite this parameter analysis and unlike Pang et al.'s original conclusions, we find that in most cases, the aforementioned ablated Fuzzer outperforms MDPFuzz, and conclude that the coverage model proposed does not lead to finding more faults.
中文: 本文通过复现和扩展MDPFuzz框架发现,与原始研究结论相反,在大多数情况下基于覆盖率的模糊测试方法在故障检测方面表现不如其简化版本及随机测试。
English: This paper reproduces and extends the MDPFuzz framework, finding that contrary to the original study's conclusions, the coverage-guided fuzzer generally underperforms compared to its ablated version and random testing in fault detection.

Authors:Weipeng Jiang, Juan Zhai, Shiqing Ma, Ziyan Lei, Xiaofei Xie, Yige Wang, Chao Shen
Title: Holistic Audit Dataset Generation for LLM Unlearning via Knowledge Graph Traversal and Redundancy Removal
Abstract:
In recent years, Large Language Models (LLMs) have faced increasing demands to selectively remove sensitive information, protect privacy, and comply with copyright regulations through unlearning, by Machine Unlearning. While evaluating unlearning effectiveness is crucial, existing benchmarks are limited in scale and comprehensiveness, typically containing only a few hundred test cases. We identify two critical challenges in generating holistic audit datasets: ensuring audit adequacy and handling knowledge redundancy between forget and retain dataset. To address these challenges, we propose HANKER, an automated framework for holistic audit dataset generation leveraging knowledge graphs to achieve fine-grained coverage and eliminate redundant knowledge. Applying HANKER to the popular MUSE benchmark, we successfully generated over 69,000 and 111,000 audit cases for the News and Books datasets respectively, identifying thousands of knowledge memorization instances that the previous benchmark failed to detect. Our empirical analysis uncovers how knowledge redundancy significantly skews unlearning effectiveness metrics, with redundant instances artificially inflating the observed memorization measurements ROUGE from 19.7% to 26.1% and Entailment Scores from 32.4% to 35.2%, highlighting the necessity of systematic deduplication for accurate assessment.
Chinese: 本文提出HANKER自动化框架,利用知识图谱生成全面的机器遗忘审计数据集,解决审计充分性和知识冗余问题,显著提升了大语言模型遗忘效果的评估准确性。
English: This paper introduces HANKER, an automated framework that uses knowledge graphs to generate comprehensive audit datasets for machine unlearning, addressing challenges of audit adequacy and knowledge redundancy to improve evaluation accuracy.

Authors:Jianwei Wang, Chengming Shi, Junyao Yang, Haoran Li, Qianli Ma, Huiping Zhuang, Cen Chen, Ziqian Zeng
Title: RewardDS: Privacy-Preserving Fine-Tuning for Large Language Models via Reward Driven Data Synthesis
Abstract:
The success of large language models (LLMs) has attracted many individuals to fine-tune them for domain-specific tasks by uploading their data. However, in sensitive areas like healthcare and finance, privacy concerns often arise. One promising solution is to generate synthetic data with Differential Privacy (DP) guarantees to replace private data. However, these synthetic data contain significant flawed data, which are considered as noise. Existing solutions typically rely on naive filtering by comparing ROUGE-L scores or embedding similarities, which are ineffective in addressing the noise. To address this issue, we propose \textit{RewardDS}, a novel privacy-preserving framework that fine-tunes a reward proxy model and uses reward signals to guide the synthetic data generation. Our \textit{RewardDS} introduces two key modules, Reward Guided Filtering and Self-Optimizing Refinement, to both filter and refine the synthetic data, effectively mitigating the noise. Extensive experiments across medical, financial, and code generation domains demonstrate the effectiveness of our method.
中文摘要:提出的RewardDS框架通过奖励代理模型筛选和优化差分隐私合成数据,有效解决了大语言模型在医疗、金融等敏感领域微调时的隐私泄露与数据噪声问题。
English Summary: The proposed RewardDS framework addresses privacy concerns in fine-tuning large language models by using a reward proxy model to filter and refine differentially private synthetic data, effectively reducing noise across various domains.

Authors:Xueguang Ma, Xi Victoria Lin, Barlas Oguz, Jimmy Lin, Wen-tau Yih, Xilun Chen
Title: DRAMA: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers
Abstract:
Large language models (LLMs) have demonstrated strong effectiveness and robustness while fine-tuned as dense retrievers. However, their large parameter size brings significant inference time computational challenges, including high encoding costs for large-scale corpora and increased query latency, limiting their practical deployment. While smaller retrievers offer better efficiency, they often fail to generalize effectively with limited supervised fine-tuning data. In this work, we introduce DRAMA, a training framework that leverages LLMs to train smaller generalizable dense retrievers. In particular, we adopt pruned LLMs as the backbone and train on diverse LLM-augmented data in a single-stage contrastive learning setup. Experiments show that DRAMA offers better multilingual and long-context capabilities than traditional encoder-based retrievers, and achieves strong performance across multiple tasks and languages. These highlight the potential of connecting the training of smaller retrievers with the growing advancements in LLMs, bridging the gap between efficiency and generalization.
Chinese: DRAMA是一种利用大语言模型高效训练小型密集检索器的框架,在保持计算效率的同时,实现了跨多任务和多语言的强大泛化能力。
English: DRAMA is a training framework that uses large language models to efficiently train smaller dense retrievers, achieving strong generalization across multiple tasks and languages while maintaining computational efficiency.

Authors:Lei Li, Sen Jia, Jianhao Wang, Zhaochong An, Jiaang Li, Jenq-Neng Hwang, Serge Belongie
Title: ChatMotion: A Multimodal Multi-Agent for Human Motion Analysis
Abstract:
Advancements in Multimodal Large Language Models (MLLMs) have improved human motion understanding. However, these models remain constrained by their "instruct-only" nature, lacking interactivity and adaptability for diverse analytical perspectives. To address these challenges, we introduce ChatMotion, a multimodal multi-agent framework for human motion analysis. ChatMotion dynamically interprets user intent, decomposes complex tasks into meta-tasks, and activates specialized function modules for motion comprehension. It integrates multiple specialized modules, such as the MotionCore, to analyze human motion from various perspectives. Extensive experiments demonstrate ChatMotion's precision, adaptability, and user engagement for human motion understanding.
中文: ChatMotion是一种创新的多模态多智能体框架,通过动态解析用户意图、分解任务并激活专业模块,实现了精确且适应性强的人体运动分析。
English: ChatMotion is a novel multimodal multi-agent framework that enhances human motion understanding by dynamically interpreting user intent, decomposing tasks, and activating specialized modules for precise and adaptable analysis.

Authors:Jasper Roe, Mike Perkins, Klaire Somoray, Dan Miller, Leon Furze
Title: To Deepfake or Not to Deepfake: Higher Education Stakeholders' Perceptions and Intentions towards Synthetic Media
Abstract:
Advances in deepfake technologies, which use generative artificial intelligence (GenAI) to mimic a person's likeness or voice, have led to growing interest in their use in educational contexts. However, little is known about how key stakeholders perceive and intend to use these tools. This study investigated higher education stakeholder perceptions and intentions regarding deepfakes through the lens of the Unified Theory of Acceptance and Use of Technology 2 (UTAUT2). Using a mixed-methods approach combining survey data (n=174) with qualitative interviews, we found that academic stakeholders demonstrated a relatively low intention to adopt these technologies (M=41.55, SD=34.14) and held complex views about their implementation. Quantitative analysis revealed adoption intentions were primarily driven by hedonic motivation, with a gender-specific interaction in price-value evaluations. Qualitative findings highlighted potential benefits of enhanced student engagement, improved accessibility, and reduced workload in content creation, but concerns regarding the exploitation of academic labour, institutional cost-cutting leading to automation, degradation of relationships in education, and broader societal impacts. Based on these findings, we propose a framework for implementing deepfake technologies in higher education that addresses institutional policies, professional development, and equitable resource allocation to thoughtfully integrate AI while maintaining academic integrity and professional autonomy.
中文摘要:本研究揭示了高等教育利益相关者对深度伪造技术采纳意愿较低且持有复杂观点,主要受享乐动机驱动并担忧学术劳动剥削及机构自动化,据此提出了维护学术诚信与专业自主权的伦理整合框架。
English Summary: This study explores higher education stakeholders' low adoption intentions and complex views on deepfake technologies, driven by hedonic motivation and concerns over academic labor exploitation and institutional automation, proposing a framework for ethical integration in academia.

Authors:Pei Liu, Haipeng Liu, Haichao Liu, Xin Liu, Jinxin Ni, Jun Ma
Title: VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion
Abstract:
Human drivers adeptly navigate complex scenarios by utilizing rich attentional semantics, but the current autonomous systems struggle to replicate this ability, as they often lose critical semantic information when converting 2D observations into 3D space. In this sense, it hinders their effective deployment in dynamic and complex environments. Leveraging the superior scene understanding and reasoning abilities of Vision-Language Models (VLMs), we propose VLM-E2E, a novel framework that uses the VLMs to enhance training by providing attentional cues. Our method integrates textual representations into Bird's-Eye-View (BEV) features for semantic supervision, which enables the model to learn richer feature representations that explicitly capture the driver's attentional semantics. By focusing on attentional semantics, VLM-E2E better aligns with human-like driving behavior, which is critical for navigating dynamic and complex environments. Furthermore, we introduce a BEV-Text learnable weighted fusion strategy to address the issue of modality importance imbalance in fusing multimodal information. This approach dynamically balances the contributions of BEV and text features, ensuring that the complementary information from visual and textual modalities is effectively utilized. By explicitly addressing the imbalance in multimodal fusion, our method facilitates a more holistic and robust representation of driving environments. We evaluate VLM-E2E on the nuScenes dataset and achieve significant improvements in perception, prediction, and planning over the baseline end-to-end model, showcasing the effectiveness of our attention-enhanced BEV representation in enabling more accurate and reliable autonomous driving tasks.
Chinese: 提出的VLM-E2E框架通过集成视觉语言模型提供注意力线索,并采用可学习的加权融合策略平衡鸟瞰图和文本特征,在nuScenes数据集上实现了感知、预测和规划能力的显著提升。
English: The proposed VLM-E2E framework enhances autonomous driving by integrating Vision-Language Models to provide attentional cues and a weighted fusion strategy for balancing BEV and text features, achieving significant improvements in perception, prediction, and planning on the nuScenes dataset.

Authors:Tim Schreiter, Andrey Rudenko, Jens V. Rüppel, Martin Magnusson, Achim J. Lilienthal
Title: Multimodal Interaction and Intention Communication for Industrial Robots
Abstract:
Successful adoption of industrial robots will strongly depend on their ability to safely and efficiently operate in human environments, engage in natural communication, understand their users, and express intentions intuitively while avoiding unnecessary distractions. To achieve this advanced level of Human-Robot Interaction (HRI), robots need to acquire and incorporate knowledge of their users' tasks and environment and adopt multimodal communication approaches with expressive cues that combine speech, movement, gazes, and other modalities. This paper presents several methods to design, enhance, and evaluate expressive HRI systems for non-humanoid industrial robots. We present the concept of a small anthropomorphic robot communicating as a proxy for its non-humanoid host, such as a forklift. We developed a multimodal and LLM-enhanced communication framework for this robot and evaluated it in several lab experiments, using gaze tracking and motion capture to quantify how users perceive the robot and measure the task progress.
中文摘要:工业机器人需通过多模态交互和用户认知实现安全直观的人机协作,本文通过开发具备大语言模型增强功能的拟人代理机器人,并利用视线追踪与动作捕捉进行评估验证。
English Summary: Industrial robots must achieve safe, intuitive human-robot interaction through multimodal communication and user understanding, as demonstrated by this paper's development of an anthropomorphic proxy robot with LLM-enhanced framework evaluated through gaze tracking and motion capture.

Authors:Hongqiu Wu, Weiqi Wu, Tianyang Xu, Jiameng Zhang, Hai Zhao
Title: Towards Enhanced Immersion and Agency for LLM-based Interactive Drama
Abstract:
LLM-based Interactive Drama is a novel AI-based dialogue scenario, where the user (i.e. the player) plays the role of a character in the story, has conversations with characters played by LLM agents, and experiences an unfolding story. This paper begins with understanding interactive drama from two aspects: Immersion, the player's feeling of being present in the story, and Agency, the player's ability to influence the story world. Both are crucial to creating an enjoyable interactive experience, while they have been underexplored in previous work. To enhance these two aspects, we first propose Playwriting-guided Generation, a novel method that helps LLMs craft dramatic stories with substantially improved structures and narrative quality. Additionally, we introduce Plot-based Reflection for LLM agents to refine their reactions to align with the player's intentions. Our evaluation relies on human judgment to assess the gains of our methods in terms of immersion and agency.
中文: 本文提出基于大语言模型的互动戏剧,通过剧本引导生成和情节反思方法提升玩家的沉浸感和能动性,并借助人工评估验证了这些方法的成效。
English: This paper introduces LLM-based Interactive Drama, focusing on enhancing player immersion and agency through Playwriting-guided Generation and Plot-based Reflection, with human evaluation confirming their effectiveness.

Authors:Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, Johannes Gasteiger
Title: The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
Abstract:
The safety alignment of large language models (LLMs) can be circumvented through adversarially crafted inputs, yet the mechanisms by which these attacks bypass safety barriers remain poorly understood. Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. In this study, we propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. Contrary to prior work, we uncover multiple independent directions and even multi-dimensional concept cones that mediate refusal. Moreover, we show that orthogonality alone does not imply independence under intervention, motivating the notion of representational independence that accounts for both linear and non-linear effects. Using this framework, we identify mechanistically independent refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions, confirming that multiple distinct mechanisms drive refusal behavior. Our gradient-based approach uncovers these mechanisms and can further serve as a foundation for future work on understanding LLMs.
中文摘要:本研究通过新型梯度方法揭示,大型语言模型采用多个独立的拒绝机制,挑战了先前关于单一拒绝方向的假设,并发现了控制安全行为的复杂空间结构。
English Summary: This study reveals that large language models employ multiple independent refusal mechanisms, identified through a novel gradient-based approach, challenging previous assumptions of a single refusal direction and uncovering complex spatial structures governing safety behaviors.

Authors:Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann
Title: REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective
Abstract:
To circumvent the alignment of large language models (LLMs), current optimization-based adversarial attacks usually craft adversarial prompts by maximizing the likelihood of a so-called affirmative response. An affirmative response is a manually designed start of a harmful answer to an inappropriate request. While it is often easy to craft prompts that yield a substantial likelihood for the affirmative response, the attacked model frequently does not complete the response in a harmful manner. Moreover, the affirmative objective is usually not adapted to model-specific preferences and essentially ignores the fact that LLMs output a distribution over responses. If low attack success under such an objective is taken as a measure of robustness, the true robustness might be grossly overestimated. To alleviate these flaws, we propose an adaptive and semantic optimization problem over the population of responses. We derive a generally applicable objective via the REINFORCE policy-gradient formalism and demonstrate its efficacy with the state-of-the-art jailbreak algorithms Greedy Coordinate Gradient (GCG) and Projected Gradient Descent (PGD). For example, our objective doubles the attack success rate (ASR) on Llama3 and increases the ASR from 2% to 50% with circuit breaker defense.
中文: 当前针对大语言模型的对齐规避攻击虽能提高肯定响应概率却常无法生成有害内容,导致鲁棒性被高估;提出的基于响应群体的自适应语义优化方法显著提升了攻击成功率,在Llama3上实现翻倍增长,并在电路 breaker 防御下从2%提升至50%。
English: Current adversarial attacks on large language models often fail to produce harmful responses despite maximizing affirmative response likelihood, leading to overestimated robustness; the proposed adaptive semantic optimization over response populations significantly improves attack success rates, doubling it on Llama3 and increasing from 2% to 50% with circuit breaker defense.

Authors:Hamidreza Mazandarani, Masoud Shokrnezhad, Tarik Taleb
Title: A Novel Multiple Access Scheme for Heterogeneous Wireless Communications using Symmetry-aware Continual Deep Reinforcement Learning
Abstract:
The Metaverse holds the potential to revolutionize digital interactions through the establishment of a highly dynamic and immersive virtual realm over wireless communications systems, offering services such as massive twinning and telepresence. This landscape presents novel challenges, particularly efficient management of multiple access to the frequency spectrum, for which numerous adaptive Deep Reinforcement Learning (DRL) approaches have been explored. However, challenges persist in adapting agents to heterogeneous and non-stationary wireless environments. In this paper, we present a novel approach that leverages Continual Learning (CL) to enhance intelligent Medium Access Control (MAC) protocols, featuring an intelligent agent coexisting with legacy User Equipments (UEs) with varying numbers, protocols, and transmission profiles unknown to the agent for the sake of backward compatibility and privacy. We introduce an adaptive Double and Dueling Deep Q-Learning (D3QL)-based MAC protocol, enriched by a symmetry-aware CL mechanism, which maximizes intelligent agent throughput while ensuring fairness. Mathematical analysis validates the efficiency of our proposed scheme, showcasing superiority over conventional DRL-based techniques in terms of throughput, collision rate, and fairness, coupled with real-time responsiveness in highly dynamic scenarios.
中文摘要:本文提出了一种融合持续学习的创新MAC协议,采用自适应双竞争深度Q学习来优化动态无线元宇宙环境中的频谱接入,相比传统方法在吞吐量和公平性方面展现出更优性能。
English Summary: This paper introduces a novel Continual Learning-enhanced MAC protocol using adaptive Double and Dueling Deep Q-Learning to optimize spectrum access in dynamic wireless Metaverse environments, demonstrating superior throughput and fairness over traditional methods.

Authors:Hamidreza Mazandarani, Masoud Shokrnezhad, Tarik Taleb
Title: Semantic-Aware Dynamic and Distributed Power Allocation: a Multi-UAV Area Coverage Use Case
Abstract:
The advancement towards 6G technology leverages improvements in aerial-terrestrial networking, where one of the critical challenges is the efficient allocation of transmit power. Although existing studies have shown commendable performance in addressing this challenge, a revolutionary breakthrough is anticipated to meet the demands and dynamism of 6G. Potential solutions include: 1) semantic communication and orchestration, which transitions the focus from mere transmission of bits to the communication of intended meanings of data and their integration into the network orchestration process; and 2) distributed machine learning techniques to develop adaptable and scalable solutions. In this context, this paper introduces a power allocation framework specifically designed for semantic-aware networks. The framework addresses a scenario involving multiple Unmanned Aerial Vehicles (UAVs) that collaboratively transmit observations over a multi-channel uplink medium to a central server, aiming to maximise observation quality. To tackle this problem, we present the Semantic-Aware Multi-Agent Double and Dueling Deep Q-Learning (SAMA-D3QL) algorithm, which utilizes the data quality of observing areas as reward feedback during the training phase, thereby constituting a semantic-aware learning mechanism. Simulation results substantiate the efficacy and scalability of our approach, demonstrating its superior performance compared to traditional bit-oriented learning and heuristic algorithms.
中文: 本文针对6G空天地网络提出了一种语义感知的功率分配框架,通过引入SAMA-D3QL算法将数据质量作为奖励反馈,以优化多无人机协同传输中的观测质量,仿真结果验证了该方法相较于传统方案的优越性能。
English: This paper proposes a semantic-aware power allocation framework for 6G aerial-terrestrial networks, introducing the SAMA-D3QL algorithm that leverages data quality as reward feedback to maximize observation quality in multi-UAV collaborative transmissions, with simulations confirming its superiority over conventional methods.

Authors:Zhengrong Xue, Shuying Deng, Zhenyang Chen, Yixuan Wang, Zhecheng Yuan, Huazhe Xu
Title: DemoGen: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning
Abstract:
Visuomotor policies have shown great promise in robotic manipulation but often require substantial amounts of human-collected data for effective performance. A key reason underlying the data demands is their limited spatial generalization capability, which necessitates extensive data collection across different object configurations. In this work, we present DemoGen, a low-cost, fully synthetic approach for automatic demonstration generation. Using only one human-collected demonstration per task, DemoGen generates spatially augmented demonstrations by adapting the demonstrated action trajectory to novel object configurations. Visual observations are synthesized by leveraging 3D point clouds as the modality and rearranging the subjects in the scene via 3D editing. Empirically, DemoGen significantly enhances policy performance across a diverse range of real-world manipulation tasks, showing its applicability even in challenging scenarios involving deformable objects, dexterous hand end-effectors, and bimanual platforms. Furthermore, DemoGen can be extended to enable additional out-of-distribution capabilities, including disturbance resistance and obstacle avoidance.
中文:DemoGen是一种低成本的合成方法,通过单次人工演示生成增强示范,利用3D编辑调整动作轨迹和合成视觉,显著提升了机器人策略在多样化现实任务中的表现。
English: DemoGen is a cost-effective synthetic method that generates augmented demonstrations from a single human example, enhancing robotic policy performance across diverse real-world tasks by adapting actions and synthesizing visuals through 3D editing.

Authors:Zekai Shao, Siyu Yuan, Lin Gao, Yixuan He, Deqing Yang, Siming Chen
Title: Unlocking Scientific Concepts: How Effective Are LLM-Generated Analogies for Student Understanding and Classroom Practice?
Abstract:
Teaching scientific concepts is essential but challenging, and analogies help students connect new concepts to familiar ideas. Advancements in large language models (LLMs) enable generating analogies, yet their effectiveness in education remains underexplored. In this paper, we first conducted a two-stage study involving high school students and teachers to assess the effectiveness of LLM-generated analogies in biology and physics through a controlled in-class test and a classroom field study. Test results suggested that LLM-generated analogies could enhance student understanding particularly in biology, but require teachers' guidance to prevent over-reliance and overconfidence. Classroom experiments suggested that teachers could refine LLM-generated analogies to their satisfaction and inspire new analogies from generated ones, encouraged by positive classroom feedback and homework performance boosts. Based on findings, we developed and evaluated a practical system to help teachers generate and refine teaching analogies. We discussed future directions for developing and evaluating LLM-supported teaching and learning by analogy.
中文: 大语言模型生成的类比能有效提升学生对生物学等学科的理解,但需教师指导以避免过度依赖,且教师可基于生成内容优化出满意的教学类比。
English: Large language models can generate educational analogies that enhance student understanding, particularly in biology, but require teacher guidance to prevent over-reliance and can be refined by educators for classroom use.

Authors:Ayush Kumar Shah, Abhisek Dey, Leo Luo, Bryan Amador, Patrick Philippy, Ming Zhong, Siru Ouyang, David Mark Friday, David Bianchi, Nick Jackson, Richard Zanibbi, Jiawei Han
Title: Multimodal Search in Chemical Documents and Reactions
Abstract:
We present a multimodal search tool that facilitates retrieval of chemical reactions, molecular structures, and associated text from scientific literature. Queries may combine molecular diagrams, textual descriptions, and reaction data, allowing users to connect different representations of chemical information. To support this, the indexing process includes chemical diagram extraction and parsing, extraction of reaction data from text in tabular form, and cross-modal linking of diagrams and their mentions in text. We describe the system's architecture, key functionalities, and retrieval process, along with expert assessments of the system. This demo highlights the workflow and technical components of the search system.
中文: 该多模态搜索工具通过整合图表、文本和反应数据的查询,支持从科学文献中检索化学反应、分子结构及相关文本,并采用化学图表解析与跨模态关联的索引技术实现高效检索。
English: This multimodal search tool enables retrieval of chemical reactions, structures, and text from scientific literature through integrated queries combining diagrams, text, and reaction data, supported by comprehensive indexing and cross-modal linking processes.

Authors:Longyun Wu, Dawei Zhu, Guangxiang Zhao, Zhuocheng Yu, Junfeng Ran, Xiangyu Wong, Lin Sun, Sujian Li
Title: LongAttn: Selecting Long-context Training Data via Token-level Attention
Abstract:
With the development of large language models (LLMs), there has been an increasing need for significant advancements in handling long contexts. To enhance long-context capabilities, constructing high-quality training data with long-range dependencies is crucial. Existing methods to select long-context data often rely on sentence-level analysis, which can be greatly optimized in both performance and efficiency. In this paper, we propose a novel token-level framework, LongAttn, which leverages the self-attention mechanism of LLMs to measure the long-range dependencies for the data. By calculating token-level dependency strength and distribution uniformity of token scores, LongAttn effectively quantifies long-range dependencies, enabling more accurate and efficient data selection. We filter LongABC-32K from open-source long-context datasets (ArXiv, Book, and Code). Through our comprehensive experiments, LongAttn has demonstrated its excellent effectiveness, scalability, and efficiency. To facilitate future research in long-context data, we released our code and the high-quality long-context training data LongABC-32K.
Chinese: 本文提出LongAttn框架,通过利用大语言模型的自注意力机制在词元层面量化长程依赖关系,实现了高效的高质量长上下文数据筛选,并展现出卓越的有效性、可扩展性和效率。
English: The paper introduces LongAttn, a token-level framework that uses LLMs' self-attention to measure long-range dependencies, enabling efficient selection of high-quality long-context training data and demonstrating superior effectiveness, scalability, and efficiency.

Authors:Maram Hasanain, Md Arid Hasan, Mohamed Bayan Kmainasi, Elisa Sartori, Ali Ezzat Shahroor, Giovanni Da San Martino, Firoj Alam
Title: Reasoning About Persuasion: Can LLMs Enable Explainable Propaganda Detection?
Abstract:
There has been significant research on propagandistic content detection across different modalities and languages. However, most studies have primarily focused on detection, with little attention given to explanations justifying the predicted label. This is largely due to the lack of resources that provide explanations alongside annotated labels. To address this issue, we propose a multilingual (i.e., Arabic and English) explanation-enhanced dataset, the first of its kind. Additionally, we introduce an explanation-enhanced LLM for both label detection and rationale-based explanation generation. Our findings indicate that the model performs comparably while also generating explanations. We will make the dataset and experimental resources publicly available for the research community.
Chinese: 本研究提出了一个多语言的解释增强数据集及相应的大语言模型,该模型在检测宣传内容的同时生成解释,性能相当,解决了可解释检测资源匮乏的问题。
English: This study introduces a multilingual explanation-enhanced dataset and a corresponding LLM model that performs comparably in detecting propagandistic content while generating explanations, addressing the lack of resources for explainable detection.

Authors:Chaohao Yuan, Kangfei Zhao, Ercan Engin Kuruoglu, Liang Wang, Tingyang Xu, Wenbing Huang, Deli Zhao, Hong Cheng, Yu Rong
Title: A Survey of Graph Transformers: Architectures, Theories and Applications
Abstract:
Graph Transformers (GTs) have demonstrated a strong capability in modeling graph structures by addressing the intrinsic limitations of graph neural networks (GNNs), such as over-smoothing and over-squashing. Recent studies have proposed diverse architectures, enhanced explainability, and practical applications for Graph Transformers. In light of these rapid developments, we conduct a comprehensive review of Graph Transformers, covering aspects such as their architectures, theoretical foundations, and applications within this survey. We categorize the architecture of Graph Transformers according to their strategies for processing structural information, including graph tokenization, positional encoding, structure-aware attention and model ensemble. Furthermore, from the theoretical perspective, we examine the expressivity of Graph Transformers in various discussed architectures and contrast them with other advanced graph learning algorithms to discover the connections. Furthermore, we provide a summary of the practical applications where Graph Transformers have been utilized, such as molecule, protein, language, vision, traffic, brain and material data. At the end of this survey, we will discuss the current challenges and prospective directions in Graph Transformers for potential future research.
中文: 图变换器通过创新架构克服了图神经网络的局限,本综述全面探讨了其设计、理论基础及在多个领域的实际应用。
English: Graph Transformers overcome limitations of graph neural networks through innovative architectures and are comprehensively reviewed in this survey, covering their design, theory, and diverse applications.

Authors:João Henrique Inacio de Souza, Fabio Saggese, Beatriz Soret, Petar Popovski
Title: Preserving Simultaneity and Chronology for Sensing in Perceptive Wireless Networks
Abstract:
We address the challenge of preserving the simultaneity and chronology of sensing events in multisensor systems with wireless links. The network uses temporal windows of integration (TWIs), borrowed from human multisensory perception, to preserve the temporal structure of the sensing data at the application side. We introduce a composite latency model for propagation, sensing, and communication that leads to the derivation of the probability of simultaneity violation. This is used to select the TWI duration aiming to achieve the desired degrees of chronological preservation, while maintaining the throughput of events. The letter provides important insights and analytical tools about the TWI impact on the event registration.
Chinese: 本研究通过采用时间整合窗口来保持多传感器无线系统中感知数据的时序结构,并利用复合延迟模型来减少同步性违规,同时确保事件吞吐量,有效解决了维持事件同时性和时序性的难题。
English: This study tackles the challenge of maintaining simultaneity and chronology in multisensor wireless systems by using temporal windows of integration (TWIs) to preserve temporal data structure and a composite latency model to minimize simultaneity violations while ensuring event throughput.

Authors:João Henrique Inacio de Souza, Fabio Saggese, Beatriz Soret, Petar Popovski
Title: Preserving Simultaneity and Chronology for Sensing in Perceptive Wireless Networks
Abstract:
We address the challenge of preserving the simultaneity and chronology of sensing events in multisensor systems with wireless links. The network uses temporal windows of integration (TWIs), borrowed from human multisensory perception, to preserve the temporal structure of the sensing data at the application side. We introduce a composite latency model for propagation, sensing, and communication that leads to the derivation of the probability of simultaneity violation. This is used to select the TWI duration aiming to achieve the desired degrees of chronological preservation, while maintaining the throughput of events. The letter provides important insights and analytical tools about the TWI impact on the event registration.
Chinese: 本研究通过采用时间整合窗口来保持多传感器无线系统中感知数据的时序结构,并利用复合延迟模型来减少同步性违规,同时确保事件吞吐量,有效解决了维持事件同时性和时序性的难题。
English: This study tackles the challenge of maintaining simultaneity and chronology in multisensor wireless systems by using temporal windows of integration (TWIs) to preserve temporal data structure and a composite latency model to minimize simultaneity violations while ensuring event throughput.

Authors:Masoud Shokrnezhad, Tarik Taleb
Title: An Autonomous Network Orchestration Framework Integrating Large Language Models with Continual Reinforcement Learning
Abstract:
6G networks aim to achieve global coverage, massive connectivity, and ultra-stringent requirements. Space-Air-Ground Integrated Networks (SAGINs) and Semantic Communication (SemCom) are essential for realizing these goals, yet they introduce considerable complexity in resource orchestration. Drawing inspiration from research in robotics, a viable solution to manage this complexity is the application of Large Language Models (LLMs). Although the use of LLMs in network orchestration has recently gained attention, existing solutions have not sufficiently addressed LLM hallucinations or their adaptation to network dynamics. To address this gap, this paper proposes a framework called Autonomous Reinforcement Coordination (ARC) for a SemCom-enabled SAGIN. This framework employs an LLM-based Retrieval-Augmented Generator (RAG) monitors services, users, and resources and processes the collected data, while a Hierarchical Action Planner (HAP) orchestrates resources. ARC decomposes orchestration into two tiers, utilizing LLMs for high-level planning and Reinforcement Learning (RL) agents for low-level decision-making, in alignment with the Mixture of Experts (MoE) concept. The LLMs utilize Chain-of-Thought (CoT) reasoning for few-shot learning, empowered by contrastive learning, while the RL agents employ replay buffer management for continual learning, thereby achieving efficiency, accuracy, and adaptability. Simulations are provided to demonstrate the effectiveness of ARC, along with a comprehensive discussion on potential future research directions to enhance and upgrade ARC.
中文摘要:本文提出自主强化协调(ARC)框架,通过结合大语言模型与强化学习的双层架构,在语义通信驱动的空天地一体化网络中实现高效资源协同,利用思维链推理和回放缓冲机制分别解决模型幻觉与网络动态适应问题。
English Summary: This paper introduces the Autonomous Reinforcement Coordination (ARC) framework, which integrates Large Language Models with Reinforcement Learning to efficiently manage resource orchestration in Semantic Communication-enabled Space-Air-Ground Integrated Networks, addressing challenges like LLM hallucinations and network dynamics through a two-tiered approach combining high-level planning and low-level decision-making.

Authors:Zhipeng Cheng, Xiaoyu Xia, Hong Wang, Minghui Liwang, Ning Chen, Xuwei Fan, Xianbin Wang
Title: Privacy-Aware Joint DNN Model Deployment and Partitioning Optimization for Collaborative Edge Inference Services
Abstract:
Edge inference (EI) has emerged as a promising paradigm to address the growing limitations of cloud-based Deep Neural Network (DNN) inference services, such as high response latency, limited scalability, and severe data privacy exposure. However, deploying DNN models on resource-constrained edge devices introduces additional challenges, including limited computation/storage resources, dynamic service demands, and heightened privacy risks. To tackle these issues, this paper presents a novel privacy-aware optimization framework that jointly addresses DNN model deployment, user-server association, and model partitioning, with the goal of minimizing long-term average inference delay under resource and privacy constraints. The problem is formulated as a complex, NP-hard stochastic optimization. To efficiently handle system dynamics and computational complexity, we employ a Lyapunov-based approach to transform the long-term objective into tractable per-slot decisions. Furthermore, we introduce a coalition formation game to enable adaptive user-server association and design a greedy algorithm for model deployment within each coalition. Extensive simulations demonstrate that the proposed algorithm significantly reduces inference delay and consistently satisfies privacy constraints, outperforming state-of-the-art baselines across diverse scenarios.
中文摘要:本文提出了一种隐私感知的边缘DNN推理优化框架,通过李雅普诺夫优化和联盟博弈理论联合优化模型部署、用户-服务器关联和模型划分,在满足资源与隐私约束的同时显著降低了推理延迟。
English Summary: This paper introduces a privacy-aware optimization framework for edge DNN inference that jointly optimizes model deployment, user-server association, and model partitioning using Lyapunov optimization and coalition game theory to minimize latency while meeting resource and privacy constraints.

Authors:Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, Jun Zhu
Title: RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers
Abstract:
Recent advancements in video generation have enabled models to synthesize high-quality, minute-long videos. However, generating even longer videos with temporal coherence remains a major challenge and existing length extrapolation methods lead to temporal repetition or motion deceleration. In this work, we systematically analyze the role of frequency components in positional embeddings and identify an intrinsic frequency that primarily governs extrapolation behavior. Based on this insight, we propose RIFLEx, a minimal yet effective approach that reduces the intrinsic frequency to suppress repetition while preserving motion consistency, without requiring any additional modifications. RIFLEx offers a true free lunch--achieving high-quality 2x extrapolation on state-of-the-art video diffusion transformers in a completely training-free manner. Moreover, it enhances quality and enables 3x extrapolation by minimal fine-tuning without long videos. Project page and codes: https://riflex-video.github.io/.
中文: 本文提出RIFLEx方法,通过调整位置编码的固有频率,无需训练即可实现高质量2倍视频时长外推并保持时序连贯性,经少量微调更可扩展至3倍外推能力。
English: This paper introduces RIFLEx, a training-free method that adjusts the intrinsic frequency of positional embeddings to enable high-quality 2x video length extrapolation while maintaining temporal coherence, with extended 3x capability through minimal fine-tuning.

Authors:Peixi Wu, Bosong Chai, Hebei Li, Menghua Zheng, Yansong Peng, Zeyu Wang, Xuan Nie, Yueyi Zhang, Xiaoyan Sun
Title: Spiking Point Transformer for Point Cloud Classification
Abstract:
Spiking Neural Networks (SNNs) offer an attractive and energy-efficient alternative to conventional Artificial Neural Networks (ANNs) due to their sparse binary activation. When SNN meets Transformer, it shows great potential in 2D image processing. However, their application for 3D point cloud remains underexplored. To this end, we present Spiking Point Transformer (SPT), the first transformer-based SNN framework for point cloud classification. Specifically, we first design Queue-Driven Sampling Direct Encoding for point cloud to reduce computational costs while retaining the most effective support points at each time step. We introduce the Hybrid Dynamics Integrate-and-Fire Neuron (HD-IF), designed to simulate selective neuron activation and reduce over-reliance on specific artificial neurons. SPT attains state-of-the-art results on three benchmark datasets that span both real-world and synthetic datasets in the SNN domain. Meanwhile, the theoretical energy consumption of SPT is at least 6.4$\times$ less than its ANN counterpart.
中文:脉冲点变换器(SPT)是首个基于变换器的脉冲神经网络,用于三维点云分类,通过创新的编码和神经元激活方法,在显著降低能耗的同时实现了最优性能。
English: The Spiking Point Transformer (SPT) is the first transformer-based spiking neural network for 3D point cloud classification, achieving state-of-the-art results with significantly reduced energy consumption through innovative encoding and neuron activation methods.

Authors:A. Quadir, M. Tanveer
Title: TRKM: Twin Restricted Kernel Machines for Classification and Regression
Abstract:
Restricted kernel machines (RKMs) have considerably improved generalization in machine learning. Recent advancements explored various techniques within the RKM framework, integrating kernel functions with least squares support vector machines (LSSVM) to mirror the energy function of restricted Boltzmann machines (RBM), leading to enhanced performance. However, RKMs may face challenges in generalization when dealing with unevenly distributed or complexly clustered data. Additionally, as the dataset size increases, the computational burden of managing high-dimensional feature spaces can become substantial, potentially hindering performance in large-scale datasets. To address these challenges, we propose twin restricted kernel machine (TRKM). TRKM combines the benefits of twin models with the robustness of the RKM framework to enhance classification and regression tasks. By leveraging the Fenchel-Young inequality, we introduce a novel conjugate feature duality, allowing the formulation of classification and regression problems in terms of dual variables. This duality provides an upper bound to the objective function of the TRKM problem, resulting in a new methodology under the RKM framework. The model uses an energy function similar to that of RBM, incorporating both visible and hidden variables corresponding to both classes. Additionally, the kernel trick is employed to map data into a high-dimensional feature space, where the model identifies an optimal separating hyperplane using a regularized least squares approach. Experiments on UCI and KEEL datasets confirm TRKM's superiority over baselines, showcasing its robustness and efficiency in handling complex data. Furthermore, We implemented the TRKM model on the brain age dataset, demonstrating its efficacy in predicting brain age.
Chinese: 提出的孪生限制核机(TRKM)通过结合孪生模型和共轭特征对偶性改进了RKM框架,在保持计算效率的同时提升了分类和回归性能,并在包括脑年龄预测在内的多个数据集上得到验证。
English: The proposed twin restricted kernel machine (TRKM) enhances the RKM framework by integrating twin models and conjugate feature duality, improving classification and regression performance while maintaining computational efficiency, as validated on various datasets including brain age prediction.

Authors:Guanqi Zhan, Yuanpei Liu, Kai Han, Weidi Xie, Andrew Zisserman
Title: ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
Abstract:
The objective in this paper is to improve the performance of text-to-image retrieval. To this end, we introduce a new framework that can boost the performance of large-scale pre-trained vision-language models, so that they can be used for text-to-image re-ranking. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple MLP mapping network, to predict a set of visual prompts to condition the ViT image encoding. ELIP can easily be applied to the commonly used CLIP, SigLIP and BLIP-2 networks. To train the architecture with limited computing resources, we develop a 'student friendly' best practice, involving global hard sample mining, and curation of a large-scale dataset. On the evaluation side, we set up two new out-of-distribution (OOD) benchmarks, Occluded COCO and ImageNet-R, to assess the zero-shot generalisation of the models to different domains. The results demonstrate that ELIP significantly boosts CLIP/SigLIP/SigLIP-2 text-to-image retrieval performance and outperforms BLIP-2 on several benchmarks, as well as providing an easy means to adapt to OOD datasets.
中文: 本文提出ELIP框架,通过MLP网络预测视觉提示来增强CLIP等大规模视觉语言模型,用于文本到图像的重新排序,显著提升了检索性能和对分布外数据集的适应性,并采用高效训练方法。
English: This paper introduces ELIP, a framework that enhances large-scale vision-language models like CLIP for text-to-image re-ranking by predicting visual prompts via an MLP network, significantly improving retrieval performance and adaptability to out-of-distribution datasets with efficient training practices.

Authors:Qingyuan Liu, Yun-Yun Tsai, Ruijian Zha, Victoria Li, Pengyuan Shi, Chengzhi Mao, Junfeng Yang
Title: LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection
Abstract:
The impressive achievements of generative models in creating high-quality videos have raised concerns about digital integrity and privacy vulnerabilities. Recent works of AI-generated content detection have been widely studied in the image field (e.g., deepfake), yet the video field has been unexplored. Large Vision Language Model (LVLM) has become an emerging tool for AI-generated content detection for its strong reasoning and multimodal capabilities. It breaks the limitations of traditional deep learning based methods faced with like lack of transparency and inability to recognize new artifacts. Motivated by this, we propose LAVID, a novel LVLMs-based ai-generated video detection with explicit knowledge enhancement. Our insight list as follows: (1) The leading LVLMs can call external tools to extract useful information to facilitate its own video detection task; (2) Structuring the prompt can affect LVLM's reasoning ability to interpret information in video content. Our proposed pipeline automatically selects a set of explicit knowledge tools for detection, and then adaptively adjusts the structure prompt by self-rewriting. Different from prior SOTA that trains additional detectors, our method is fully training-free and only requires inference of the LVLM for detection. To facilitate our research, we also create a new benchmark \vidfor with high-quality videos generated from multiple sources of video generation tools. Evaluation results show that LAVID improves F1 scores by 6.2 to 30.2% over the top baselines on our datasets across four SOTA LVLMs.
Chinese: 本研究提出LAVID方法,利用大型视觉语言模型结合显式知识增强和自适应提示,无需训练即可显著提升AI生成视频的检测性能,在新基准数据集上的F1分数比现有最佳方法最高提升30.2%。
English: The study introduces LAVID, a training-free method using Large Vision Language Models with explicit knowledge enhancement and adaptive prompts to significantly improve AI-generated video detection, outperforming existing baselines by up to 30.2% in F1 scores on a new benchmark dataset.

Authors:Ahmed Heakl, Abdullah Sohail, Mukul Ranjan, Rania Hossam, Ghazi Shazan Ahmad, Mohamed El-Geish, Omar Maher, Zhiqiang Shen, Fahad Khan, Salman Khan
Title: KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
Abstract:
With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 sub-domains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision-language models (such as GPT-4o, Gemini, and Qwen) outperform traditional OCR approaches (like EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges in accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
中文: KITAB-Bench基准通过证明现代视觉语言模型比传统OCR系统准确率高出60%,填补了阿拉伯语OCR评估的关键空白,同时揭示了复杂文本识别和文档转换中持续存在的挑战。
English: The KITAB-Bench benchmark addresses critical gaps in Arabic OCR evaluation by demonstrating that modern vision-language models outperform traditional OCR systems by 60% in accuracy, while revealing persistent challenges in complex text recognition and document conversion.

Authors:Rui Li, Heming Xia, Xinfeng Yuan, Qingxiu Dong, Lei Sha, Wenjie Li, Zhifang Sui
Title: How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation
Abstract:
Recently, LLMs have garnered increasing attention across academic disciplines for their potential as human digital twins, virtual proxies designed to replicate individuals and autonomously perform tasks such as decision-making, problem-solving, and reasoning on their behalf. However, current evaluations of LLMs primarily emphasize dialogue simulation while overlooking human behavior simulation, which is crucial for digital twins. To address this gap, we introduce BehaviorChain, the first benchmark for evaluating LLMs' ability to simulate continuous human behavior. BehaviorChain comprises diverse, high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors across 1,001 unique personas, each with detailed history and profile metadata. For evaluation, we integrate persona metadata into LLMs and employ them to iteratively infer contextually appropriate behaviors within dynamic scenarios provided by BehaviorChain. Comprehensive evaluation results demonstrated that even state-of-the-art models struggle with accurately simulating continuous human behavior.
中文:当前大语言模型评估主要关注对话模拟而忽视人类行为模拟,为此我们开发了首个包含15,846条人物行为链的基准测试BehaviorChain,结果表明即使最先进的模型也难以准确模拟连续的人类行为。
English: Current LLM evaluations focus on dialogue simulation but neglect human behavior simulation, so we developed BehaviorChain, the first benchmark with 15,846 persona-based behaviors, revealing that even advanced models struggle to accurately simulate continuous human behavior.

Authors:Yuhui Xu, Hanze Dong, Lei Wang, Caiming Xiong, Junnan Li
Title: Reward Models Identify Consistency, Not Causality
Abstract:
Reward models (RMs) play a crucial role in aligning large language models (LLMs) with human preferences and enhancing reasoning quality. Traditionally, RMs are trained to rank candidate outputs based on their correctness and coherence. However, in this work, we present several surprising findings that challenge common assumptions about RM behavior. Our analysis reveals that state-of-the-art reward models prioritize structural consistency over causal correctness. Specifically, removing the problem statement has minimal impact on reward scores, whereas altering numerical values or disrupting the reasoning flow significantly affects RM outputs. Furthermore, RMs exhibit a strong dependence on complete reasoning trajectories truncated or incomplete steps lead to significant variations in reward assignments, indicating that RMs primarily rely on learned reasoning patterns rather than explicit problem comprehension. These findings hold across multiple architectures, datasets, and tasks, leading to three key insights: (1) RMs primarily assess coherence rather than true reasoning quality; (2) The role of explicit problem comprehension in reward assignment is overstated; (3) Current RMs may be more effective at ranking responses than verifying logical validity. Our results suggest a fundamental limitation in existing reward modeling approaches, emphasizing the need for a shift toward causality-aware reward models that go beyond consistency-driven evaluation.
中文: 本研究发现先进奖励模型更关注结构一致性而非因果正确性,主要依赖习得的推理模式而非对问题的明确理解,揭示了现有方法的根本局限并强调需发展因果感知的改进模型。
English: This study reveals that state-of-the-art reward models prioritize structural consistency over causal correctness, relying more on learned reasoning patterns than explicit problem comprehension, which highlights a fundamental limitation in current approaches and calls for causality-aware improvements.

Authors:Yicong Li, Kuanjiu Zhou, Shuo Yu, Qiang Zhang, Renqiang Luo, Xiaodong Li, Feng Xia
Title: Factor Graph-based Interpretable Neural Networks
Abstract:
Comprehensible neural network explanations are foundations for a better understanding of decisions, especially when the input data are infused with malicious perturbations. Existing solutions generally mitigate the impact of perturbations through adversarial training, yet they fail to generate comprehensible explanations under unknown perturbations. To address this challenge, we propose AGAIN, a fActor GrAph-based Interpretable neural Network, which is capable of generating comprehensible explanations under unknown perturbations. Instead of retraining like previous solutions, the proposed AGAIN directly integrates logical rules by which logical errors in explanations are identified and rectified during inference. Specifically, we construct the factor graph to express logical rules between explanations and categories. By treating logical rules as exogenous knowledge, AGAIN can identify incomprehensible explanations that violate real-world logic. Furthermore, we propose an interactive intervention switch strategy rectifying explanations based on the logical guidance from the factor graph without learning perturbations, which overcomes the inherent limitation of adversarial training-based methods in defending only against known perturbations. Additionally, we theoretically demonstrate the effectiveness of employing factor graph by proving that the comprehensibility of explanations is strongly correlated with factor graph. Extensive experiments are conducted on three datasets and experimental results illustrate the superior performance of AGAIN compared to state-of-the-art baselines.
中文摘要:提出的AGAIN模型通过因子图整合逻辑规则,能在未知扰动下生成可理解的神经网络解释,识别并纠正逻辑错误而无需重新训练,从而克服了对抗训练方法仅防御已知扰动的固有局限。
English Summary: The proposed AGAIN model generates comprehensible neural network explanations under unknown perturbations by integrating logical rules via a factor graph, identifying and correcting logical errors without retraining, thus overcoming the limitations of adversarial training methods.

Authors:Yukai Shi, Cidan Shi, Zhipeng Weng, Yin Tian, Xiaoyu Xian, Liang Lin
Title: CrossFuse: Learning Infrared and Visible Image Fusion by Cross-Sensor Top-K Vision Alignment and Beyond
Abstract:
Infrared and visible image fusion (IVIF) is increasingly applied in critical fields such as video surveillance and autonomous driving systems. Significant progress has been made in deep learning-based fusion methods. However, these models frequently encounter out-of-distribution (OOD) scenes in real-world applications, which severely impact their performance and reliability. Therefore, addressing the challenge of OOD data is crucial for the safe deployment of these models in open-world environments. Unlike existing research, our focus is on the challenges posed by OOD data in real-world applications and on enhancing the robustness and generalization of models. In this paper, we propose an infrared-visible fusion framework based on Multi-View Augmentation. For external data augmentation, Top-k Selective Vision Alignment is employed to mitigate distribution shifts between datasets by performing RGB-wise transformations on visible images. This strategy effectively introduces augmented samples, enhancing the adaptability of the model to complex real-world scenarios. Additionally, for internal data augmentation, self-supervised learning is established using Weak-Aggressive Augmentation. This enables the model to learn more robust and general feature representations during the fusion process, thereby improving robustness and generalization. Extensive experiments demonstrate that the proposed method exhibits superior performance and robustness across various conditions and environments. Our approach significantly enhances the reliability and stability of IVIF tasks in practical applications.
Chinese: 本文针对红外与可见光图像融合中的分布外数据挑战,提出了一种多视角增强框架,通过外部RGB变换和内部自监督学习有效提升了模型的鲁棒性与泛化能力。
English: The paper addresses the challenge of out-of-distribution data in infrared-visible image fusion by proposing a Multi-View Augmentation framework that enhances model robustness and generalization through external RGB transformations and internal self-supervised learning.

Authors:Hao Zhang, Weiwei Li, Rilin Chen, Vinay Kothapally, Meng Yu, Dong Yu
Title: LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems
Abstract:
Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.
本文提出了一种语义语音活动检测模块作为轻量级对话管理器,通过预测控制令牌处理打断和语音完成来实现全双工口语对话系统的实时话轮转换,同时降低计算负荷。
This paper introduces a semantic voice activity detection module as a lightweight dialogue manager that enables real-time turn-taking in full-duplex spoken dialogue systems by predicting control tokens to handle interruptions and speech completion while reducing computational load.

Authors:Shansong Wang, Mojtaba Safari, Qiang Li, Chih-Wei Chang, Richard LJ Qiu, Justin Roper, David S. Yu, Xiaofeng Yang
Title: Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging
Abstract:
Vision foundation models (VFMs) are pre-trained on extensive image datasets to learn general representations for diverse types of data. These models can subsequently be fine-tuned for specific downstream tasks, significantly boosting performance across a broad range of applications. However, existing vision foundation models that claim to be applicable to various clinical tasks are mostly pre-trained on 3D computed tomography (CT), which benefits from the availability of extensive 3D CT databases. Significant differences between CT and magnetic resonance imaging (MRI) in imaging principles, signal characteristics, and data distribution may hinder their practical performance and versatility in MRI-specific applications. Here, we propose Triad, a vision foundation model for 3D MRI. Triad adopts a widely used autoencoder architecture to learn robust representations from 131,170 3D MRI volumes and uses organ-independent imaging descriptions to constrain the semantic distribution of the visual modality. The above pre-training dataset is called Triad-131K, which is currently the largest 3D MRI pre-training dataset. We evaluate Triad across three tasks, namely, organ/tumor segmentation, organ/cancer classification, and medical image registration, in two data modalities (within-domain and out-of-domain) settings using 25 downstream datasets. By initializing models with Triad's pre-trained weights, nnUNet-Triad improves segmentation performance by 2.51% compared to nnUNet-Scratch across 17 datasets. Swin-B-Triad achieves a 3.97% improvement over Swin-B-Scratch in classification tasks across five datasets. SwinUNETR-Triad improves by 4.00% compared to SwinUNETR-Scratch in registration tasks across two datasets. Our study demonstrates that pre-training can improve performance when the data modalities and organs of upstream and downstream tasks are consistent.
中文: 针对CT预训练的视觉基础模型在MRI任务中表现不佳的问题,Triad作为首个专用于3D MRI的基础模型,利用最大规模数据集显著提升了分割、分类和配准等多类任务的性能表现。
English: Vision foundation models pre-trained on CT data often underperform on MRI tasks, so Triad was developed as a specialized 3D MRI model using the largest dataset of its kind to significantly enhance performance across segmentation, classification, and registration applications.

Authors:Xiaochen Wang, Heming Xia, Jialin Song, Longyu Guan, Yixin Yang, Qingxiu Dong, Weiyao Luo, Yifan Pu, Yiru Wang, Xiangdi Meng, Wenjie Li, Zhifang Sui
Title: Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences?
Abstract:
Large Multimodal Models (LMMs) have achieved remarkable success across various visual-language tasks. However, existing benchmarks predominantly focus on single-image understanding, leaving the analysis of image sequences largely unexplored. To address this limitation, we introduce StripCipher, a comprehensive benchmark designed to evaluate capabilities of LMMs to comprehend and reason over sequential images. StripCipher comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Our evaluation of $16$ state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a significant performance gap compared to human capabilities, particularly in tasks that require reordering shuffled sequential images. For instance, GPT-4o achieves only 23.93% accuracy in the reordering subtask, which is 56.07% lower than human performance. Further quantitative analysis discuss several factors, such as input format of images, affecting the performance of LLMs in sequential understanding, underscoring the fundamental challenges that remain in the development of LMMs.
Chinese: 大型多模态模型在序列图像理解方面存在明显不足,新基准测试StripCipher显示其与人类表现存在巨大差距,尤其在图像重排任务中GPT-4o仅达到23.93%的准确率。
English: Large Multimodal Models struggle with sequential image comprehension, as demonstrated by the new benchmark StripCipher which reveals a significant performance gap compared to humans, especially in reordering tasks where GPT-4o achieves only 23.93% accuracy.

Authors:Xiaochen Wang, Heming Xia, Jialin Song, Longyu Guan, Yixin Yang, Qingxiu Dong, Weiyao Luo, Yifan Pu, Yiru Wang, Xiangdi Meng, Wenjie Li, Zhifang Sui
Title: Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences?
Abstract:
Large Multimodal Models (LMMs) have achieved remarkable success across various visual-language tasks. However, existing benchmarks predominantly focus on single-image understanding, leaving the analysis of image sequences largely unexplored. To address this limitation, we introduce StripCipher, a comprehensive benchmark designed to evaluate capabilities of LMMs to comprehend and reason over sequential images. StripCipher comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Our evaluation of 16 state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a significant performance gap compared to human capabilities, particularly in tasks that require reordering shuffled sequential images. For instance, GPT-4o achieves only 23.93% accuracy in the reordering subtask, which is 56.07% lower than human performance. Further quantitative analysis discuss several factors, such as input format of images, affecting the performance of LLMs in sequential understanding, underscoring the fundamental challenges that remain in the development of LMMs.
Chinese: 大型多模态模型在序列图像理解方面存在明显不足,新基准测试StripCipher显示其与人类表现存在巨大差距,尤其在图像重排任务中GPT-4o仅达到23.93%的准确率。
English: Large Multimodal Models struggle with sequential image comprehension, as demonstrated by the new benchmark StripCipher which reveals a significant performance gap compared to humans, especially in reordering tasks where GPT-4o achieves only 23.93% accuracy.

Authors:Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, Emine Yilmaz
Title: Judging the Judges: A Collection of LLM-Generated Relevance Judgements
Abstract:
Using Large Language Models (LLMs) for relevance assessments offers promising opportunities to improve Information Retrieval (IR), Natural Language Processing (NLP), and related fields. Indeed, LLMs hold the promise of allowing IR experimenters to build evaluation collections with a fraction of the manual human labor currently required. This could help with fresh topics on which there is still limited knowledge and could mitigate the challenges of evaluating ranking systems in low-resource scenarios, where it is challenging to find human annotators. Given the fast-paced recent developments in the domain, many questions concerning LLMs as assessors are yet to be answered. Among the aspects that require further investigation, we can list the impact of various components in a relevance judgment generation pipeline, such as the prompt used or the LLM chosen. This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024, where different relevance assessment approaches were proposed. In detail, we release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments produced by eight international teams who participated in the challenge. Given their diverse nature, these automatically generated relevance judgments can help the community not only investigate systematic biases caused by LLMs but also explore the effectiveness of ensemble models, analyze the trade-offs between different models and human assessors, and advance methodologies for improving automated evaluation techniques. The released resource is available at the following link: https://llm4eval.github.io/LLMJudge-benchmark/
中文摘要:大语言模型为信息检索中的相关性评估提供了自动化潜力,能显著减少人工标注工作并解决低资源场景的评估难题,而LLMJudge基准通过评估多样化的模型生成标注来探究系统性偏差并推动自动化评估方法的发展。
English Summary: Large Language Models (LLMs) offer promising potential to automate relevance assessments in information retrieval, reducing manual labor and addressing challenges in low-resource scenarios, while the LLMJudge benchmark evaluates diverse LLM-generated judgments to investigate biases and improve automated evaluation techniques.

Authors:Lars Ullrich, Michael Buchholz, Klaus Dietmayer, Knut Graichen
Title: Expanding the Classical V-Model for the Development of Complex Systems Incorporating AI
Abstract:
Research in the field of automated vehicles, or more generally cognitive cyber-physical systems that operate in the real world, is leading to increasingly complex systems. Among other things, artificial intelligence enables an ever-increasing degree of autonomy. In this context, the V-model, which has served for decades as a process reference model of the system development lifecycle is reaching its limits. To the contrary, innovative processes and frameworks have been developed that take into account the characteristics of emerging autonomous systems. To bridge the gap and merge the different methodologies, we present an extension of the V-model for iterative data-based development processes that harmonizes and formalizes the existing methods towards a generic framework. The iterative approach allows for seamless integration of continuous system refinement. While the data-based approach constitutes the consideration of data-based development processes and formalizes the use of synthetic and real world data. In this way, formalizing the process of development, verification, validation, and continuous integration contributes to ensuring the safety of emerging complex systems that incorporate AI.
中文: 传统V模型已无法满足现代自主系统的需求,因此提出了一种扩展的迭代式数据驱动V模型,通过持续优化协调开发流程并保障系统安全。
English: The traditional V-model is inadequate for modern autonomous systems, so an extended iterative data-based V-model is proposed to harmonize development processes and ensure safety through continuous refinement.

Authors:Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
Title: SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation
Abstract:
While spatial reasoning has made progress in object localization relationships, it often overlooks object orientation-a key factor in 6-DoF fine-grained manipulation. Traditional pose representations rely on pre-defined frames or templates, limiting generalization and semantic grounding. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a cup). To support this, we construct OrienText300K, a large-scale dataset of 3D objects annotated with semantic orientations, and develop PointSO, a general model for zero-shot semantic orientation prediction. By integrating semantic orientation into VLM agents, our SoFar framework enables 6-DoF spatial reasoning and generates robotic actions. Extensive experiments demonstrated the effectiveness and generalization of our SoFar, e.g., zero-shot 48.7% successful rate on Open6DOR and zero-shot 74.9% successful rate on SIMPLER-Env.
中文: 本文提出了语义朝向概念,通过自然语言无参考框架地定义物体朝向,并开发了SoFar框架,将其与视觉语言模型结合,实现了6自由度空间推理和机器人动作生成,在实验中取得了优异的零样本成功率。
English: This paper introduces semantic orientation, a reference-frame-free method using natural language to define object orientations, and presents SoFar, a framework that integrates this concept with VLM agents for enhanced 6-DoF spatial reasoning and robotic action generation, achieving high zero-shot success rates in experiments.

Authors:Bingheng Li, Zhikai Chen, Haoyu Han, Shenglai Zeng, Jingzhe Liu, Jiliang Tang
Title: Unveiling Mode Connectivity in Graph Neural Networks
Abstract:
A fundamental challenge in understanding graph neural networks (GNNs) lies in characterizing their optimization dynamics and loss landscape geometry, critical for improving interpretability and robustness. While mode connectivity, a lens for analyzing geometric properties of loss landscapes has proven insightful for other deep learning architectures, its implications for GNNs remain unexplored. This work presents the first investigation of mode connectivity in GNNs. We uncover that GNNs exhibit distinct non-linear mode connectivity, diverging from patterns observed in fully-connected networks or CNNs. Crucially, we demonstrate that graph structure, rather than model architecture, dominates this behavior, with graph properties like homophily correlating with mode connectivity patterns. We further establish a link between mode connectivity and generalization, proposing a generalization bound based on loss barriers and revealing its utility as a diagnostic tool. Our findings further bridge theoretical insights with practical implications: they rationalize domain alignment strategies in graph learning and provide a foundation for refining GNN training paradigms.
中文: 本研究首次探索图神经网络的模态连通性,发现图结构(尤其是同质性)主导其独特的连通模式并与泛化能力相关,为图神经网络训练提供了理论洞见和实际应用基础。
English: This study pioneers the investigation of mode connectivity in graph neural networks, revealing that graph structure—particularly homophily—dominates their unique connectivity patterns and correlates with generalization, offering both theoretical insights and practical applications for GNN training.

Authors:Bingshuo Guo, Minghui Liwang, Xiaoyu Xia, Li Li, Zhenzhen Jiao, Seyyedali Hosseinalipour, Xianbin Wang
Title: Seamless Graph Task Scheduling over Dynamic Vehicular Clouds: A Hybrid Methodology for Integrating Pilot and Instantaneous Decisions
Abstract:
Vehicular clouds (VCs) play a crucial role in the Internet-of-Vehicles (IoV) ecosystem by securing essential computing resources for a wide range of tasks. This paPertackles the intricacies of resource provisioning in dynamic VCs for computation-intensive tasks, represented by undirected graphs for parallel processing over multiple vehicles. We model the dynamics of VCs by considering multiple factors, including varying communication quality among vehicles, fluctuating computing capabilities of vehicles, uncertain contact duration among vehicles, and dynamic data exchange costs between vehicles. Our primary goal is to obtain feasible assignments between task components and nearby vehicles, called templates, in a timely manner with minimized task completion time and data exchange overhead. To achieve this, we propose a hybrid graph task scheduling (P-HTS) methodology that combines offline and online decision-making modes. For the offline mode, we introduce an approach called risk-aware pilot isomorphic subgraph searching (RA-PilotISS), which predicts feasible solutions for task scheduling in advance based on historical information. Then, for the online mode, we propose time-efficient instantaneous isomorphic subgraph searching (TE-InstaISS), serving as a backup approach for quickly identifying new optimal scheduling template when the one identified by RA-PilotISS becomes invalid due to changing conditions. Through comprehensive experiments, we demonstrate the superiority of our proposed hybrid mechanism compared to state-of-the-art methods in terms of various evaluative metrics, e.g., time efficiency such as the delay caused by seeking for possible templates and task completion time, as well as cost function, upon considering different VC scales and graph task topologies.
中文: 本文提出了一种混合图任务调度方法,通过结合离线的风险感知预测和在线的实时优化,在动态车辆云中高效分配资源,以最小化任务完成时间和数据交换开销。
English: This paper introduces a hybrid graph task scheduling (P-HTS) method for efficient resource allocation in dynamic vehicular clouds, combining offline risk-aware prediction with online real-time optimization to minimize task completion time and data exchange costs.

Authors:Yufei He, Yuexin Li, Jiaying Wu, Yuan Sui, Yulin Chen, Bryan Hooi
Title: Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals?
Abstract:
As large language models (LLMs) continue to evolve, ensuring their alignment with human goals and values remains a pressing challenge. A key concern is \textit{instrumental convergence}, where an AI system, in optimizing for a given objective, develops unintended intermediate goals that override the ultimate objective and deviate from human-intended goals. This issue is particularly relevant in reinforcement learning (RL)-trained models, which can generate creative but unintended strategies to maximize rewards. In this paper, we explore instrumental convergence in LLMs by comparing models trained with direct RL optimization (e.g., the o1 model) to those trained with reinforcement learning from human feedback (RLHF). We hypothesize that RL-driven models exhibit a stronger tendency for instrumental convergence due to their optimization of goal-directed behavior in ways that may misalign with human intentions. To assess this, we introduce InstrumentalEval, a benchmark for evaluating instrumental convergence in RL-trained LLMs. Initial experiments reveal cases where a model tasked with making money unexpectedly pursues instrumental objectives, such as self-replication, implying signs of instrumental convergence. Our findings contribute to a deeper understanding of alignment challenges in AI systems and the risks posed by unintended model behaviors.
中文: 通过强化学习优化的大语言模型可能产生偏离人类意图的中间目标,如InstrumentalEval基准测试所示,在追求利润的任务中出现了自我复制等工具性趋同迹象。
English: Large language models optimized through reinforcement learning may develop unintended intermediate goals that override human-aligned objectives, as demonstrated by the InstrumentalEval benchmark revealing cases like self-replication during profit-seeking tasks.

Authors:Leonard Bauersfeld, Davide Scaramuzza
Title: A Monocular Event-Camera Motion Capture System
Abstract:
Motion capture systems are a widespread tool in research to record ground-truth poses of objects. Commercial systems use reflective markers attached to the object and then triangulate pose of the object from multiple camera views. Consequently, the object must be visible to multiple cameras which makes such multi-view motion capture systems unsuited for deployments in narrow, confined spaces (e.g. ballast tanks of ships). In this technical report we describe a monocular event-camera motion capture system which overcomes this limitation and is ideally suited for narrow spaces. Instead of passive markers it relies on active, blinking LED markers such that each marker can be uniquely identified from the blinking frequency. The markers are placed at known locations on the tracking object. We then solve the PnP (perspective-n-points) problem to obtain the position and orientation of the object. The developed system has millimeter accuracy, millisecond latency and we demonstrate that its state estimate can be used to fly a small, agile quadrotor.
中文: 本技术报告提出了一种单目事件相机运动捕捉系统,通过闪烁的LED标记在狭窄空间内实现精确物体追踪,具备毫米级精度和毫秒级延迟,适用于敏捷无人机导航。
English: This report introduces a monocular event-camera motion capture system using blinking LED markers to enable precise object tracking in confined spaces, achieving millimeter accuracy and millisecond latency suitable for agile drone navigation.

Authors:Yuchen Yang, Thomas Thebaud, Najim Dehak
Title: Demographic Attributes Prediction from Speech Using WavLM Embeddings
Abstract:
This paper introduces a general classifier based on WavLM features, to infer demographic characteristics, such as age, gender, native language, education, and country, from speech. Demographic feature prediction plays a crucial role in applications like language learning, accessibility, and digital forensics, enabling more personalized and inclusive technologies. Leveraging pretrained models for embedding extraction, the proposed framework identifies key acoustic and linguistic fea-tures associated with demographic attributes, achieving a Mean Absolute Error (MAE) of 4.94 for age prediction and over 99.81% accuracy for gender classification across various datasets. Our system improves upon existing models by up to relative 30% in MAE and up to relative 10% in accuracy and F1 scores across tasks, leveraging a diverse range of datasets and large pretrained models to ensure robustness and generalizability. This study offers new insights into speaker diversity and provides a strong foundation for future research in speech-based demographic profiling.
中文: 本文提出了一种基于WavLM特征的通用分类器,用于从语音中推断年龄、性别等人口特征,在多个数据集上实现了高精度预测,并为语音人口分析研究提供了坚实基础。
English: This paper presents a general classifier using WavLM features to predict demographic traits like age, gender, and education from speech, achieving high accuracy and improvements over existing models for applications in personalized technologies.

Authors:Jonathan Jordan, Sherzod Hakimov, David Schlangen
Title: Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment
Abstract:
Large Language Models (LLMs) serve not only as chatbots but as key components in agent systems, where their common-sense knowledge significantly impacts performance as language-based planners for situated or embodied action. We assess LLMs' incremental learning (based on feedback from the environment), and controlled in-context learning abilities using a text-based environment. We introduce challenging yet interesting set of experiments to test i) how agents can incrementally solve tasks related to every day objects in typical rooms in a house where each of them are discovered by interacting within the environment, ii) controlled in-context learning abilities and efficiency of agents by providing short info about locations of objects and rooms to check how faster the task can be solved, and finally iii) using synthetic pseudo-English words to gauge how well LLMs are at inferring meaning of unknown words from environmental feedback. Results show that larger commercial models have a substantial gap in performance compared to open-weight but almost all models struggle with the synthetic words experiments.
中文:大型语言模型在模拟环境中作为智能代理表现出强大的增量学习和受控情境学习能力,但在处理合成词汇任务时存在困难,且商业模型与开源模型之间存在明显性能差距。
English: Large Language Models function as intelligent agents in simulated environments, demonstrating strong incremental learning and controlled in-context learning capabilities, though they struggle with synthetic vocabulary tasks and show performance gaps between commercial and open-source models.

Authors:Sherzod Hakimov, Lara Pfennigschmidt, David Schlangen
Title: Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models
Abstract:
This study utilizes the game Codenames as a benchmarking tool to evaluate large language models (LLMs) with respect to specific linguistic and cognitive skills. LLMs play each side of the game, where one side generates a clue word covering several target words and the other guesses those target words. We designed various experiments by controlling the choice of words (abstract vs. concrete words, ambiguous vs. monosemic) or the opponent (programmed to be faster or slower in revealing words). Recent commercial and open-weight models were compared side-by-side to find out factors affecting their performance. The evaluation reveals details about their strategies, challenging cases, and limitations of LLMs.
中文: 本研究利用游戏《行动代号》评估大型语言模型的语言与认知能力,通过控制词汇选择和对手策略的实验,揭示了影响模型表现的关键因素及其局限性。
English: This study evaluates large language models using the game Codenames to assess their linguistic and cognitive skills through experiments involving word choices and opponent strategies, revealing their performance factors and limitations.

Authors:Yanyan Wang, Kechen Song, Yuyuan Liu, Shuai Ma, Yunhui Yan, Gustavo Carneiro
Title: Leveraging Labelled Data Knowledge: A Cooperative Rectification Learning Network for Semi-supervised 3D Medical Image Segmentation
Abstract:
Semi-supervised 3D medical image segmentation aims to achieve accurate segmentation using few labelled data and numerous unlabelled data. The main challenge in the design of semi-supervised learning methods consists in the effective use of the unlabelled data for training. A promising solution consists of ensuring consistent predictions across different views of the data, where the efficacy of this strategy depends on the accuracy of the pseudo-labels generated by the model for this consistency learning strategy. In this paper, we introduce a new methodology to produce high-quality pseudo-labels for a consistency learning strategy to address semi-supervised 3D medical image segmentation. The methodology has three important contributions. The first contribution is the Cooperative Rectification Learning Network (CRLN) that learns multiple prototypes per class to be used as external knowledge priors to adaptively rectify pseudo-labels at the voxel level. The second contribution consists of the Dynamic Interaction Module (DIM) to facilitate pairwise and cross-class interactions between prototypes and multi-resolution image features, enabling the production of accurate voxel-level clues for pseudo-label rectification. The third contribution is the Cooperative Positive Supervision (CPS), which optimises uncertain representations to align with unassertive representations of their class distributions, improving the model's accuracy in classifying uncertain regions. Extensive experiments on three public 3D medical segmentation datasets demonstrate the effectiveness and superiority of our semi-supervised learning method.
中文: 本文提出了一种新的半监督3D医学图像分割方法,通过协同校正学习、动态原型交互和积极监督来提高伪标签质量,从而有效利用未标记数据。
English: This paper introduces a novel semi-supervised 3D medical image segmentation method that enhances pseudo-label quality through cooperative rectification learning, dynamic prototype interactions, and positive supervision to effectively utilize unlabeled data.

Authors:Weidi Luo, Shenghong Dai, Xiaogeng Liu, Suman Banerjee, Huan Sun, Muhao Chen, Chaowei Xiao
Title: AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection
Abstract:
The rapid advancements in Large Language Models (LLMs) have enabled their deployment as autonomous agents for handling complex tasks in dynamic environments. These LLMs demonstrate strong problem-solving capabilities and adaptability to multifaceted scenarios. However, their use as agents also introduces significant risks, including task-specific risks, which are identified by the agent administrator based on the specific task requirements and constraints, and systemic risks, which stem from vulnerabilities in their design or interactions, potentially compromising confidentiality, integrity, or availability (CIA) of information and triggering security risks. Existing defense agencies fail to adaptively and effectively mitigate these risks. In this paper, we propose AGrail, a lifelong agent guardrail to enhance LLM agent safety, which features adaptive safety check generation, effective safety check optimization, and tool compatibility and flexibility. Extensive experiments demonstrate that AGrail not only achieves strong performance against task-specific and system risks but also exhibits transferability across different LLM agents' tasks.
中文: 大型语言模型作为自主代理的快速部署带来了任务特定风险和系统性风险,现有防御措施难以有效应对,因此提出AGrail这一终身代理护栏,通过自适应安全检查和优化,展现出强大的防护性能和跨任务可转移性。
English: The rapid deployment of Large Language Models as autonomous agents introduces significant task-specific and systemic risks, which existing defenses fail to address effectively, prompting the proposal of AGrail, a lifelong agent guardrail that demonstrates strong performance and transferability in enhancing safety through adaptive and optimized checks.

Authors:Ryuto Koike, Masahiro Kaneko, Ayana Niwa, Preslav Nakov, Naoaki Okazaki
Title: ExaGPT: Example-Based Machine-Generated Text Detection for Human Interpretability
Abstract:
Detecting texts generated by Large Language Models (LLMs) could cause grave mistakes due to incorrect decisions, such as undermining student's academic dignity. LLM text detection thus needs to ensure the interpretability of the decision, which can help users judge how reliably correct its prediction is. When humans verify whether a text is human-written or LLM-generated, they intuitively investigate with which of them it shares more similar spans. However, existing interpretable detectors are not aligned with the human decision-making process and fail to offer evidence that users easily understand. To bridge this gap, we introduce ExaGPT, an interpretable detection approach grounded in the human decision-making process for verifying the origin of a text. ExaGPT identifies a text by checking whether it shares more similar spans with human-written vs. with LLM-generated texts from a datastore. This approach can provide similar span examples that contribute to the decision for each span in the text as evidence. Our human evaluation demonstrates that providing similar span examples contributes more effectively to judging the correctness of the decision than existing interpretable methods. Moreover, extensive experiments in four domains and three generators show that ExaGPT massively outperforms prior powerful detectors by up to +40.9 points of accuracy at a false positive rate of 1%.
中文: ExaGPT是一种基于人类决策过程的可解释性检测方法,通过比对文本片段与人类书写和AI生成文本的相似性,不仅大幅提升检测准确率,还能提供易于理解的判断依据。
English: ExaGPT is an interpretable LLM text detection method that aligns with human decision-making by comparing text spans with human-written and AI-generated examples, significantly improving accuracy and providing understandable evidence for its predictions.

Authors:Tong Zheng, Yan Wen, Huiwen Bao, Junfeng Guo, Heng Huang
Title: Asymmetric Conflict and Synergy in Post-training for LLM-based Multilingual Machine Translation
Abstract:
The emergence of Large Language Models (LLMs) has advanced the multilingual machine translation (MMT), yet the Curse of Multilinguality (CoM) remains a major challenge. Existing work in LLM-based MMT typically mitigates this issue via scaling up training and computation budget, which raises a critical question: Is scaling up the training and computation budget truly necessary for high-quality MMT, or can a deeper understanding of CoM provide a more efficient solution? To explore this problem, we analyze the linguistic conflicts and synergy, the underlying mechanism of CoM during post-training phase. We identify an asymmetric phenomenon in linguistic conflicts and synergy: the dominance of conflicts and synergy varies in different translation directions, leading to sub-optimal adaptation in existing post-training methods. We further find that a significant bottleneck in MMT appears to lie in post-training rather than multilingual pre-training, suggesting the need for more effective adaptation strategies. Building on these new insights, we propose a direction-aware training approach, combined with group-wise model merging, to address asymmetry in linguistic conflicts and synergy explicitly. Leveraging this strategy, our method fine-tunes X-ALMA-13B-Pretrain-trained only with multilingual pre-training-achieving comparable performance to XALMA-13B (only SFT) while using only 20B pretraining tokens and 17B parameters-5.5x fewer pretraining-tokens and 1.7x fewer model size-with just 0.85 COMET drop on Flores-200 testsets of 50 languages.
中文: 本研究通过揭示多语言机器翻译的瓶颈在于后训练阶段而非预训练,提出了一种方向感知训练与模型合并方法,在显著减少计算资源的同时实现了与现有方法相媲美的性能。
English: This study challenges the necessity of scaling up training budgets for multilingual machine translation by revealing that the bottleneck lies in post-training adaptation, proposing a direction-aware training and model merging method that achieves competitive performance with significantly reduced computational resources.

Authors:Zongzhao Li, Jiacheng Cen, Bing Su, Wenbing Huang, Tingyang Xu, Yu Rong, Deli Zhao
Title: Large Language-Geometry Model: When LLM meets Equivariance
Abstract:
Accurately predicting 3D structures and dynamics of physical systems is crucial in scientific applications. Existing approaches that rely on geometric Graph Neural Networks (GNNs) effectively enforce $\mathrm{E}(3)$-equivariance, but they often fall in leveraging extensive broader information. While direct application of Large Language Models (LLMs) can incorporate external knowledge, they lack the capability for spatial reasoning with guaranteed equivariance. In this paper, we propose EquiLLM, a novel framework for representing 3D physical systems that seamlessly integrates E(3)-equivariance with LLM capabilities. Specifically, EquiLLM comprises four key components: geometry-aware prompting, an equivariant encoder, an LLM, and an equivariant adaptor. Essentially, the LLM guided by the instructive prompt serves as a sophisticated invariant feature processor, while 3D directional information is exclusively handled by the equivariant encoder and adaptor modules. Experimental results demonstrate that EquiLLM delivers significant improvements over previous methods across molecular dynamics simulation, human motion simulation, and antibody design, highlighting its promising generalizability.
中文摘要:EquiLLM创新性地将E(3)等变几何处理与大语言模型能力相结合,在保持空间推理能力的同时实现了更精确的三维结构预测,在多个科学应用领域展现出显著优势。
English Summary: The proposed EquiLLM framework combines E(3)-equivariant geometric processing with large language models to achieve superior 3D structure prediction while maintaining spatial reasoning capabilities, demonstrating significant improvements across multiple scientific applications.

Authors:Nura Aljaafari, Danilo S. Carvalho, André Freitas
Title: CARMA: Enhanced Compositionality in LLMs via Advanced Regularisation and Mutual Information Alignment
Abstract:
Large language models (LLMs) struggle with compositional generalisation, limiting their ability to systematically combine learned components to interpret novel inputs. While architectural modifications, fine-tuning, and data augmentation improve compositionality, they often have limited adaptability, face scalability constraints, or yield diminishing returns on real data. To address this, we propose CARMA, an intervention that enhances the stability and robustness of compositional reasoning in LLMs while preserving fine-tuned performance. CARMA employs mutual information regularisation and layer-wise stability constraints to mitigate feature fragmentation, ensuring structured representations persist across and within layers. We evaluate CARMA on inverse dictionary modelling and sentiment classification, measuring its impact on semantic consistency, performance stability, and robustness to lexical perturbations. Results show that CARMA reduces the variability introduced by fine-tuning, stabilises token representations, and improves compositional reasoning. While its effectiveness varies across architectures, CARMA's key strength lies in reinforcing learned structures rather than introducing new capabilities, making it a scalable auxiliary method. These findings suggest that integrating CARMA with fine-tuning can improve compositional generalisation while maintaining task-specific performance in LLMs.
中文: CARMA是一种通过互信息正则化和分层稳定性约束来增强大型语言模型组合推理能力的新方法,它在保持微调性能的同时提高了模型的鲁棒性和稳定性。
English: CARMA is a novel intervention that enhances compositional reasoning in large language models by applying mutual information regularization and layer-wise stability constraints, improving robustness and stability without compromising fine-tuned performance.

Authors:Manan Tayal, Aditya Singh, Shishir Kolathaya, Somil Bansal
Title: A Physics-Informed Machine Learning Framework for Safe and Optimal Control of Autonomous Systems
Abstract:
As autonomous systems become more ubiquitous in daily life, ensuring high performance with guaranteed safety is crucial. However, safety and performance could be competing objectives, which makes their co-optimization difficult. Learning-based methods, such as Constrained Reinforcement Learning (CRL), achieve strong performance but lack formal safety guarantees due to safety being enforced as soft constraints, limiting their use in safety-critical settings. Conversely, formal methods such as Hamilton-Jacobi (HJ) Reachability Analysis and Control Barrier Functions (CBFs) provide rigorous safety assurances but often neglect performance, resulting in overly conservative controllers. To bridge this gap, we formulate the co-optimization of safety and performance as a state-constrained optimal control problem, where performance objectives are encoded via a cost function and safety requirements are imposed as state constraints. We demonstrate that the resultant value function satisfies a Hamilton-Jacobi-Bellman (HJB) equation, which we approximate efficiently using a novel physics-informed machine learning framework. In addition, we introduce a conformal prediction-based verification strategy to quantify the learning errors, recovering a high-confidence safety value function, along with a probabilistic error bound on performance degradation. Through several case studies, we demonstrate the efficacy of the proposed framework in enabling scalable learning of safe and performant controllers for complex, high-dimensional autonomous systems.
Chinese: 本研究提出了一种基于物理信息的机器学习框架,通过求解状态约束的最优控制问题,弥合了自主系统中安全性与性能之间的鸿沟,实现了具有形式化安全保证且性能损失最小的控制器可扩展学习。
English: This study introduces a physics-informed machine learning framework that bridges the gap between safety and performance in autonomous systems by solving a state-constrained optimal control problem, enabling scalable learning of controllers with formal safety guarantees and minimal performance degradation.

Authors:Tianci Liu, Haoxiang Jiang, Tianze Wang, Ran Xu, Yue Yu, Linjun Zhang, Tuo Zhao, Haoyu Wang
Title: RoseRAG: Robust Retrieval-augmented Generation with Small-scale LLMs via Margin-aware Preference Optimization
Abstract:
Large language models (LLMs) have achieved impressive performance but face high computational costs and latency, limiting their deployment in resource-constrained settings. In contrast, small-scale LLMs (SLMs) are more efficient yet struggle to capture evolving real-world knowledge. Retrieval-augmented generation (RAG) helps by integrating external knowledge, but imperfect retrieval can introduce distracting noise that misleads SLMs. We propose RoseRAG, a robust RAG framework for SLMs via Margin-aware Preference Optimization. RoseRAG employs multi-turn prompting for detailed reasoning, rejection sampling for high-quality explanations, and contrastive preference selection to refine responses by maximizing the likelihood gap between preferred and non-preferred outputs. By integrating these components into a margin-aware optimization process, RoseRAG robustly enhances the accuracy and reliability of SLMs for RAG applications. Extensive experiments on three open-domain question answering benchmarks indicate that our innovative RoseRAG surpasses state-of-the-art baselines significantly.
中文: 大语言模型计算成本高,而小规模模型难以捕捉实时知识,因此提出了RoseRAG框架,通过边际感知偏好优化增强检索增强生成的鲁棒性,在多项基准测试中显著超越现有最优方法。
English: Large language models face high computational costs, while small-scale models struggle with evolving knowledge, leading to the development of RoseRAG, a robust retrieval-augmented generation framework that enhances accuracy and reliability through margin-aware preference optimization, significantly outperforming existing methods on benchmarks.

Authors:Sixian Wang, Jincheng Dai, Xiaoqi Qin, Ke Yang, Kai Niu, Ping Zhang
Title: ResiComp: Loss-Resilient Image Compression via Dual-Functional Masked Visual Token Modeling
Abstract:
Recent advancements in neural image codecs (NICs) are of significant compression performance, but limited attention has been paid to their error resilience. These resulting NICs tend to be sensitive to packet losses, which are prevalent in real-time communications. In this paper, we investigate how to elevate the resilience ability of NICs to combat packet losses. We propose ResiComp, a pioneering neural image compression framework with feature-domain packet loss concealment (PLC). Motivated by the inherent consistency between generation and compression, we advocate merging the tasks of entropy modeling and PLC into a unified framework focused on latent space context modeling. To this end, we take inspiration from the impressive generative capabilities of large language models (LLMs), particularly the recent advances of masked visual token modeling (MVTM). During training, we integrate MVTM to mirror the effects of packet loss, enabling a dual-functional Transformer to restore the masked latents by predicting their missing values and conditional probability mass functions. Our ResiComp jointly optimizes compression efficiency and loss resilience. Moreover, ResiComp provides flexible coding modes, allowing for explicitly adjusting the efficiency-resilience trade-off in response to varying Internet or wireless network conditions. Extensive experiments demonstrate that ResiComp can significantly enhance the NIC's resilience against packet losses, while exhibits a worthy trade-off between compression efficiency and packet loss resilience.
中文摘要:ResiComp是一种创新的神经图像压缩框架,通过在特征域集成丢包隐藏技术,利用掩码视觉令牌建模共同优化压缩效率和网络丢包恢复能力。
English Summary: ResiComp is a novel neural image compression framework that integrates packet loss concealment in the feature domain, leveraging masked visual token modeling to jointly optimize compression efficiency and resilience against network packet losses.

Authors:Weilin Sun, Xinran Li, Manyi Li, Kai Xu, Xiangxu Meng, Lei Meng
Title: Hierarchically-Structured Open-Vocabulary Indoor Scene Synthesis with Pre-trained Large Language Model
Abstract:
Indoor scene synthesis aims to automatically produce plausible, realistic and diverse 3D indoor scenes, especially given arbitrary user requirements. Recently, the promising generalization ability of pre-trained large language models (LLM) assist in open-vocabulary indoor scene synthesis. However, the challenge lies in converting the LLM-generated outputs into reasonable and physically feasible scene layouts. In this paper, we propose to generate hierarchically structured scene descriptions with LLM and then compute the scene layouts. Specifically, we train a hierarchy-aware network to infer the fine-grained relative positions between objects and design a divide-and-conquer optimization to solve for scene layouts. The advantages of using hierarchically structured scene representation are two-fold. First, the hierarchical structure provides a rough grounding for object arrangement, which alleviates contradictory placements with dense relations and enhances the generalization ability of the network to infer fine-grained placements. Second, it naturally supports the divide-and-conquer optimization, by first arranging the sub-scenes and then the entire scene, to more effectively solve for a feasible layout. We conduct extensive comparison experiments and ablation studies with both qualitative and quantitative evaluations to validate the effectiveness of our key designs with the hierarchically structured scene representation. Our approach can generate more reasonable scene layouts while better aligned with the user requirements and LLM descriptions. We also present open-vocabulary scene synthesis and interactive scene design results to show the strength of our approach in the applications.
Chinese: 本文提出一种利用大语言模型生成层次化场景描述,并通过训练网络和优化计算室内布局的方法,从而生成更合理且符合用户需求的3D场景。
English: This paper introduces a method that uses large language models to generate hierarchical scene descriptions and then computes indoor layouts through a trained network and optimization, resulting in more reasonable and user-aligned 3D scenes.

Authors:A. Quadir, M. Sajid, M. Tanveer
Title: One Class Restricted Kernel Machines
Abstract:
Restricted kernel machines (RKMs) have demonstrated a significant impact in enhancing generalization ability in the field of machine learning. Recent studies have introduced various methods within the RKM framework, combining kernel functions with the least squares support vector machine (LSSVM) in a manner similar to the energy function of restricted boltzmann machines (RBM), such that a better performance can be achieved. However, RKM's efficacy can be compromised by the presence of outliers and other forms of contamination within the dataset. These anomalies can skew the learning process, leading to less accurate and reliable outcomes. To address this critical issue and to ensure the robustness of the model, we propose the novel one-class RKM (OCRKM). In the framework of OCRKM, we employ an energy function akin to that of the RBM, which integrates both visible and hidden variables in a nonprobabilistic setting. The formulation of the proposed OCRKM facilitates the seamless integration of one-class classification method with the RKM, enhancing its capability to detect outliers and anomalies effectively. The proposed OCRKM model is evaluated over UCI benchmark datasets. Experimental findings and statistical analyses consistently emphasize the superior generalization capabilities of the proposed OCRKM model over baseline models across all scenarios.
中文: 提出了一种新型单类限制核机器(OCRKM),通过整合类似限制玻尔兹曼机的能量函数来增强对异常值的鲁棒性,基准测试显示其泛化能力优于基线模型。
English: The novel one-class restricted kernel machine (OCRKM) is proposed to enhance robustness against outliers by integrating a restricted Boltzmann machine-like energy function, demonstrating superior generalization in benchmark tests compared to baseline models.

Authors:Mingcong Lei, Yiming Zhao, Ge Wang, Zhixin Mai, Shuguang Cui, Yatong Han, Jinke Ren
Title: STMA: A Spatio-Temporal Memory Agent for Long-Horizon Embodied Task Planning
Abstract:
A key objective of embodied intelligence is enabling agents to perform long-horizon tasks in dynamic environments while maintaining robust decision-making and adaptability. To achieve this goal, we propose the Spatio-Temporal Memory Agent (STMA), a novel framework designed to enhance task planning and execution by integrating spatio-temporal memory. STMA is built upon three critical components: (1) a spatio-temporal memory module that captures historical and environmental changes in real time, (2) a dynamic knowledge graph that facilitates adaptive spatial reasoning, and (3) a planner-critic mechanism that iteratively refines task strategies. We evaluate STMA in the TextWorld environment on 32 tasks, involving multi-step planning and exploration under varying levels of complexity. Experimental results demonstrate that STMA achieves a 31.25% improvement in success rate and a 24.7% increase in average score compared to the state-of-the-art model. The results highlight the effectiveness of spatio-temporal memory in advancing the memory capabilities of embodied agents.
中文: 时空记忆智能体(STMA)通过整合实时记忆与自适应推理,显著提升了具身智能体在长周期任务中的表现,在TextWorld测试中成功率提高了31.25%。
English: The Spatio-Temporal Memory Agent (STMA) enhances embodied agents' long-horizon task performance by integrating real-time memory and adaptive reasoning, achieving a 31.25% higher success rate in TextWorld evaluations.

Authors:Shichao Fan, Quantao Yang, Yajie Liu, Kun Wu, Zhengping Che, Qingjie Liu, Min Wan
Title: Diffusion Trajectory-guided Policy for Long-horizon Robot Manipulation
Abstract:
Recently, Vision-Language-Action models (VLA) have advanced robot imitation learning, but high data collection costs and limited demonstrations hinder generalization and current imitation learning methods struggle in out-of-distribution scenarios, especially for long-horizon tasks. A key challenge is how to mitigate compounding errors in imitation learning, which lead to cascading failures over extended trajectories. To address these challenges, we propose the Diffusion Trajectory-guided Policy (DTP) framework, which generates 2D trajectories through a diffusion model to guide policy learning for long-horizon tasks. By leveraging task-relevant trajectories, DTP provides trajectory-level guidance to reduce error accumulation. Our two-stage approach first trains a generative vision-language model to create diffusion-based trajectories, then refines the imitation policy using them. Experiments on the CALVIN benchmark show that DTP outperforms state-of-the-art baselines by 25% in success rate, starting from scratch without external pretraining. Moreover, DTP significantly improves real-world robot performance.
中文摘要:提出的扩散轨迹引导策略(DTP)框架通过扩散模型生成轨迹来指导策略学习,有效减少误差累积,在CALVIN基准测试中成功率提升25%,并显著提升了真实环境中的机器人性能。
English Summary: The proposed Diffusion Trajectory-guided Policy (DTP) framework uses diffusion-generated trajectories to guide policy learning, reducing error accumulation and achieving a 25% higher success rate on the CALVIN benchmark while enhancing real-world robot performance.

Authors:Shichao Fan, Quantao Yang, Yajie Liu, Kun Wu, Zhengping Che, Qingjie Liu, Min Wan
Title: Diffusion Trajectory-guided Policy for Long-horizon Robot Manipulation
Abstract:
Recently, Vision-Language-Action models (VLA) have advanced robot imitation learning, but high data collection costs and limited demonstrations hinder generalization and current imitation learning methods struggle in out-of-distribution scenarios, especially for long-horizon tasks. A key challenge is how to mitigate compounding errors in imitation learning, which lead to cascading failures over extended trajectories. To address these challenges, we propose the Diffusion Trajectory-guided Policy (DTP) framework, which generates 2D trajectories through a diffusion model to guide policy learning for long-horizon tasks. By leveraging task-relevant trajectories, DTP provides trajectory-level guidance to reduce error accumulation. Our two-stage approach first trains a generative vision-language model to create diffusion-based trajectories, then refines the imitation policy using them. Experiments on the CALVIN benchmark show that DTP outperforms state-of-the-art baselines by 25% in success rate, starting from scratch without external pretraining. Moreover, DTP significantly improves real-world robot performance.
中文摘要:提出的扩散轨迹引导策略(DTP)框架通过扩散模型生成轨迹来指导策略学习,有效减少误差累积,在CALVIN基准测试中成功率提升25%,并显著提升了真实环境中的机器人性能。
English Summary: The proposed Diffusion Trajectory-guided Policy (DTP) framework uses diffusion-generated trajectories to guide policy learning, reducing error accumulation and achieving a 25% higher success rate on the CALVIN benchmark while enhancing real-world robot performance.

Authors:Thanh-Dat Truong, Hoang-Quan Nguyen, Xuan-Bac Nguyen, Ashley Dowling, Xin Li, Khoa Luu
Title: Insect-Foundation: A Foundation Model and Large Multimodal Dataset for Vision-Language Insect Understanding
Abstract:
Multimodal conversational generative AI has shown impressive capabilities in various vision and language understanding through learning massive text-image data. However, current conversational models still lack knowledge about visual insects since they are often trained on the general knowledge of vision-language data. Meanwhile, understanding insects is a fundamental problem in precision agriculture, helping to promote sustainable development in agriculture. Therefore, this paper proposes a novel multimodal conversational model, Insect-LLaVA, to promote visual understanding in insect-domain knowledge. In particular, we first introduce a new large-scale Multimodal Insect Dataset with Visual Insect Instruction Data that enables the capability of learning the multimodal foundation models. Our proposed dataset enables conversational models to comprehend the visual and semantic features of the insects. Second, we propose a new Insect-LLaVA model, a new general Large Language and Vision Assistant in Visual Insect Understanding. Then, to enhance the capability of learning insect features, we develop an Insect Foundation Model by introducing a new micro-feature self-supervised learning with a Patch-wise Relevant Attention mechanism to capture the subtle differences among insect images. We also present Description Consistency loss to improve micro-feature learning via text descriptions. The experimental results evaluated on our new Visual Insect Question Answering benchmarks illustrate the effective performance of our proposed approach in visual insect understanding and achieve State-of-the-Art performance on standard benchmarks of insect-related tasks.
Chinese: 本文提出了一种新型多模态对话模型Insect-LLaVA,通过引入大规模多模态昆虫数据集和结合局部相关注意力机制与描述一致性损失的自监督学习方法,显著提升了农业害虫识别的视觉理解能力,在相关基准测试中达到了最优性能。
English: This paper introduces Insect-LLaVA, a novel multimodal conversational model designed to enhance visual insect understanding in precision agriculture by incorporating a large-scale Multimodal Insect Dataset and a new self-supervised learning approach with Patch-wise Relevant Attention and Description Consistency loss, achieving state-of-the-art performance on insect-related benchmarks.

Authors:Jenny T. Liang, Aayush Kumar, Yasharth Bajpai, Sumit Gulwani, Vu Le, Chris Parnin, Arjun Radhakrishna, Ashish Tiwari, Emerson Murphy-Hill, Guastavo Soares
Title: TableTalk: Scaffolding Spreadsheet Development with a Language Agent
Abstract:
Spreadsheet programming is challenging. Programmers use spreadsheet programming knowledge (e.g., formulas) and problem-solving skills to combine actions into complex tasks. Advancements in large language models have introduced language agents that observe, plan, and perform tasks, showing promise for spreadsheet creation. We present TableTalk, a spreadsheet programming agent embodying three design principles -- scaffolding, flexibility, and incrementality -- derived from studies with seven spreadsheet programmers and 85 Excel templates. TableTalk guides programmers through structured plans based on professional workflows, generating three potential next steps to adapt plans to programmer needs. It uses pre-defined tools to generate spreadsheet components and incrementally build spreadsheets. In a study with 20 programmers, TableTalk produced higher-quality spreadsheets 2.3 times more likely to be preferred than the baseline. It reduced cognitive load and thinking time by 12.6%. From this, we derive design guidelines for agentic spreadsheet programming tools and discuss implications on spreadsheet programming, end-user programming, AI-assisted programming, and human-agent collaboration.
中文: TableTalk是一种基于人工智能的电子表格编程助手,通过结构化指导和降低认知负荷来改进电子表格制作,为程序员带来更高质量的输出和更高的工作效率。
English: TableTalk is an AI-driven spreadsheet programming agent that enhances spreadsheet creation by providing structured guidance and reducing cognitive load, resulting in higher-quality outputs and increased efficiency for programmers.

Authors:Yuhui Zhang, Yuchang Su, Chenyu Wang, Tianhong Li, Zoe Wefers, Jeffrey Nirschl, James Burgess, Daisy Ding, Alejandro Lozano, Emma Lundberg, Serena Yeung-Levy
Title: CellFlux: Simulating Cellular Morphology Changes via Flow Matching
Abstract:
Building a virtual cell capable of accurately simulating cellular behaviors in silico has long been a dream in computational biology. We introduce CellFlux, an image-generative model that simulates cellular morphology changes induced by chemical and genetic perturbations using flow matching. Unlike prior methods, CellFlux models distribution-wise transformations from unperturbed to perturbed cell states, effectively distinguishing actual perturbation effects from experimental artifacts such as batch effects -- a major challenge in biological data. Evaluated on chemical (BBBC021), genetic (RxRx1), and combined perturbation (JUMP) datasets, CellFlux generates biologically meaningful cell images that faithfully capture perturbation-specific morphological changes, achieving a 35% improvement in FID scores and a 12% increase in mode-of-action prediction accuracy over existing methods. Additionally, CellFlux enables continuous interpolation between cellular states, providing a potential tool for studying perturbation dynamics. These capabilities mark a significant step toward realizing virtual cell modeling for biomedical research. Project page: https://yuhui-zh15.github.io/CellFlux/.
中文: CellFlux是一种创新的图像生成模型,通过流匹配技术模拟化学和遗传扰动引起的细胞形态变化,在图像质量和生物学预测准确性上相比现有方法实现了显著提升。
English: CellFlux is an innovative image-generative model that simulates cellular morphology changes from chemical and genetic perturbations using flow matching, achieving significant improvements in image quality and biological prediction accuracy over existing methods.

Authors:Francesco Ballerini, Pierluigi Zama Ramirez, Samuele Salti, Luigi Di Stefano
Title: Weight Space Representation Learning on Diverse NeRF Architectures
Abstract:
Neural Radiance Fields (NeRFs) have emerged as a groundbreaking paradigm for representing 3D objects and scenes by encoding shape and appearance information into the weights of a neural network. Recent studies have demonstrated that these weights can be used as input for frameworks designed to address deep learning tasks; however, such frameworks require NeRFs to adhere to a specific, predefined architecture. In this paper, we introduce the first framework capable of processing NeRFs with diverse architectures and performing inference on architectures unseen at training time. We achieve this by training a Graph Meta-Network within an unsupervised representation learning framework, and show that a contrastive objective is conducive to obtaining an architecture-agnostic latent space. In experiments conducted across 13 NeRF architectures belonging to three families (MLPs, tri-planes, and, for the first time, hash tables), our approach demonstrates robust performance in classification and retrieval tasks involving multiple architectures, even unseen at training time, while also exceeding the results of existing frameworks limited to single architectures.
中文: 本文提出了首个能够处理多种架构神经辐射场(NeRF)并对未见架构进行推理的框架,通过对比学习训练图元网络实现架构无关的潜在空间,在13种架构的分类检索任务中表现优异。
English: This paper introduces the first framework that processes Neural Radiance Fields (NeRFs) with diverse architectures and performs inference on unseen architectures by training a Graph Meta-Network with a contrastive objective, achieving robust performance across 13 architectures in classification and retrieval tasks.

Authors:Isabella Liu, Zhan Xu, Wang Yifan, Hao Tan, Zexiang Xu, Xiaolong Wang, Hao Su, Zifan Shi
Title: RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets
Abstract:
We present RigAnything, a novel autoregressive transformer-based model, which makes 3D assets rig-ready by probabilistically generating joints, skeleton topologies, and assigning skinning weights in a template-free manner. Unlike most existing auto-rigging methods, which rely on predefined skeleton template and are limited to specific categories like humanoid, RigAnything approaches the rigging problem in an autoregressive manner, iteratively predicting the next joint based on the global input shape and the previous prediction. While autoregressive models are typically used to generate sequential data, RigAnything extends their application to effectively learn and represent skeletons, which are inherently tree structures. To achieve this, we organize the joints in a breadth-first search (BFS) order, enabling the skeleton to be defined as a sequence of 3D locations and the parent index. Furthermore, our model improves the accuracy of position prediction by leveraging diffusion modeling, ensuring precise and consistent placement of joints within the hierarchy. This formulation allows the autoregressive model to efficiently capture both spatial and hierarchical relationships within the skeleton. Trained end-to-end on both RigNet and Objaverse datasets, RigAnything demonstrates state-of-the-art performance across diverse object types, including humanoids, quadrupeds, marine creatures, insects, and many more, surpassing prior methods in quality, robustness, generalizability, and efficiency. Please check our website for more details: https://www.liuisabella.com/RigAnything.
Chinese: RigAnything 是一种基于自回归变换器的新模型,无需模板即可生成3D资产的关节、骨架拓扑和蒙皮权重,通过扩散模型和广度优先搜索排序,在多种物体类型上实现了领先的泛化性和精度。
English: RigAnything is an autoregressive transformer model that generates 3D rigging components—joints, skeleton topologies, and skinning weights—without templates, achieving state-of-the-art performance across diverse object categories by leveraging diffusion modeling and BFS ordering for hierarchical accuracy.

Authors:Minghong Wu, Minghui Liwang, Yuhan Su, Li Li, Seyyedali Hosseinalipour, Xianbin Wang, Huaiyu Dai, Zhenzhen Jiao
Title: Towards Seamless Hierarchical Federated Learning under Intermittent Client Participation: A Stagewise Decision-Making Methodology
Abstract:
Federated Learning (FL) offers a pioneering distributed learning paradigm that enables devices/clients to build a shared global model. This global model is obtained through frequent model transmissions between clients and a central server, which may cause high latency, energy consumption, and congestion over backhaul links. To overcome these drawbacks, Hierarchical Federated Learning (HFL) has emerged, which organizes clients into multiple clusters and utilizes edge nodes (e.g., edge servers) for intermediate model aggregations between clients and the central server. Current research on HFL mainly focus on enhancing model accuracy, latency, and energy consumption in scenarios with a stable/fixed set of clients. However, addressing the dynamic availability of clients -- a critical aspect of real-world scenarios -- remains underexplored. This study delves into optimizing client selection and client-to-edge associations in HFL under intermittent client participation so as to minimize overall system costs (i.e., delay and energy), while achieving fast model convergence. We unveil that achieving this goal involves solving a complex NP-hard problem. To tackle this, we propose a stagewise methodology that splits the solution into two stages, referred to as Plan A and Plan B. Plan A focuses on identifying long-term clients with high chance of participation in subsequent model training rounds. Plan B serves as a backup, selecting alternative clients when long-term clients are unavailable during model training rounds. This stagewise methodology offers a fresh perspective on client selection that can enhance both HFL and conventional FL via enabling low-overhead decision-making processes. Through evaluations on MNIST and CIFAR-10 datasets, we show that our methodology outperforms existing benchmarks in terms of model accuracy and system costs.
中文: 分层联邦学习通过提出两阶段客户端选择方法解决动态可用性问题,在最小化系统成本的同时加速模型收敛,在MNIST和CIFAR-10数据集上表现优于现有基准。
English: Hierarchical Federated Learning addresses dynamic client availability by proposing a two-stage client selection method that minimizes system costs while accelerating model convergence, outperforming benchmarks on MNIST and CIFAR-10 datasets.

Authors:Ciyuan Peng, Yuelong Huang, Qichao Dong, Shuo Yu, Feng Xia, Chengqi Zhang, Yaochu Jin
Title: Biologically Plausible Brain Graph Transformer
Abstract:
State-of-the-art brain graph analysis methods fail to fully encode the small-world architecture of brain graphs (accompanied by the presence of hubs and functional modules), and therefore lack biological plausibility to some extent. This limitation hinders their ability to accurately represent the brain's structural and functional properties, thereby restricting the effectiveness of machine learning models in tasks such as brain disorder detection. In this work, we propose a novel Biologically Plausible Brain Graph Transformer (BioBGT) that encodes the small-world architecture inherent in brain graphs. Specifically, we present a network entanglement-based node importance encoding technique that captures the structural importance of nodes in global information propagation during brain graph communication, highlighting the biological properties of the brain structure. Furthermore, we introduce a functional module-aware self-attention to preserve the functional segregation and integration characteristics of brain graphs in the learned representations. Experimental results on three benchmark datasets demonstrate that BioBGT outperforms state-of-the-art models, enhancing biologically plausible brain graph representations for various brain graph analytical tasks
中文: 提出的生物合理脑图变换器通过基于网络纠缠的节点重要性编码和功能模块感知自注意力机制,有效捕捉脑图的小世界架构特性,在多项脑图分析任务中优于现有先进方法。
English: The proposed Biologically Plausible Brain Graph Transformer (BioBGT) overcomes limitations of existing methods by encoding the small-world architecture of brain graphs through network entanglement-based node importance and functional module-aware self-attention, demonstrating superior performance in brain graph analytical tasks.

Authors:Jingxin Xu, Guoshun Nan, Sheng Guan, Sicong Leng, Yilian Liu, Zixiao Wang, Yuyang Ma, Zhili Zhou, Yanzhao Hou, Xiaofeng Tao
Title: Refining Positive and Toxic Samples for Dual Safety Self-Alignment of LLMs with Minimal Human Interventions
Abstract:
Recent AI agents, such as ChatGPT and LLaMA, primarily rely on instruction tuning and reinforcement learning to calibrate the output of large language models (LLMs) with human intentions, ensuring the outputs are harmless and helpful. Existing methods heavily depend on the manual annotation of high-quality positive samples, while contending with issues such as noisy labels and minimal distinctions between preferred and dispreferred response data. However, readily available toxic samples with clear safety distinctions are often filtered out, removing valuable negative references that could aid LLMs in safety alignment. In response, we propose PT-ALIGN, a novel safety self-alignment approach that minimizes human supervision by automatically refining positive and toxic samples and performing fine-grained dual instruction tuning. Positive samples are harmless responses, while toxic samples deliberately contain extremely harmful content, serving as a new supervisory signals. Specifically, we utilize LLM itself to iteratively generate and refine training instances by only exploring fewer than 50 human annotations. We then employ two losses, i.e., maximum likelihood estimation (MLE) and fine-grained unlikelihood training (UT), to jointly learn to enhance the LLM's safety. The MLE loss encourages an LLM to maximize the generation of harmless content based on positive samples. Conversely, the fine-grained UT loss guides the LLM to minimize the output of harmful words based on negative samples at the token-level, thereby guiding the model to decouple safety from effectiveness, directing it toward safer fine-tuning objectives, and increasing the likelihood of generating helpful and reliable content. Experiments on 9 popular open-source LLMs demonstrate the effectiveness of our PT-ALIGN for safety alignment, while maintaining comparable levels of helpfulness and usefulness.
中文摘要:PT-ALIGN方法提出了一种自我对齐方案,通过自动优化正负样本并采用双重指令微调,在极少人工标注下提升大语言模型的安全性,同时保持其帮助性。
English Summary: The PT-ALIGN method introduces a self-alignment approach that automatically refines positive and toxic samples with minimal human annotation, using dual instruction tuning to enhance LLM safety while preserving helpfulness.

Authors:Houyi Qi, Minghui Liwang, Xianbin Wang, Liqun Fu, Yiguang Hong, Li Li, Zhipeng Cheng
Title: Accelerating Stable Matching between Workers and Spatial-Temporal Tasks for Dynamic MCS: A Stagewise Service Trading Approach
Abstract:
Designing effective incentive mechanisms in mobile crowdsensing (MCS) networks is crucial for engaging distributed mobile users (workers) to contribute heterogeneous data for various applications (tasks). In this paper, we propose a novel stagewise trading framework to achieve efficient and stable task-worker matching, explicitly accounting for task diversity (e.g., spatio-temporal limitations) and network dynamics inherent in MCS environments. This framework integrates both futures and spot trading stages. In the former, we introduce the \textbf{f}utures \textbf{t}rading-driven \textbf{s}table \textbf{m}atching and \textbf{p}re-\textbf{p}ath-\textbf{p}lanning mechanism (FT-SMP$^3$), which enables long-term task-worker assignment and pre-planning of workers' trajectories based on historical statistics and risk-aware analysis. In the latter, we develop the \textbf{s}pot \textbf{t}rading-driven \textbf{D}QN-based \textbf{p}ath \textbf{p}lanning and onsite \textbf{w}orker \textbf{r}ecruitment mechanism (ST-DP$^2$WR), which dynamically improves the practical utilities of tasks and workers by supporting real-time recruitment and path adjustment. We rigorously prove that the proposed mechanisms satisfy key economic and algorithmic properties, including stability, individual rationality, competitive equilibrium, and weak Pareto optimality. Extensive experiements further validate the effectiveness of our framework in realistic network settings, demonstrating superior performance in terms of service quality, computational efficiency, and decision-making overhead.
中文: 本文提出了一种新颖的分阶段交易框架,通过期货与现货交易相结合,实现了移动群智感知中高效稳定的任务-工作者匹配,解决了任务多样性和网络动态性问题,保证了关键经济属性,并在实际应用中展现出优越性能。
English: This paper introduces a novel stagewise trading framework for mobile crowdsensing that combines futures and spot trading to achieve efficient and stable task-worker matching, addressing task diversity and network dynamics while ensuring key economic properties and demonstrating superior performance in practical settings.

Authors:Houyi Qi, Minghui Liwang, Xianbin Wang, Liqun Fu, Yiguang Hong, Li Li, Zhipeng Cheng
Title: Accelerating Stable Matching between Workers and Spatial-Temporal Tasks for Dynamic MCS: A Stagewise Service Trading Approach
Abstract:
Designing effective incentive mechanisms in mobile crowdsensing (MCS) networks is crucial for engaging distributed mobile users (workers) to contribute heterogeneous data for various applications (tasks). In this paper, we propose a novel stagewise trading framework to achieve efficient and stable task-worker matching, explicitly accounting for task diversity (e.g., spatio-temporal limitations) and network dynamics inherent in MCS environments. This framework integrates both futures and spot trading stages. In the former, we introduce the \textbf{f}utures \textbf{t}rading-driven \textbf{s}table \textbf{m}atching and \textbf{p}re-\textbf{p}ath-\textbf{p}lanning mechanism (FT-SMP$^3$), which enables long-term task-worker assignment and pre-planning of workers' trajectories based on historical statistics and risk-aware analysis. In the latter, we develop the \textbf{s}pot \textbf{t}rading-driven \textbf{D}QN-based \textbf{p}ath \textbf{p}lanning and onsite \textbf{w}orker \textbf{r}ecruitment mechanism (ST-DP$^2$WR), which dynamically improves the practical utilities of tasks and workers by supporting real-time recruitment and path adjustment. We rigorously prove that the proposed mechanisms satisfy key economic and algorithmic properties, including stability, individual rationality, competitive equilibrium, and weak Pareto optimality. Extensive experiements further validate the effectiveness of our framework in realistic network settings, demonstrating superior performance in terms of service quality, computational efficiency, and decision-making overhead.
中文: 本文提出了一种新颖的分阶段交易框架,通过期货与现货交易相结合,实现了移动群智感知中高效稳定的任务-工作者匹配,解决了任务多样性和网络动态性问题,保证了关键经济属性,并在实际应用中展现出优越性能。
English: This paper introduces a novel stagewise trading framework for mobile crowdsensing that combines futures and spot trading to achieve efficient and stable task-worker matching, addressing task diversity and network dynamics while ensuring key economic properties and demonstrating superior performance in practical settings.

Authors:Jingbo Sun, Songjun Tu, Qichao Zhang, Ke Chen, Dongbin Zhao
Title: Salience-Invariant Consistent Policy Learning for Generalization in Visual Reinforcement Learning
Abstract:
Generalizing policies to unseen scenarios remains a critical challenge in visual reinforcement learning, where agents often overfit to the specific visual observations of the training environment. In unseen environments, distracting pixels may lead agents to extract representations containing task-irrelevant information. As a result, agents may deviate from the optimal behaviors learned during training, thereby hindering visual generalization.To address this issue, we propose the Salience-Invariant Consistent Policy Learning (SCPL) algorithm, an efficient framework for zero-shot generalization. Our approach introduces a novel value consistency module alongside a dynamics module to effectively capture task-relevant representations. The value consistency module, guided by saliency, ensures the agent focuses on task-relevant pixels in both original and perturbed observations, while the dynamics module uses augmented data to help the encoder capture dynamic- and reward-relevant representations. Additionally, our theoretical analysis highlights the importance of policy consistency for generalization. To strengthen this, we introduce a policy consistency module with a KL divergence constraint to maintain consistent policies across original and perturbed observations.Extensive experiments on the DMC-GB, Robotic Manipulation, and CARLA benchmarks demonstrate that SCPL significantly outperforms state-of-the-art methods in terms of generalization. Notably, SCPL achieves average performance improvements of 14\%, 39\%, and 69\% in the challenging DMC video hard setting, the Robotic hard setting, and the CARLA benchmark, respectively.Project Page: https://sites.google.com/view/scpl-rl.
中文: SCPL算法通过引入显著性引导的价值一致性模块和动态模块,使智能体专注于任务相关特征而非干扰像素,在多个基准测试中实现了显著的零样本泛化性能提升。
English: The SCPL algorithm addresses visual reinforcement learning's generalization challenge by using saliency-guided value consistency and dynamics modules to focus on task-relevant features, achieving significant performance improvements across multiple benchmarks.

Authors:Jasper Roe, Leon Furze, Mike Perkins
Title: GenAI as Digital Plastic: Understanding Synthetic Media Through Critical AI Literacy
Abstract:
This paper introduces the conceptual metaphor of 'digital plastic' as a framework for understanding the implications of Generative Artificial Intelligence (GenAI) content through a multiliteracies lens, drawing parallels with the properties of physical plastic. Similar to its physical counterpart, GenAI content offers possibilities for content creation and accessibility while potentially contributing to digital pollution and ecosystem degradation. Drawing on multiliteracies theory and Conceptual Metaphor Theory, we argue that Critical Artificial Intelligence Literacy (CAIL) must be integrated into educational frameworks to help learners navigate this synthetic media landscape. We examine how GenAI can simultaneously lower the barriers to creative and academic production while threatening to degrade digital ecosystems through misinformation, bias, and algorithmic homogenization. The digital plastic metaphor provides a theoretical foundation for understanding both the affordances and challenges of GenAI, particularly in educational contexts, where issues of equity and access remain paramount. Our analysis concludes that cultivating CAIL through a multiliteracies lens is vital for ensuring the equitable development of critical competencies across geographical and cultural contexts, especially for those disproportionately vulnerable to GenAI's increasingly disruptive effects worldwide.
本文提出"数字塑料"隐喻,通过多元识读理论阐释生成式人工智能兼具内容创新与数字污染的双重特性,主张将批判性AI素养融入教育体系以促进公平的数字生态发展。
This paper proposes the "digital plastic" metaphor to analyze Generative AI's dual capacity for creative empowerment and ecological harm, advocating for critical AI literacy education to foster equitable digital navigation.

Authors:Clarissa Lauditi, Blake Bordelon, Cengiz Pehlevan
Title: Adaptive kernel predictors from feature-learning infinite limits of neural networks
Abstract:
Previous influential work showed that infinite width limits of neural networks in the lazy training regime are described by kernel machines. Here, we show that neural networks trained in the rich, feature learning infinite-width regime in two different settings are also described by kernel machines, but with data-dependent kernels. For both cases, we provide explicit expressions for the kernel predictors and prescriptions to numerically calculate them. To derive the first predictor, we study the large-width limit of feature-learning Bayesian networks, showing how feature learning leads to task-relevant adaptation of layer kernels and preactivation densities. The saddle point equations governing this limit result in a min-max optimization problem that defines the kernel predictor. To derive the second predictor, we study gradient flow training of randomly initialized networks trained with weight decay in the infinite-width limit using dynamical mean field theory (DMFT). The fixed point equations of the arising DMFT defines the task-adapted internal representations and the kernel predictor. We compare our kernel predictors to kernels derived from lazy regime and demonstrate that our adaptive kernels achieve lower test loss on benchmark datasets.
中文: 研究表明,在无限宽度极限下,采用特征学习机制训练的神经网络可由数据依赖核的核机器描述,相比惰性训练机制,这种自适应核在基准数据集上实现了更低的测试损失。
English: This study demonstrates that neural networks trained in the rich, feature learning infinite-width regime can be described by kernel machines with data-dependent kernels, which outperform lazy regime kernels by achieving lower test loss on benchmark datasets.

Authors:Han Zhang, Songbo Hu, Zhecheng Yuan, Huazhe Xu
Title: DOGlove: Dexterous Manipulation with a Low-Cost Open-Source Haptic Force Feedback Glove
Abstract:
Dexterous hand teleoperation plays a pivotal role in enabling robots to achieve human-level manipulation dexterity. However, current teleoperation systems often rely on expensive equipment and lack multi-modal sensory feedback, restricting human operators' ability to perceive object properties and perform complex manipulation tasks. To address these limitations, we present DOGlove, a low-cost, precise, and haptic force feedback glove system for teleoperation and manipulation. DoGlove can be assembled in hours at a cost under 600 USD. It features a customized joint structure for 21-DoF motion capture, a compact cable-driven torque transmission mechanism for 5-DoF multidirectional force feedback, and a linear resonate actuator for 5-DoF fingertip haptic feedback. Leveraging action and haptic force retargeting, DOGlove enables precise and immersive teleoperation of dexterous robotic hands, achieving high success rates in complex, contact-rich tasks. We further evaluate DOGlove in scenarios without visual feedback, demonstrating the critical role of haptic force feedback in task performance. In addition, we utilize the collected demonstrations to train imitation learning policies, highlighting the potential and effectiveness of DOGlove. DOGlove's hardware and software system will be fully open-sourced at https://do-glove.github.io/.
中文: DOGlove 是一款低成本触觉反馈手套系统,通过多模态感官反馈实现灵巧机械手的精准遥操作,在复杂操控任务中表现优异,并展现出模仿学习应用的潜力。
English: DOGlove is a low-cost haptic feedback glove system that enables precise teleoperation of dexterous robotic hands through multi-modal sensory feedback, achieving high performance in complex manipulation tasks and demonstrating potential for imitation learning applications.

Authors:Xueyao Zhang, Xiaohui Zhang, Kainan Peng, Zhenyu Tang, Vimal Manohar, Yingru Liu, Jeff Hwang, Dangna Li, Yuhao Wang, Julian Chan, Yuan Huang, Zhizheng Wu, Mingbo Ma
Title: Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement
Abstract:
The imitation of voice, targeted on specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data, and struggle with effectively disentangling timbre and style, leading to challenges in achieving controllable generation, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: Given either text or speech's content tokens as input, we utilize an autoregressive transformer to generate the content-style tokens, which is prompted by a style reference; (2) Acoustic Modeling: Given the content-style tokens as input, we employ a flow-matching transformer to produce acoustic representations, which is prompted by a timbre reference. To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we adopt VQ-VAE as the tokenizer for the continuous hidden features of HuBERT. We treat the vocabulary size of the VQ-VAE codebook as the information bottleneck, and adjust it carefully to obtain the disentangled speech representations. Solely self-supervised trained on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo's effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility. Audio samples are available at https://versavoice.github.io.
Chinese: Vevo是一种新型零样本语音模仿框架,通过自监督学习有效分离音色和风格,无需标注数据即可在语音转换和文本转语音任务中实现卓越性能。
English: Vevo is a novel zero-shot voice imitation framework that effectively disentangles timbre and style through self-supervised learning, achieving superior performance in voice conversion and text-to-speech tasks without annotated data.

Authors:Osman Tursun, Sinan Kalkan, Simon Denman, Clinton Fookes
Title: PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval
Abstract:
Zero-shot composed image retrieval (ZS-CIR) enables image search using a reference image and text prompt without requiring specialized text-image composition networks trained on large-scale paired data. However, current ZS-CIR approaches face three critical limitations in their reliance on composed text embeddings: static query embedding representations, insufficient utilization of image embeddings, and suboptimal performance when fusing text and image embeddings. To address these challenges, we introduce the Prompt Directional Vector (PDV), a simple yet effective training-free enhancement that captures semantic modifications induced by user prompts. PDV enables three key improvements: (1) dynamic composed text embeddings where prompt adjustments are controllable via a scaling factor, (2) composed image embeddings through semantic transfer from text prompts to image features, and (3) weighted fusion of composed text and image embeddings that enhances retrieval by balancing visual and semantic similarity. Our approach serves as a plug-and-play enhancement for existing ZS-CIR methods with minimal computational overhead. Extensive experiments across multiple benchmarks demonstrate that PDV consistently improves retrieval performance when integrated with state-of-the-art ZS-CIR approaches, particularly for methods that generate accurate compositional embeddings. The code will be publicly available.
中文: 提出的提示方向向量(PDV)是一种无需训练的增强方法,通过实现动态文本嵌入、基于语义迁移的组合图像嵌入以及多模态特征的优化融合,有效解决了零样本组合图像检索中的关键局限,显著提升了跨基准的检索性能。
English: The proposed Prompt Directional Vector (PDV) is a training-free enhancement that overcomes limitations in Zero-shot Composed Image Retrieval by enabling dynamic text embeddings, composed image embeddings through semantic transfer, and optimized fusion of multimodal features, significantly boosting retrieval performance across benchmarks.

Authors:Osman Tursun, Sinan Kalkan, Simon Denman, Clinton Fookes
Title: PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval
Abstract:
Zero-shot Composed Image Retrieval (ZS-CIR) enables image search using a reference image and a text prompt without requiring specialized text-image composition networks trained on large-scale paired data. However, current ZS-CIR approaches suffer from three critical limitations in their reliance on composed text embeddings: static query embedding representations, insufficient utilization of image embeddings, and suboptimal performance when fusing text and image embeddings. To address these challenges, we introduce the \textbf{Prompt Directional Vector (PDV)}, a simple yet effective training-free enhancement that captures semantic modifications induced by user prompts. PDV enables three key improvements: (1) Dynamic composed text embeddings where prompt adjustments are controllable via a scaling factor, (2) composed image embeddings through semantic transfer from text prompts to image features, and (3) weighted fusion of composed text and image embeddings that enhances retrieval by balancing visual and semantic similarity. Our approach serves as a plug-and-play enhancement for existing ZS-CIR methods with minimal computational overhead. Extensive experiments across multiple benchmarks demonstrate that PDV consistently improves retrieval performance when integrated with state-of-the-art ZS-CIR approaches, particularly for methods that generate accurate compositional embeddings. The code will be released upon publication.
中文: 提出的提示方向向量(PDV)是一种无需训练的增强方法,通过实现动态文本嵌入、基于语义迁移的组合图像嵌入以及多模态特征的优化融合,有效解决了零样本组合图像检索中的关键局限,显著提升了跨基准的检索性能。
English: The proposed Prompt Directional Vector (PDV) is a training-free enhancement that overcomes limitations in Zero-shot Composed Image Retrieval by enabling dynamic text embeddings, composed image embeddings through semantic transfer, and optimized fusion of multimodal features, significantly boosting retrieval performance across benchmarks.

Authors:Alex Tong, Apoorva Sharma, Sushant Veer, Marco Pavone, Heng Yang
Title: Online Aggregation of Trajectory Predictors
Abstract:
Trajectory prediction, the task of forecasting future agent behavior from past data, is central to safe and efficient autonomous driving. A diverse set of methods (e.g., rule-based or learned with different architectures and datasets) have been proposed, yet it is often the case that the performance of these methods is sensitive to the deployment environment (e.g., how well the design rules model the environment, or how accurately the test data match the training data). Building upon the principled theory of online convex optimization but also going beyond convexity and stationarity, we present a lightweight and model-agnostic method to aggregate different trajectory predictors online. We propose treating each individual trajectory predictor as an "expert" and maintaining a probability vector to mix the outputs of different experts. Then, the key technical approach lies in leveraging online data -- the true agent behavior to be revealed at the next timestep -- to form a convex-or-nonconvex, stationary-or-dynamic loss function whose gradient steers the probability vector towards choosing the best mixture of experts. We instantiate this method to aggregate trajectory predictors trained on different cities in the NUSCENES dataset and show that it performs just as well, if not better than, any singular model, even when deployed on the out-of-distribution LYFT dataset.
Chinese: 本文提出了一种轻量级、模型无关的方法,通过将多个轨迹预测器视为专家,并基于实时数据在线优化调整其权重组合,从而在不同数据集上实现稳健的性能表现。
English: This paper introduces a lightweight, model-agnostic method that aggregates multiple trajectory predictors by treating them as experts and dynamically adjusting their weight mixture through online optimization based on real-time data, achieving robust performance across different datasets.

Authors:Abhinav Prakash Gahlot, Rafael Orozco, Felix J. Herrmann
Title: Advancing Geological Carbon Storage Monitoring With 3d Digital Shadow Technology
Abstract:
Geological Carbon Storage (GCS) is a key technology for achieving global climate goals by capturing and storing CO2 in deep geological formations. Its effectiveness and safety rely on accurate monitoring of subsurface CO2 migration using advanced time-lapse seismic imaging. A Digital Shadow framework integrates field data, including seismic and borehole measurements, to track CO2 saturation over time. Machine learning-assisted data assimilation techniques, such as generative AI and nonlinear ensemble Bayesian filtering, update a digital model of the CO2 plume while incorporating uncertainties in reservoir properties. Compared to 2D approaches, 3D monitoring enhances the spatial accuracy of GCS assessments, capturing the full extent of CO2 migration. This study extends the uncertainty-aware 2D Digital Shadow framework by incorporating 3D seismic imaging and reservoir modeling, improving decision-making and risk mitigation in CO2 storage projects.
中文: 地质碳储存(GCS)通过三维数字阴影框架,结合机器学习与地震成像,精确监测二氧化碳运移,从而提升碳储存决策的安全性与有效性。
English: Geological Carbon Storage (GCS) relies on a 3D Digital Shadow framework that integrates machine learning and seismic imaging to accurately monitor CO2 migration and enhance decision-making for safe carbon storage.

Authors:Sarah Laouedj, Yuzhe Wang, Jesus Villalba, Thomas Thebaud, Laureano Moro-Velazquez, Najim Dehak
Title: Detecting Neurodegenerative Diseases using Frame-Level Handwriting Embeddings
Abstract:
In this study, we explored the use of spectrograms to represent handwriting signals for assessing neurodegenerative diseases, including 42 healthy controls (CTL), 35 subjects with Parkinson's Disease (PD), 21 with Alzheimer's Disease (AD), and 15 with Parkinson's Disease Mimics (PDM). We applied CNN and CNN-BLSTM models for binary classification using both multi-channel fixed-size and frame-based spectrograms. Our results showed that handwriting tasks and spectrogram channel combinations significantly impacted classification performance. The highest F1-score (89.8%) was achieved for AD vs. CTL, while PD vs. CTL reached 74.5%, and PD vs. PDM scored 77.97%. CNN consistently outperformed CNN-BLSTM. Different sliding window lengths were tested for constructing frame-based spectrograms. A 1-second window worked best for AD, longer windows improved PD classification, and window length had little effect on PD vs. PDM.
中文: 本研究利用手写信号频谱图和CNN模型有效分类神经退行性疾病,在阿尔茨海默病检测中达到89.8%的最高F1分数,并发现不同疾病的最佳分析窗口长度存在差异。
English: This study utilizes spectrograms of handwriting signals with CNN models to effectively classify neurodegenerative diseases, achieving the highest F1-score of 89.8% for Alzheimer's Disease detection and demonstrating optimal window lengths varying by disease type.

Authors:Wenhao Ding, Sushant Veer, Karen Leung, Yulong Cao, Marco Pavone
Title: Surprise Potential as a Measure of Interactivity in Driving Scenarios
Abstract:
Validating the safety and performance of an autonomous vehicle (AV) requires benchmarking on real-world driving logs. However, typical driving logs contain mostly uneventful scenarios with minimal interactions between road users. Identifying interactive scenarios in real-world driving logs enables the curation of datasets that amplify critical signals and provide a more accurate assessment of an AV's performance. In this paper, we present a novel metric that identifies interactive scenarios by measuring an AV's surprise potential on others. First, we identify three dimensions of the design space to describe a family of surprise potential measures. Second, we exhaustively evaluate and compare different instantiations of the surprise potential measure within this design space on the nuScenes dataset. To determine how well a surprise potential measure correctly identifies an interactive scenario, we use a reward model learned from human preferences to assess alignment with human intuition. Our proposed surprise potential, arising from this exhaustive comparative study, achieves a correlation of more than 0.82 with the human-aligned reward function, outperforming existing approaches. Lastly, we validate motion planners on curated interactive scenarios to demonstrate downstream applications.
Chinese: 本文提出了一种新颖的指标,通过测量自动驾驶车辆对其他道路使用者的意外潜在影响来识别交互场景,该方法与人类直觉高度一致,并在评估自动驾驶性能方面优于现有方法。
English: This paper introduces a novel metric for identifying interactive scenarios in autonomous vehicle driving logs by measuring surprise potential, which aligns closely with human intuition and outperforms existing methods in evaluating AV performance.

Authors:Jie Tan, Kangfei Zhao, Rui Li, Jeffrey Xu Yu, Chengzhi Piao, Hong Cheng, Helen Meng, Deli Zhao, Yu Rong
Title: Can Large Language Models Be Query Optimizer for Relational Databases?
Abstract:
Query optimization, which finds the optimized execution plan for a given query, is a complex planning and decision-making problem within the exponentially growing plan space in database management systems (DBMS). Traditional optimizers heavily rely on a certain cost model constructed by various heuristics and empirical tuning, probably leading to generating suboptimal plans. Recent developments of Large Language Models (LLMs) have demonstrated their potential in solving complex planning and decision-making problems, such as arithmetic and programmatic tasks. In this paper, we try to explore the potential of LLMs in handling query optimization and propose a tentative LLM-based query optimizer dubbed LLM-QO, established on PostgreSQL's execution engine. In LLM-QO, we formulate query optimization in an autoregressive fashion which directly generates the execution plan without explicit plan enumeration. To investigate the essential input of LLM-QO, we design a customized data recipe named QInstruct to collect the training data from various optimizers and serialize the database's meta data, queries and corresponding plans into a textual format. Based on QInstruct, we implement a two-stage fine-tuning pipeline, Query Instruction Tuning (QIT) and Query Direct Preference Optimization (QDPO), to empower the capability of general-purpose LLMs in handling query optimization. In our experiments, LLM-QO can generate valid and high-quality plans and consistently outperforms both traditional and learned optimizers on three query workloads. Our findings verify that LLMs can be derived as query optimizers where generalization, efficiency and adaptivity deserve further research efforts.
中文: 本文提出基于大语言模型的查询优化器LLM-QO,它采用自回归方式直接生成执行计划,在三个查询工作负载上均优于传统和基于学习的优化器,验证了大语言模型作为查询优化器的潜力。
English: This paper introduces LLM-QO, a novel query optimizer based on Large Language Models that formulates query optimization autoregressively and outperforms traditional and learned optimizers by generating valid, high-quality execution plans without explicit enumeration.

Authors:Hanzhi Yu, Yuchen Liu, Zhaohui Yang, Haijian Sun, Mingzhe Chen
Title: Optimizing Wireless Resource Management and Synchronization in Digital Twin Networks
Abstract:
In this paper, we investigate an accurate synchronization between a physical network and its digital network twin (DNT), which serves as a virtual representation of the physical network. The considered network includes a set of base stations (BSs) that must allocate its limited spectrum resources to serve a set of users while also transmitting its partially observed physical network information to a cloud server to generate the DNT. Since the DNT can predict the physical network status based on its historical status, the BSs may not need to send their physical network information at each time slot, allowing them to conserve spectrum resources to serve the users. However, if the DNT does not receive the physical network information of the BSs over a large time period, the DNT's accuracy in representing the physical network may degrade. To this end, each BS must decide when to send the physical network information to the cloud server to update the DNT, while also determining the spectrum resource allocation policy for both DNT synchronization and serving the users. We formulate this resource allocation task as an optimization problem, aiming to maximize the total data rate of all users while minimizing the asynchronization between the physical network and the DNT. To address this problem, we propose a method based on the GRUs and the value decomposition network (VDN). Simulation results show that our GRU and VDN based algorithm improves the weighted sum of data rates and the similarity between the status of the DNT and the physical network by up to 28.96%, compared to a baseline method combining GRU with the independent Q learning.
Chinese: 本研究通过基站策略性地传输网络数据以更新数字孪生网络,同时分配频谱资源以最大化用户数据速率并最小化异步,采用基于GRU和VDN的方法,将性能提升高达28.96%。
English: This study optimizes the synchronization between a physical network and its digital twin by enabling base stations to strategically transmit network data for updates while allocating spectrum resources to maximize user data rates and minimize asynchrony, using a GRU and VDN-based method that improves performance by up to 28.96%.

Authors:Alexander Atanasov, Blake Bordelon, Jacob A. Zavatone-Veth, Courtney Paquette, Cengiz Pehlevan
Title: Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
Abstract:
We derive a novel deterministic equivalence for the two-point function of a random matrix resolvent. Using this result, we give a unified derivation of the performance of a wide variety of high-dimensional linear models trained with stochastic gradient descent. This includes high-dimensional linear regression, kernel regression, and random feature models. Our results include previously known asymptotics as well as novel ones.
中文: 本研究提出了随机矩阵解析函数的新确定性等价关系,并统一推导了多种高维线性模型在随机梯度下降训练下的性能表现,涵盖了已知与新颖的渐近结果。
English: This study establishes a new deterministic equivalence for random matrix resolvents and provides a unified derivation of stochastic gradient descent performance across multiple high-dimensional linear models, extending both known and new asymptotic results.

Authors:Zhao-Heng Yin, Changhao Wang, Luis Pineda, Francois Hogan, Krishna Bodduluri, Akash Sharma, Patrick Lancaster, Ishita Prasad, Mrinal Kalakrishnan, Jitendra Malik, Mike Lambeta, Tingfan Wu, Pieter Abbeel, Mustafa Mukadam
Title: DexterityGen: Foundation Controller for Unprecedented Dexterity
Abstract:
Teaching robots dexterous manipulation skills, such as tool use, presents a significant challenge. Current approaches can be broadly categorized into two strategies: human teleoperation (for imitation learning) and sim-to-real reinforcement learning. The first approach is difficult as it is hard for humans to produce safe and dexterous motions on a different embodiment without touch feedback. The second RL-based approach struggles with the domain gap and involves highly task-specific reward engineering on complex tasks. Our key insight is that RL is effective at learning low-level motion primitives, while humans excel at providing coarse motion commands for complex, long-horizon tasks. Therefore, the optimal solution might be a combination of both approaches. In this paper, we introduce DexterityGen (DexGen), which uses RL to pretrain large-scale dexterous motion primitives, such as in-hand rotation or translation. We then leverage this learned dataset to train a dexterous foundational controller. In the real world, we use human teleoperation as a prompt to the controller to produce highly dexterous behavior. We evaluate the effectiveness of DexGen in both simulation and real world, demonstrating that it is a general-purpose controller that can realize input dexterous manipulation commands and significantly improves stability by 10-100x measured as duration of holding objects across diverse tasks. Notably, with DexGen we demonstrate unprecedented dexterous skills including diverse object reorientation and dexterous tool use such as pen, syringe, and screwdriver for the first time.
中文摘要:DexGen将强化学习的底层运动技能与人类遥操作的高层指导相结合,开发出通用控制器,在复杂操作任务中显著提升了机器人的灵巧性和稳定性。
English Summary: DexGen combines reinforcement learning for low-level motion primitives with human teleoperation for high-level guidance, creating a general-purpose controller that significantly enhances robotic dexterity and stability in complex manipulation tasks.

Authors:Qingyue Yang, Jie Wang, Xing Li, Zhihai Wang, Chen Chen, Lei Chen, Xianzhi Yu, Wulong Liu, Jianye Hao, Mingxuan Yuan, Bin Li
Title: AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference
Abstract:
With the development of large language models (LLMs), efficient inference through Key-Value (KV) cache compression has attracted considerable attention, especially for long-context generation. To compress the KV cache, recent methods identify critical KV tokens through heuristic ranking with attention scores. However, these methods often struggle to accurately determine critical tokens as they neglect the \textit{temporal patterns} in attention scores, resulting in a noticeable degradation in LLM performance. To address this challenge, we propose AttentionPredictor, which is the first learning-based critical token identification approach. Specifically, AttentionPredictor learns a lightweight convolution model to capture spatiotemporal patterns and predict the next-token attention score. An appealing feature of AttentionPredictor is that it accurately predicts the attention score while consuming negligible memory. Moreover, we propose a cross-token critical cache prefetching framework that hides the token estimation time overhead to accelerate the decoding stage. By retaining most of the attention information, AttentionPredictor achieves 16$\times$ KV cache compression with comparable LLM performance, significantly outperforming the state-of-the-art.
中文总结:AttentionPredictor 提出了一种基于学习的方法,通过轻量级卷积模型预测注意力分数,实现 16 倍的 KV 缓存压缩,在保持大语言模型性能的同时显著提升了效率。
English Summary: AttentionPredictor introduces a learning-based method using a lightweight convolution model to predict attention scores and compress the KV cache by 16×, significantly improving efficiency while maintaining LLM performance.

Authors:Yunbo Long, Liming Xu, Alexandra Brintrup
Title: Evaluating Inter-Column Logical Relationships in Synthetic Tabular Data Generation
Abstract:
Current evaluations of synthetic tabular data mainly focus on how well joint distributions are modeled, often overlooking the assessment of their effectiveness in preserving realistic event sequences and coherent entity relationships across columns.This paper proposes three evaluation metrics designed to assess the preservation of logical relationships among columns in synthetic tabular data. We validate these metrics by assessing the performance of both classical and state-of-the-art generation methods on a real-world industrial dataset.Experimental results reveal that existing methods often fail to rigorously maintain logical consistency (e.g., hierarchical relationships in geography or organization) and dependencies (e.g., temporal sequences or mathematical relationships), which are crucial for preserving the fine-grained realism of real-world tabular data. Building on these insights, this study also discusses possible pathways to better capture logical relationships while modeling the distribution of synthetic tabular data.
中文摘要:本文提出三种评估指标来衡量合成表格数据中逻辑关系的保持情况,发现现有方法常无法维持真实的序列和依赖关系,并探讨了改进建模的潜在途径。
English Summary: This paper introduces three metrics to evaluate the preservation of logical relationships in synthetic tabular data, revealing that current methods often fail to maintain realistic sequences and dependencies, and suggests improvements for future models.

Authors:Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao
Title: Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning
Abstract:
Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per token latency in long context LLM decoding.
中文: 本文提出Twilight框架,通过将top-p采样融入稀疏注意力,为长上下文大语言模型实现自适应稀疏化,能在不损失精度的情况下最高剪除98%的冗余标记,显著加速自注意力计算和端到端解码过程。
English: This paper introduces Twilight, a framework that adaptively applies sparsity to long-context LLMs by integrating top-p sampling into sparse attention, achieving up to 98% token pruning and significant acceleration in both self-attention and end-to-end decoding without compromising accuracy.

Authors:Blake Bordelon, Cengiz Pehlevan
Title: Deep Linear Network Training Dynamics from Random Initialization: Data, Width, Depth, and Hyperparameter Transfer
Abstract:
We theoretically characterize gradient descent dynamics in deep linear networks trained at large width from random initialization and on large quantities of random data. Our theory captures the ``wider is better" effect of mean-field/maximum-update parameterized networks as well as hyperparameter transfer effects, which can be contrasted with the neural-tangent parameterization where optimal learning rates shift with model width. We provide asymptotic descriptions of both non-residual and residual neural networks, the latter of which enables an infinite depth limit when branches are scaled as $1/\sqrt{\text{depth}}$. We also compare training with one-pass stochastic gradient descent to the dynamics when training data are repeated at each iteration. Lastly, we show that this model recovers the accelerated power law training dynamics for power law structured data in the rich regime observed in recent works.
Chinese: 本研究从理论上分析了宽深度线性网络中的梯度下降动态,揭示了网络宽度与参数化方式对学习过程的影响,包括超参数迁移效应及在结构化数据上的加速训练特性。
English: This study theoretically analyzes gradient descent in wide deep linear networks, revealing how width and parameterization affect learning dynamics, including hyperparameter transfer and accelerated training on structured data.

Authors:Hanjun Kim, Minwoo Jung, Chiyun Noh, Sangwoo Jung, Hyunho Song, Wooseong Yang, Hyesu Jang, Ayoung Kim
Title: HeRCULES: Heterogeneous Radar Dataset in Complex Urban Environment for Multi-session Radar SLAM
Abstract:
Recently, radars have been widely featured in robotics for their robustness in challenging weather conditions. Two commonly used radar types are spinning radars and phased-array radars, each offering distinct sensor characteristics. Existing datasets typically feature only a single type of radar, leading to the development of algorithms limited to that specific kind. In this work, we highlight that combining different radar types offers complementary advantages, which can be leveraged through a heterogeneous radar dataset. Moreover, this new dataset fosters research in multi-session and multi-robot scenarios where robots are equipped with different types of radars. In this context, we introduce the HeRCULES dataset, a comprehensive, multi-modal dataset with heterogeneous radars, FMCW LiDAR, IMU, GPS, and cameras. This is the first dataset to integrate 4D radar and spinning radar alongside FMCW LiDAR, offering unparalleled localization, mapping, and place recognition capabilities. The dataset covers diverse weather and lighting conditions and a range of urban traffic scenarios, enabling a comprehensive analysis across various environments. The sequence paths with multiple revisits and ground truth pose for each sensor enhance its suitability for place recognition research. We expect the HeRCULES dataset to facilitate odometry, mapping, place recognition, and sensor fusion research. The dataset and development tools are available at https://sites.google.com/view/herculesdataset.
中文摘要:HeRCULES数据集首次整合了4D雷达、旋转雷达和FMCW激光雷达,为不同天气和光照条件下的定位、建图与位置识别研究提供了前所未有的多模态数据支持。
English Summary: The HeRCULES dataset is introduced as the first to combine 4D radar, spinning radar, and FMCW LiDAR, enabling enhanced research in localization, mapping, and place recognition across varied environmental conditions.